Closed Shuyib closed 1 year ago
I'll start with this one.
I have written a script that could help us move on to clustering, classification, and search. See
https://github.com/Shuyib/Phylogenetic-tree-study/blob/master/tokenize_compare_sequences.py for the character-level comparison.
[Kmer level]
- 0.0 - 0.2: low similarity
- 0.2 - 0.4: moderate similarity
- 0.4 - 0.6: high similarity
- 0.6 - 0.8: very high similarity
- 0.8 - 1.0: identical sequences

index | accession_id | label_encoded | cosine_similarity_average |
---|---|---|---|
0 | KY053297.1 | 20 | 0.484… |
1 | ON808422.1 | 29 | 0.487… |
2 | FM207545.1 | 5 | 0.510… |
3 | KM281506.1 | 12 | 0.613… |
4 | MK909923.1 | 24 | 0.483… |
5 | LT599799.2 | 23 | 0.575… |
6 | KY053299.1 | 21 | 0.487… |
7 | KM281495.1 | 7 | 0.610… |
8 | FM207520.1 | 4 | 0.528… |
9 | KM281503.1 | 10 | 0.575… |
10 | KU843843.1 | 15 | 0.597… |
11 | KU843844.1 | 16 | 0.594… |
12 | KT378441.1 | 13 | 0.578… |
13 | KM281505.1 | 11 | 0.614… |
14 | KM281498.1 | 9 | 0.588… |
15 | HF564650.1 | 6 | 0.559… |
16 | MK909924.1 | 25 | 0.549… |
17 | MW644614.1 | 27 | 0.480… |
18 | MZ749734.1 | 28 | 0.574… |
19 | KU843866.1 | 19 | 0.595… |
20 | AY062899.1 | 3 | 0.567… |
21 | LN871587.1 | 22 | 0.529… |
22 | AF177667.1 | 1 | 0.563… |
23 | AF177666.1 | 0 | 0.561… |
24 | KU843845.1 | 17 | 0.573… |
25 | KU843850.1 | 18 | 0.575… |
26 | MK909925.1 | 26 | 0.550… |
27 | KU843841.1 | 14 | 0.568… |
28 | AY062898.1 | 2 | 0.568… |
29 | KM281496.1 | 8 | 0.613… |
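
To make the categories above easier to apply to the `cosine_similarity_average` column, here is a minimal sketch of a lookup function. The function name `similarity_category` is made up for illustration, and since neighbouring ranges share an edge (e.g. 0.2), I am assuming each range includes its lower bound. The two example scores are the truncated values shown in the table.

```python
def similarity_category(score: float) -> str:
    """Map a cosine similarity in [0, 1] to the category labels used in this thread."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    # Assumption: each range includes its lower bound; 1.0 falls in the last range.
    bins = [
        (0.2, "low similarity"),
        (0.4, "moderate similarity"),
        (0.6, "high similarity"),
        (0.8, "very high similarity"),
    ]
    for upper, label in bins:
        if score < upper:
            return label
    return "identical sequences"


print(similarity_category(0.484))  # high similarity
print(similarity_category(0.613))  # very high similarity
```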
@bonfaceonyango might have some ideas. I will have him look at it.
Hi everyone. I have been working on alignment-free phylogenetic tree construction, where I am exploring the D2star dissimilarity metric, which incorporates a normalized cosine similarity. Given two DNA sequences, the metric first extracts the kmers from each sequence and counts the frequency of each kmer in both sequences; the dot product of these counts is then used to calculate D2star. Doing this for every pair of sequences gives a distance matrix that can be used to construct a phylogenetic tree with the Neighbour-Joining (NJ) method or UPGMA. Although I have used this metric on sequences of different lengths, I believe padding the sequences could contribute to increased accuracy, as mentioned by @Shuyib in the previous comment. Using kmers with dissimilarity/similarity metrics could also be more robust than comparing full sequences in terms of algorithm performance and complexity, because it reduces the dimensionality of the sequences.
I have gone through the cosine scores, and it is really great to see that you have come up with these. Since you have already generated the cosine similarity categories, this is a significant first step. I'll explore this further using cosine similarity and kmers, followed by clustering, and later on we can probably look into applying other metrics.
Okay, @bonfaceonyango. Kmer-level cosine similarity is also available in the updated code; it's just a matter of running it. I appreciate all the ideas you have, but I am taking a break for the next week. When I am back, I will give better feedback.
That is great, @Shuyib. Let me explore it and see what we can come up with. We shall keep in touch once you are back.
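
As a rough illustration of the kmer counting and distance matrix steps described above, here is a minimal sketch. Note that it uses plain cosine distance on raw kmer counts, not D2star itself (D2star additionally normalizes the counts against expected kmer frequencies, which I leave out here). The function names, the toy sequences, and the choice of `k=3` are all assumptions for illustration only.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every overlapping kmer of length k in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two kmer count vectors."""
    # Counter returns 0 for kmers that are missing from b.
    dot = sum(count * b[kmer] for kmer, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def distance_matrix(seqs, k=3):
    """Pairwise cosine distance (1 - similarity) keyed by (name, name)."""
    profiles = {name: kmer_counts(s, k) for name, s in seqs.items()}
    return {
        (x, y): 1.0 - cosine_similarity(profiles[x], profiles[y])
        for x in profiles
        for y in profiles
    }


# Hypothetical toy sequences, not taken from the dataset in this issue.
seqs = {"seq_a": "ACGTACGT", "seq_b": "ACGTACGA", "seq_c": "TTTTGGGG"}
dm = distance_matrix(seqs, k=3)
print(round(dm[("seq_a", "seq_b")], 3))
```

Because the vectors are kmer counts rather than raw characters, sequences of different lengths can be compared without padding.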
Use a similarity metric, for example cosine similarity: pad the sequences so that they are all the same length, then apply cosine similarity to the sequences in the Kmer columns and/or the full FASTA sequence.
We can make a phylogenetic tree using hierarchical clustering, or make a tree diagram with other methods. This matters because similar sequences mean we could use the same therapies, e.g. antibiotics or gene therapies, on the microbes.
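
A minimal sketch of the padding idea at the character level, assuming a simple integer code per base with 0 reserved for padding. The encoding, the function names, and the toy sequences are assumptions for illustration; this is not necessarily how `tokenize_compare_sequences.py` does it.

```python
from math import sqrt

# Assumption: one integer code per base; 0 is reserved for padding.
BASE_CODES = {"A": 1, "C": 2, "G": 3, "T": 4}


def encode_and_pad(seq, length):
    """Encode bases as integers and right-pad with zeros up to `length`."""
    # Unknown symbols (e.g. N) are also mapped to 0 in this sketch.
    codes = [BASE_CODES.get(base, 0) for base in seq.upper()]
    return codes + [0] * (length - len(codes))


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy sequences of different lengths.
seqs = ["ACGTAC", "ACGT"]
max_len = max(len(s) for s in seqs)
padded = [encode_and_pad(s, max_len) for s in seqs]
score = cosine_similarity(padded[0], padded[1])
print(padded[1], round(score, 3))
```

One caveat with this sketch: integer codes impose an arbitrary ordering on the bases (T counts as "larger" than A), so a one-hot encoding may be a better fit if we keep the character-level comparison.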
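
To show what the hierarchical clustering step could look like, here is a small pure-Python sketch of average-linkage clustering (UPGMA) on a precomputed distance table. The function names, sequence labels, and distances are all made up for illustration, and the output is only the tree topology without branch lengths; in practice a library such as SciPy or Biopython would probably be the better choice.

```python
from itertools import combinations


def upgma(names, dist):
    """Average-linkage clustering; returns the tree as nested tuples.

    names: list of leaf labels (strings).
    dist: dict mapping (label, label) pairs to a distance.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(sizes) > 1:
        # Pick the closest pair among the current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted average of the distances from its two children.
        for c in sizes:
            if c != a and c != b:
                d[frozenset((merged, c))] = (
                    sizes[a] * d[frozenset((a, c))]
                    + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))


def to_newick(tree):
    """Render the nested-tuple tree as a Newick string without branch lengths."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(to_newick(child) for child in tree) + ")"


# Hypothetical labels and distances, not results from this issue.
names = ["s1", "s2", "s3", "s4"]
dist = {
    ("s1", "s2"): 0.1, ("s1", "s3"): 0.6, ("s1", "s4"): 0.7,
    ("s2", "s3"): 0.5, ("s2", "s4"): 0.7, ("s3", "s4"): 0.2,
}
tree = upgma(names, dist)
print(to_newick(tree) + ";")  # ((s1,s2),(s3,s4));
```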