Use similarity metrics - Githubissues

Shuyib commented 1 year ago

Use a similarity metric such as cosine similarity: pad the sequences so that they are all the same length, then apply cosine similarity to the sequences in the Kmer columns and/or to the full FASTA sequence.
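A minimal sketch of the padding idea, using only the standard library. The integer encoding (`BASE_TO_INT`, with 0 reserved for padding) and the helper names are assumptions for illustration, not the code in the repository. Zero padding adds nothing to the dot product; it only brings both vectors to the same length.

```python
import math

# Hypothetical encoding: 0 is reserved for padding, bases map to 1..4.
BASE_TO_INT = {"A": 1, "C": 2, "G": 3, "T": 4}


def encode(seq):
    """Map each base to an integer; any other symbol (e.g. N) maps to 5."""
    return [BASE_TO_INT.get(base, 5) for base in seq.upper()]


def pad(vec, length):
    """Right-pad a vector with zeros up to the given length."""
    return vec + [0] * (length - len(vec))


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty sequences
    return dot / (norm_u * norm_v)


def padded_cosine(seq_a, seq_b):
    """Encode both sequences, pad the shorter one, then compare."""
    a, b = encode(seq_a), encode(seq_b)
    length = max(len(a), len(b))
    return cosine_similarity(pad(a, length), pad(b, length))
```

For example, `padded_cosine("ACGT", "AC")` compares `[1, 2, 3, 4]` with `[1, 2, 0, 0]`. Note that a position-wise encoding like this is sensitive to insertions and deletions, which shift every later position.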

We can build a phylogenetic tree using hierarchical clustering, or make a tree diagram with other methods. This matters because similar sequences mean we may be able to use the same therapies, e.g. antibiotics against the microbes or gene therapies on the microbes.
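As a rough illustration of the hierarchical clustering step, the sketch below performs average-linkage (UPGMA-style) agglomeration on a distance matrix and returns the merge order as nested parentheses. The function name and the toy labels and values are hypothetical; branch lengths are omitted, and in practice a library such as SciPy would likely be used instead.

```python
def average_linkage_tree(labels, dist):
    """Agglomerative clustering with average linkage (UPGMA-style).

    labels: list of sequence names.
    dist:   symmetric matrix (list of lists) of pairwise distances.
    Returns a nested-parentheses string describing the merge order.
    """
    # Each cluster is (tree string, indices of the original members).
    clusters = [(label, [i]) for i, label in enumerate(labels)]

    def cluster_distance(c1, c2):
        # Mean distance over all pairs of original members.
        pairs = [(i, j) for i in c1[1] for j in c2[1]]
        return sum(dist[i][j] for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        merged = (
            "(" + clusters[x][0] + "," + clusters[y][0] + ")",
            clusters[x][1] + clusters[y][1],
        )
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters[0][0] + ";"


# Toy similarity matrix (made-up values), converted to distances as 1 - s.
toy_labels = ["seqA", "seqB", "seqC"]
toy_similarity = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
toy_distance = [[1.0 - s for s in row] for row in toy_similarity]
toy_tree = average_linkage_tree(toy_labels, toy_distance)
```

With these toy values, `seqA` and `seqB` are merged first because they have the smallest distance, and `seqC` joins last.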

Shuyib commented 1 year ago

I'll start with this one.

Shuyib commented 1 year ago

I have done something that could help us move on to the clustering/classification/search step. For the character-level comparison, see:

https://github.com/Shuyib/Phylogenetic-tree-study/blob/master/tokenize_compare_sequences.py

[Kmer level]

How to interpret the cosine similarity scores:

0.0 - 0.2 : low similarity
0.2 - 0.4 : moderate similarity
0.4 - 0.6 : high similarity
0.6 - 0.8 : very high similarity
0.8 - 1.0 : identical sequences
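The bands above could be applied programmatically along these lines. The labels are taken from the scale as written; the boundary handling is an assumption, since the ranges overlap at their edges: here each band includes its lower bound and excludes its upper bound, except that 1.0 falls in the top band. The names `BANDS` and `interpret` are hypothetical.

```python
# Bands copied from the scale above; boundary handling is an assumption.
BANDS = [
    (0.2, "low similarity"),
    (0.4, "moderate similarity"),
    (0.6, "high similarity"),
    (0.8, "very high similarity"),
    (1.0, "identical sequences"),
]


def interpret(score):
    """Return the band label for a cosine similarity score in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    for upper, label in BANDS:
        if score < upper:
            return label
    return BANDS[-1][1]  # score == 1.0
```

Under this reading, a score on a boundary such as 0.2 is assigned to the higher band.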

a	accession_id	label_encoded	cosine_similarity_average
0	KY053297.1	20	0.484…
1	ON808422.1	29	0.487…
2	FM207545.1	5	0.510…
3	KM281506.1	12	0.613…
4	MK909923.1	24	0.483…
5	LT599799.2	23	0.575…
6	KY053299.1	21	0.487…
7	KM281495.1	7	0.610…
8	FM207520.1	4	0.528…
9	KM281503.1	10	0.575…
10	KU843843.1	15	0.597…
11	KU843844.1	16	0.594…
12	KT378441.1	13	0.578…
13	KM281505.1	11	0.614…
14	KM281498.1	9	0.588…
15	HF564650.1	6	0.559…
16	MK909924.1	25	0.549…
17	MW644614.1	27	0.480…
18	MZ749734.1	28	0.574…
19	KU843866.1	19	0.595…
20	AY062899.1	3	0.567…
21	LN871587.1	22	0.529…
22	AF177667.1	1	0.563…
23	AF177666.1	0	0.561…
24	KU843845.1	17	0.573…
25	KU843850.1	18	0.575…
26	MK909925.1	26	0.550…
27	KU843841.1	14	0.568…
28	AY062898.1	2	0.568…
29	KM281496.1	8	0.613…

kipkurui commented 1 year ago

@bonfaceonyango might have some ideas. I will have him look at it.

bonfaceonyango commented 1 year ago

Hi everyone. I have been working on alignment-free phylogenetic tree construction, where I am exploring the D2star dissimilarity metric, which incorporates normalized cosine similarity. Given two DNA sequences, the metric first extracts k-mers from each sequence and counts each k-mer's frequency in both sequences; the dot product of these counts is then used to calculate D2star. The result is a distance matrix that can be used to construct a phylogenetic tree with the Neighbour-Joining (NJ) method or UPGMA. I applied this metric to sequences of different lengths, but I believe padding the sequences could contribute to increased accuracy, as @Shuyib mentioned in the previous comment. Using k-mers with dissimilarity/similarity metrics could also be more robust than using the full sequence in terms of algorithm performance and complexity, because it reduces the dimensionality of the sequences.
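The k-mer counting and distance-matrix steps described above can be sketched as follows. This is a simplified stand-in, not D2star itself: it uses plain cosine distance on raw k-mer counts, whereas D2star additionally centers the counts by their expected values under a background model. The function names and the default `k=3` are assumptions for illustration.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_distance(c1, c2):
    """1 - cosine similarity between two k-mer count profiles."""
    # Only k-mers present in both profiles contribute to the dot product.
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    if n1 == 0 or n2 == 0:
        return 1.0  # convention chosen here for sequences shorter than k
    return 1.0 - dot / (n1 * n2)


def distance_matrix(seqs, k=3):
    """Pairwise cosine distances between the k-mer profiles of seqs."""
    profiles = [kmer_counts(s, k) for s in seqs]
    n = len(profiles)
    return [
        [cosine_distance(profiles[i], profiles[j]) for j in range(n)]
        for i in range(n)
    ]
```

Because the profiles are indexed by k-mer rather than by position, sequences of different lengths can be compared directly without padding. The resulting matrix is the kind of input a Neighbour-Joining or UPGMA implementation expects.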

bonfaceonyango commented 1 year ago

I have gone through the cosine scores, and it is great to see that you have come up with these. Having the cosine similarity categories is a significant first step. I'll work on this using cosine similarity and k-mers followed by clustering, and later on we can probably look into applying other metrics.

Shuyib commented 1 year ago

Okay, @bonfaceonyango. Kmer-level cosine similarity is also available in the updated code; it's just a matter of running it. I appreciate all the ideas you have, but I am taking a break for the next week. When I am back, I will give better feedback.

bonfaceonyango commented 1 year ago

That is great, @Shuyib. Let me explore it and see what we can come up with. We shall keep in touch once you are back.

Shuyib / Phylogenetic-tree-study

Use similarity metrics #54
