gamcil / clinker

Gene cluster comparison figure generator
MIT License

protein groups and similarity #90

Closed: galacmr closed this issue 1 year ago

galacmr commented 1 year ago

I am struggling to understand how clinker decides to group proteins and would appreciate any clarification.

I used cblaster to search with a cassette I am interested in, then used clinker to draw the similar cassettes plus their surrounding genes. According to the cblaster search, the proteins in the found cassettes should be highly similar to those in the search cassette, but that is not what clinker draws. See 05_E-Cluster1348 below.

[Screenshot: Screen Shot 2022-10-14 at 11 05 34 AM]

The cblaster results say that the group 0 protein in 05_E-Cluster1348 should have 80% identity to the group 0 protein in the search cassette, but it isn't even colored as though it belongs to the same group. The other proteins in the 05_E-Cluster1348 cassette range from 50-75% identity to the search cassette in the cblaster results, but clinker does not assign them to a group either.

I looked at the similarity matrix that clinker makes and I don't see anything there that explains this.

Attachment: rep_seqs_most_abundant_similarity_matrix.xlsx

How is clinker grouping the proteins? Why aren't the proteins in 05_E-Cluster1348 showing similarity to the search cassette? I am running it with the default identity threshold of 0.3. I tried lowering it to 0.2 but only got an increase in secondary connections. Thank you for any insight you can provide.

galacmr commented 1 year ago

I figured out what was causing the problem: it is not clinker but cblaster. It looks like the extract_clusters module in cblaster wrote CDS annotations that were off by a single bp at the 3' end in the gbk files. As a result, some genes did not start in the correct place and their translations were wrong.
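To illustrate why a single-bp shift is enough to break the similarity, here is a minimal sketch in plain Python. The toy sequence, the `translate` and `naive_identity` helpers, and the ungapped position-by-position identity are all assumptions made for illustration; they are not taken from cblaster or clinker, which use proper alignments.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate every complete codon from position 0; '*' marks a stop."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def naive_identity(a, b):
    """Crude stand-in for alignment identity: ungapped matches / longer length."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


# Hypothetical toy CDS, not from the issue's data.
cds = "ATGGCCAAGCTGACCTAA"
correct = translate(cds)      # annotation starts at the right base
shifted = translate(cds[1:])  # annotation off by one bp

print(correct, shifted, naive_identity(correct, shifted))
```

In this toy case the two translations share no residues at all, so the shifted protein would fall far below an identity threshold of 0.3, or of 0.2.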

gamcil commented 1 year ago

Ah, thank you for the information @galacmr, I will fix the issue in cblaster. The grouping is done purely by protein similarity, so a bunch of wrong translations would definitely mess it up.
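As a rough picture of what grouping purely by protein similarity can look like, the sketch below links any two proteins whose pairwise identity meets a threshold and reports the connected components as groups. This is an assumption for illustration only: the function name `group_by_identity`, the protein names, the identity values, and the use of `>=` are all hypothetical, and clinker's actual grouping code may differ.

```python
def group_by_identity(names, identities, threshold=0.3):
    """Group proteins into connected components of links with identity >= threshold.

    names:      list of protein names
    identities: dict mapping (name_a, name_b) -> identity in [0, 1]
    """
    parent = {name: name for name in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), identity in identities.items():
        if identity >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical values: a correctly translated hit vs. a mistranslated one.
names = ["query_0", "hit_0", "hit_0_shifted"]
identities = {
    ("query_0", "hit_0"): 0.80,
    ("query_0", "hit_0_shifted"): 0.05,
    ("hit_0", "hit_0_shifted"): 0.04,
}
print(group_by_identity(names, identities))
```

With these made-up numbers the mistranslated protein ends up alone in its own group, which matches the uncolored genes described in the original report.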