aitgon / vtam

Refine taxassign algo #22

Open meglecz opened 3 years ago

meglecz commented 3 years ago

At present, every reference sequence that is among the best hits is used, irrespective of the taxonomic resolution of its annotation: some references are identified to species level, others only to a higher level. This can reduce the taxonomic resolution of the assignment. For example, if we have 2 hits at 97% identity, where one reference sequence is identified to species but the other only to family, the variant will be assigned to the family.
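To make the example concrete, here is a minimal sketch of the effect. It assumes the assignment behaves like a "deepest taxon shared by all selected hits" rule; the function `lowest_common_taxon`, the reduced rank list, and the two lineages are illustrative assumptions, not VTAM's actual code or data.

```python
# Ranks from least to most resolved (shortened for the illustration).
RANKS = ["family", "genus", "species"]

def lowest_common_taxon(lineages):
    """Return (rank, name) of the deepest rank on which all lineages agree."""
    result = None
    for rank in RANKS:
        names = {lineage.get(rank) for lineage in lineages}
        # Stop as soon as one hit lacks this rank or the hits disagree.
        if len(names) == 1 and None not in names:
            result = (rank, names.pop())
        else:
            break
    return result

# Two hypothetical hits at 97% identity: one identified to species,
# the other only to family.
hits = [
    {"family": "Baetidae", "genus": "Baetis", "species": "Baetis rhodani"},
    {"family": "Baetidae"},
]

assignment = lowest_common_taxon(hits)
print(assignment)
```

Under this assumption the poorly resolved reference drags the result up to the family, even though the other hit carries a species name.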

I suggest that users should be able to set the minimum resolution of the reference sequences for each % identity threshold. It could be something like this: 100%: species, 97%: genus, 95%: family, 90%: order, 85%: class, 80%: phylum.
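One possible way to express such a user setting is a small lookup table. This is only a sketch: the names `MIN_RESOLUTION_BY_IDENTITY` and `min_resolution_for` are hypothetical, and the rule that an identity falling between two thresholds uses the nearest threshold below it is an assumption that the proposal does not specify.

```python
# Hypothetical user setting: (% identity threshold, minimum reference resolution),
# sorted from the highest threshold to the lowest.
MIN_RESOLUTION_BY_IDENTITY = [
    (100.0, "species"),
    (97.0, "genus"),
    (95.0, "family"),
    (90.0, "order"),
    (85.0, "class"),
    (80.0, "phylum"),
]

def min_resolution_for(identity):
    """Return the minimum rank required of references at this % identity.

    Assumption: an identity between two thresholds uses the nearest
    threshold below it. Returns None below the lowest threshold.
    """
    for threshold, rank in MIN_RESOLUTION_BY_IDENTITY:
        if identity >= threshold:
            return rank
    return None

for value in (100.0, 98.5, 95.0, 79.0):
    print(value, min_resolution_for(value))
```

Keeping the thresholds in a plain list would let users override them from a parameter file without touching the selection logic.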

I have already made a taxonomy file with an additional column that contains the resolution index: 8: species, 7: genus, 6: family, 5: order, 4: class, 3: phylum, 2: kingdom, 1: superkingdom. For other levels the index is a non-integer, e.g. 7.5 for subgenus. This greatly simplifies the selection of the reference sequences, since it becomes a simple numeric comparison.
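With a numeric index, selecting references reduces to a single comparison. The sketch below assumes each hit is a dictionary carrying the index in a `resolution` field; that field name, the `filter_references` helper, and the sample hits are hypothetical and do not reflect the actual column name or file layout.

```python
# Resolution index of the main ranks, as listed above.
RESOLUTION_INDEX = {
    "superkingdom": 1,
    "kingdom": 2,
    "phylum": 3,
    "class": 4,
    "order": 5,
    "family": 6,
    "genus": 7,
    "species": 8,
}

def filter_references(hits, min_rank):
    """Keep only hits whose reference is resolved at least to min_rank."""
    min_index = RESOLUTION_INDEX[min_rank]
    return [hit for hit in hits if hit["resolution"] >= min_index]

# Hypothetical hits: resolved to species (8), family (6) and subgenus (7.5).
hits = [
    {"ref": "seq1", "resolution": 8},
    {"ref": "seq2", "resolution": 6},
    {"ref": "seq3", "resolution": 7.5},
]

kept = filter_references(hits, "genus")
print([hit["ref"] for hit in kept])
```

Non-integer indices such as 7.5 need no special handling: a subgenus-level reference passes a genus-level filter but not a species-level one.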