czbiohub-sf / orpheum

Orpheum (previously called, and published as, sencha) is a Python package for directly translating RNA-seq reads into coding protein sequence.
MIT License

Consider using "sourmash search --containment" for "translate" #86

Open olgabot opened 4 years ago

olgabot commented 4 years ago

Currently, sencha translate uses a simple check: whether 100% of the k-mers from a reading frame match the reference proteome. But a "Franken k-mer" situation can happen where the reading frame has a 100% match, yet its k-mers all come from different genes. Using sourmash search --containment would only search for "consecutive" k-mers that all appear in a single gene (or maybe a family of genes??) and would be an improvement over the current method.
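To make the "Franken k-mer" problem concrete, here is a minimal pure-Python sketch. It does not use the sencha or sourmash code; the function names, the toy sequences, and k=3 are all assumptions chosen for illustration. It compares a pooled k-mer membership check against a per-gene containment score, where containment is taken to mean the fraction of the query's k-mers found in one gene.

```python
def kmers(seq, k):
    """Return the set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def pooled_match_fraction(query, reference, k):
    """Fraction of query k-mers found anywhere in the pooled reference."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return 0.0
    pooled = set().union(*(kmers(seq, k) for seq in reference.values()))
    return len(query_kmers & pooled) / len(query_kmers)


def best_containment(query, reference, k):
    """Return (gene, score) for the single gene containing most query k-mers."""
    query_kmers = kmers(query, k)
    if not query_kmers or not reference:
        return None, 0.0
    scores = {
        name: len(query_kmers & kmers(seq, k)) / len(query_kmers)
        for name, seq in reference.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical toy proteome: two unrelated "genes" (not real proteins).
reference = {"gene_a": "MKTLLV", "gene_b": "LLVAGE"}
# A chimeric query whose 3-mers are split between gene_a and gene_b.
query = "KTLLVAG"

pooled_fraction = pooled_match_fraction(query, reference, k=3)
best_gene, best_score = best_containment(query, reference, k=3)
print(pooled_fraction)  # 1.0: every k-mer exists somewhere in the proteome
print(best_score)       # 0.6: no single gene explains the whole query
```

The pooled check accepts the chimeric query, while the per-gene score flags it. Note that this set-based score ignores k-mer order, so it only enforces "same gene", not "consecutive".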


Thanks to @bluegenes for the idea!

olgabot commented 4 years ago

Maybe use sourmash lca search --containment by hacking the Least Common Ancestor (LCA) code to use gene/protein families instead of individual genes, so that the matches could be at the gene-family level.
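A rough sketch of what matching at the gene-family level could look like, again in plain Python rather than the sourmash LCA code. The gene-to-family mapping, the function names, and the toy sequences are hypothetical. The k-mers of all genes in a family are pooled, and the query is scored against each family.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def family_containment(query, reference, gene_to_family, k):
    """Fraction of query k-mers contained in each gene family's pooled k-mers."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return {}
    # Pool k-mers per family instead of per individual gene.
    family_kmers = defaultdict(set)
    for gene, seq in reference.items():
        family_kmers[gene_to_family[gene]] |= kmers(seq, k)
    return {
        family: len(query_kmers & pooled) / len(query_kmers)
        for family, pooled in family_kmers.items()
    }


# Hypothetical toy data: gene_a and gene_b are assumed to share a family.
reference = {"gene_a": "MKTLLV", "gene_b": "LLVAGE", "gene_c": "PQRSTW"}
gene_to_family = {"gene_a": "fam1", "gene_b": "fam1", "gene_c": "fam2"}

family_scores = family_containment("KTLLVAG", reference, gene_to_family, k=3)
print(family_scores)
```

With this grouping, a query that spans two members of the same family scores fully against that family, while unrelated families score zero. Whether that is desirable depends on how the families are defined.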