Currently, `sencha translate` uses a simple test: do 100% of the k-mers from the reading frame match the reference proteome? But a "Franken k-mer" situation can happen, where the reading frame has a 100% match but the k-mers all come from different genes. Using `sourmash search --containment` would only search for "consecutive" k-mers that all appear in a single gene (or maybe a family of genes??) and would be an improvement over the current method.
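To make the "Franken k-mer" problem concrete, here is a minimal pure-Python sketch. It does not use the sencha or sourmash code; the gene names, sequences, `K = 3`, and the helper functions `pooled_match_fraction` and `best_gene_containment` are all made up for illustration. It compares a pooled "is every k-mer somewhere in the proteome" test against a per-gene containment score.

```python
def kmers(seq, k):
    """Return the set of all overlapping k-mers in a sequence (assumes len(seq) >= k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def pooled_match_fraction(frame, reference, k):
    """Fraction of the frame's k-mers found anywhere in the pooled reference."""
    pooled = set().union(*(kmers(seq, k) for seq in reference.values()))
    query = kmers(frame, k)
    return len(query & pooled) / len(query)


def best_gene_containment(frame, reference, k):
    """Best single-gene containment: fraction of the frame's k-mers found in one gene."""
    query = kmers(frame, k)
    scores = {
        gene: len(query & kmers(seq, k)) / len(query)
        for gene, seq in reference.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


K = 3  # toy k-mer size, chosen only to keep the example small

# Hypothetical two-gene reference proteome.
reference = {
    "geneA": "MKTAYLL",
    "geneB": "GGTAYQR",
}

# Chimeric reading frame: starts like geneA, ends like geneB.
frame = "MKTAYQR"

pooled = pooled_match_fraction(frame, reference, K)
best_gene, containment = best_gene_containment(frame, reference, K)

print(pooled)                 # every k-mer is somewhere in the proteome
print(best_gene, containment) # but no single gene contains them all
```

In this toy case the pooled test reports a perfect match, while the best single gene contains only 3 of the frame's 5 k-mers, so a per-gene containment threshold would flag the frame as suspicious.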
Maybe use `sourmash lca search --containment`, hacking the Least Common Ancestor (LCA) code to use gene/protein families instead of individual genes, so that matches could be made at the per-gene-family level.
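The gene-family idea can be sketched the same way. This is plain Python, not the sourmash LCA code: the `gene_to_family` mapping, the sequences, and the `family_containment` helper are assumptions for illustration only. The k-mers of all genes in a family are pooled, and containment of the reading frame is scored per family rather than per gene.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all overlapping k-mers in a sequence (assumes len(seq) >= k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def family_containment(frame, genes, gene_to_family, k):
    """Containment of the frame's k-mers in each family's pooled k-mer set."""
    family_kmers = defaultdict(set)
    for gene, seq in genes.items():
        family_kmers[gene_to_family[gene]] |= kmers(seq, k)
    query = kmers(frame, k)
    return {fam: len(query & ks) / len(query) for fam, ks in family_kmers.items()}


K = 3  # toy k-mer size

# Hypothetical genes and a hypothetical gene -> family assignment.
genes = {
    "geneA1": "MKTAYLL",
    "geneA2": "GGTAYQR",
    "geneB": "WWPCDEH",
}
gene_to_family = {"geneA1": "famA", "geneA2": "famA", "geneB": "famB"}

frame = "MKTAYQR"
scores = family_containment(frame, genes, gene_to_family, K)
print(scores)
```

Here the frame is fully contained in `famA` even though neither `geneA1` nor `geneA2` contains all of its k-mers alone, while the unrelated `famB` scores zero. Whether pooling at the family level is too permissive would depend on how the families are defined.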
Thanks to @bluegenes for the idea!