Shared hashes - Githubissues

elianapc commented 2 years ago

Hello, is there a threshold for the shared hashes? I know 1000 is the highest value, but is there a threshold for accepting or discarding a match based on the number of shared hashes?

Thanks!

SmalJonni commented 2 years ago

Hello,

I assume you are referring to the search-by-sequence function, which uses Mash.

Sadly, there is no fixed threshold on the number of shared hashes at which we distinguish a match from a non-match.

As described in the original Mash publication, the p-value can be used to gauge whether the number of shared hashes is higher than expected for random sequences. The p-value threshold can then be adjusted to make the search more sensitive or more specific. In practice, the number of plasmids reported with low p-values can be high, because your input may already have been pre-selected for plasmids and because we often encounter similar subsequences, which results in multiple hits. One possible solution is to take the proposed candidates and follow up with less sensitive methods. For example, you could look into BLAST alignments.

Regards

elianapc commented 2 years ago

I am referring to the column "shared hashes" that I get when I use PLSDB to find plasmids in a genome. I do not understand the meaning of the hashes: why can I get a high "shared_hashes" value with a low identity percentage, or vice versa? Which would be the best matches? Should I consider only the number of hashes, or the identity? Thanks!

VGalata commented 2 years ago

Hi @elianapc,

The identity is estimated from the fraction of shared k-mers, so you should generally see lower identity values for a lower number of shared hashes. But you have to keep in mind that some plasmids are rather small, and their Mash sketch might contain fewer than 1000 k-mers. If all of those k-mers are found in the submitted FASTA, then the reported identity will be 1.0 even though the number of shared hashes is below the general sketch size (i.e. 1000). I admit that this might be confusing for the user.
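To make the small-plasmid case concrete, here is a minimal sketch of how an identity estimate can follow from the fraction of shared hashes. It assumes the containment-style formula identity = (shared / sketch size)^(1/k), which is my reading of how mash screen reports identity, and k = 21 (Mash's default k-mer size). The k used by PLSDB may differ, and the function name `screen_identity` is made up for illustration.

```python
def screen_identity(shared: int, sketch_size: int, k: int = 21) -> float:
    """Estimate identity from the fraction of shared hashes.

    Assumed formula: (shared / sketch_size) ** (1 / k).
    k = 21 is Mash's default k-mer size; PLSDB may use another value.
    """
    if sketch_size <= 0:
        raise ValueError("sketch_size must be positive")
    if not 0 <= shared <= sketch_size:
        raise ValueError("shared must be between 0 and sketch_size")
    if shared == 0:
        return 0.0
    return (shared / sketch_size) ** (1.0 / k)


# A small plasmid whose sketch holds only 734 hashes, all of them found:
# the identity is 1.0 although "shared hashes" is well below 1000.
print(screen_identity(734, 734))

# The same 734 shared hashes against a full 1000-hash sketch
# give an identity below 1.0.
print(screen_identity(734, 1000))
```

Under these assumptions the same shared-hash count can correspond to quite different identities, which is why the count is hard to interpret without the sketch size.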

@SmalJonni Maybe the output could also include the sketch size of the plasmids? It is contained in the original mash screen output (e.g., 734/1000), and one could parse and save it into a separate column.
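A rough sketch of that parsing step, assuming the tab-separated mash screen column order (identity, shared-hashes, median-multiplicity, p-value, query-ID, query-comment). The function names, the dictionary keys and the example line are all made up for illustration and are not taken from the PLSDB code.

```python
from typing import Dict, Tuple, Union


def split_shared_hashes(field: str) -> Tuple[int, int]:
    """Split a 'shared/sketch_size' field such as '734/1000' into two ints."""
    shared, sketch_size = field.split("/")
    return int(shared), int(sketch_size)


def parse_screen_line(line: str) -> Dict[str, Union[str, int, float]]:
    """Parse one tab-separated mash screen line (assumed column order)."""
    cols = line.rstrip("\n").split("\t")
    identity, shared_field, multiplicity, p_value, query_id = cols[:5]
    comment = cols[5] if len(cols) > 5 else ""
    shared, sketch_size = split_shared_hashes(shared_field)
    return {
        "identity": float(identity),
        "shared_hashes": shared,
        "sketch_size": sketch_size,  # the proposed extra column
        "median_multiplicity": int(multiplicity),
        "p_value": float(p_value),
        "query_id": query_id,
        "query_comment": comment,
    }


# Hypothetical output line, not real PLSDB data.
example = "0.9854\t734/1000\t1\t0\tplasmid_A\thypothetical plasmid"
print(parse_screen_line(example))
```

With the sketch size in its own column, a user can tell a small plasmid with all hashes matched apart from a large plasmid with only a partial match.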

Edit: For results filtering, you could use the estimated identity and the p-value. However, you should be aware that mash screen is not a metagenomic profiler and its output might be highly redundant (see also the last paragraph in the mash screen article). But it should give you an idea which plasmids from the database might be present in your sample.
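As an illustration of such filtering, here is a minimal sketch that keeps hits above an identity cutoff and below a p-value cutoff. The cutoffs (0.99 and 0.1), the field names and the example hits are arbitrary placeholders, not recommended or official values, and the filter does nothing about the redundancy mentioned above.

```python
from typing import Dict, List


def filter_hits(
    hits: List[Dict[str, float]],
    min_identity: float = 0.99,  # placeholder cutoff, adjust to your data
    max_p_value: float = 0.1,    # placeholder cutoff, adjust to your data
) -> List[Dict[str, float]]:
    """Keep hits passing both cutoffs, sorted by decreasing identity."""
    kept = [
        hit
        for hit in hits
        if hit["identity"] >= min_identity and hit["p_value"] <= max_p_value
    ]
    return sorted(kept, key=lambda hit: hit["identity"], reverse=True)


# Made-up hits for illustration only.
hits = [
    {"identity": 1.0, "p_value": 0.0},
    {"identity": 0.95, "p_value": 0.0},
    {"identity": 0.995, "p_value": 0.5},
    {"identity": 0.992, "p_value": 1e-20},
]
for hit in filter_hits(hits):
    print(hit)
```

The remaining candidates may still overlap heavily with each other, so they are best treated as a shortlist for closer inspection.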

Best, Valentina

CCB-SB / plsdb

Shared hashes #17