CCB-SB / plsdb

PLSDB pipeline to collect bacterial plasmids from NCBI
https://ccb-microbe.cs.uni-saarland.de/plsdb/

Reasonable parameters of Mash screen in containment analysis #2

Closed liangjinsong closed 5 years ago

liangjinsong commented 5 years ago

Excited to see the PLSDB you developed, because it helps with what I am doing. I want to identify the plasmid contigs in metagenomic assemblies using the Mash screen command on my local server. I feel that the default Min. identity (0.99) in Mash screen is much higher than the thresholds used in other similarity analyses. I would like help with the two issues below.

  1. How should I select appropriate Mash screen parameters for the plasmid containment analysis? For example, would a lower Min. identity (such as 0.95) be acceptable?
  2. The Mash screen output file does not show which sequence in the input file is highly similar to sequences in PLSDB. How can I add the input sequence information to the Mash screen output?

Thanks~

VGalata commented 5 years ago

Dear @liangjinsong,

I am happy to hear that you find our resource useful! :)

Regarding your questions:

  1. Which parameters are appropriate depends on your input. If the reference sequences are contained in your sample and you want only high-confidence hits, then a very high identity threshold and a low p-value threshold would work. However, if your sample contains only somewhat similar sequences, strict thresholds may not give you any hits at all and you would need to relax them. So, there is no other way than to try different thresholds and inspect the results to figure out the parameters suitable for your analysis.
  2. mash screen does not provide this information. The containment analysis is done by processing the input data as a whole, i.e. the presence of k-mers is checked for the complete sample and not for each input sequence separately. If you have long-read data or contigs, I would suggest that you download the PLSDB data and perform a local analysis using mash sketch -i and mash dist. The -i option will allow you to create a sketch for each input sequence, and you can then compare each of them to the reference sequences. I plan to add this option to PLSDB in the next update; until then, this kind of analysis has to be done locally.
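
A minimal sketch of the local analysis described above, combined with the "try different thresholds" advice from point 1. It assumes the tab-separated `mash dist` output layout (reference ID, query ID, distance, p-value, shared hashes). The file names in the comments (`contigs.fasta`, `plsdb.msh`, `dist.tsv`), the helper names, the threshold values, and the example rows are all hypothetical and only serve as illustration. Note that `mash dist` reports a distance rather than the identity shown by `mash screen`, so the cutoffs here are not directly comparable to the Min. identity setting.

```python
import csv
import io
from collections import defaultdict

# Commands that would produce the input (not run here; file names are placeholders):
#   mash sketch -i -o contigs contigs.fasta      # one sketch per input sequence
#   mash dist plsdb.msh contigs.msh > dist.tsv   # references vs. each contig


def parse_mash_dist(handle):
    """Yield one dict per row of tab-separated `mash dist` output."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 5:
            continue  # skip blank or malformed lines
        reference, query, distance, p_value, shared = row[:5]
        yield {
            "reference": reference,
            "query": query,
            "distance": float(distance),
            "p_value": float(p_value),
            "shared": shared,
        }


def hits_per_contig(rows, max_distance, max_p_value=0.1):
    """Group rows passing both cutoffs by query (contig), closest hit first."""
    hits = defaultdict(list)
    for row in rows:
        if row["distance"] <= max_distance and row["p_value"] <= max_p_value:
            hits[row["query"]].append(row)
    for query in hits:
        hits[query].sort(key=lambda r: r["distance"])
    return dict(hits)


# Invented rows for illustration only; real input would come from dist.tsv.
EXAMPLE = (
    "plasmid_A\tcontig_1\t0.004\t0\t850/1000\n"
    "plasmid_B\tcontig_1\t0.03\t1e-50\t300/1000\n"
    "plasmid_A\tcontig_2\t0.08\t1e-10\t60/1000\n"
    "plasmid_B\tcontig_3\t1\t1\t0/1000\n"
)

rows = list(parse_mash_dist(io.StringIO(EXAMPLE)))

# Sweep a few (arbitrary) distance cutoffs and inspect how the hit set changes.
for max_distance in (0.01, 0.05, 0.1):
    hits = hits_per_contig(rows, max_distance)
    print(f"max distance {max_distance}: {len(hits)} contig(s) with hits")
    for contig, contig_hits in hits.items():
        best = contig_hits[0]
        print(f"  {contig}: best hit {best['reference']} (distance {best['distance']})")
```

For real data, replacing `io.StringIO(EXAMPLE)` with `open("dist.tsv")` would read the `mash dist` output directly. Because the result is keyed by contig, it shows which input sequence matched which reference.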

liangjinsong commented 5 years ago

Dear VGalata, Thank you very much for your detailed response! My questions are now completely solved. Thanks again~