CCB-SB / plsdb

PLSDB pipeline to collect bacterial plasmids from NCBI
https://ccb-microbe.cs.uni-saarland.de/plsdb/

Small false positive plasmids #11

Closed Mishmash-su closed 2 years ago

Mishmash-su commented 3 years ago

Hi,

I'd like to point to two examples of potentially false positive plasmid sequences included:

- MT230289.1, which is 312 bp and seems to have the same name as MT230271.1, which is a 54 kb plasmid
- MT230381.1, which is 821 bp and also has the same name as MT230371.1, a 42 kb plasmid

I assume that if I want to exclude these from future analyses, I would need to download the FASTA sequences from the BLAST database and re-sketch them with Mash?

VGalata commented 3 years ago

Dear @Mishmash-su,

Thank you for reporting this issue!

Very short records should be removed in the next update, once a minimum length cutoff has been implemented (see issue #8). NZ_MT230289.1 would then be removed. However, assuming that we set the cutoff below 800 bp (as suggested in the linked issue), NZ_MT230381.1 would pass the length filtering step...

I think that, to avoid such cases, we would need to look at records which come from the same BioSample and have the same description. If one of them is rather short and is contained in the longer one, the short one could be excluded as a putative artifact. However, we would need to evaluate this strategy first. @SmalJonni: Could you consider testing that for the next update?
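The check described above could be sketched roughly as follows. This is only an illustration and not the code used in the PLSDB pipeline: the record layout (dicts with `id`, `biosample`, `description`, `sequence`), the function names, the `max_short_len` threshold, and the use of exact substring matching (on either strand) are all assumptions. A real implementation would probably rely on alignment and would also need to handle matches that span the origin of a circular plasmid.

```python
from collections import defaultdict

# Translation table for complementing DNA (plain A/C/G/T only; an assumption).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_contained_records(records, max_short_len=1000):
    """Flag short records that are contained in a longer record.

    Records are grouped by (biosample, description). Within a group, a
    record no longer than `max_short_len` (hypothetical threshold) is
    flagged if its sequence, or its reverse complement, occurs as an exact
    substring of a strictly longer record in the same group.

    Returns a list of (short_id, long_id) pairs.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["biosample"], rec["description"])].append(rec)

    flagged = []
    for group in groups.values():
        # Longest first, so every candidate "container" precedes the short record.
        group.sort(key=lambda r: len(r["sequence"]), reverse=True)
        for i, short in enumerate(group):
            if len(short["sequence"]) > max_short_len:
                continue
            short_seq = short["sequence"].upper()
            for longer in group[:i]:
                if len(longer["sequence"]) == len(short_seq):
                    continue  # equal length: not a "short part of a longer one"
                long_seq = longer["sequence"].upper()
                if short_seq in long_seq or reverse_complement(short_seq) in long_seq:
                    flagged.append((short["id"], longer["id"]))
                    break
    return flagged
```

Called on the records of one update, the returned pairs would be candidates for manual review or exclusion. Records from different BioSamples, or with different descriptions, are never compared with each other.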

@Mishmash-su: In the meantime, you can of course remove suspicious records from the downloaded data and re-create the database files you need (see issue #9 for how to extract the record sequences from provided data).
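As a minimal sketch of the removal step, assuming the record sequences have already been extracted to a FASTA file whose headers start with the accession: the function name and the example headers are hypothetical, and whether the accessions carry the `NZ_` prefix in your file should be checked first.

```python
def filter_fasta(lines, exclude_ids):
    """Yield FASTA lines, dropping every record whose ID is in `exclude_ids`.

    The record ID is taken to be the first whitespace-delimited token
    after '>' in the header line (an assumption about the header format).
    """
    keep = True
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            rec_id = tokens[0] if tokens else ""
            keep = rec_id not in exclude_ids
        if keep:
            yield line


# Small in-memory demonstration with made-up sequences.
example = [
    ">NZ_MT230289.1 example short record\n",
    "ACGT\n",
    ">NZ_MT230271.1 example long record\n",
    "GGCC\n",
    "TTAA\n",
]
kept = list(filter_fasta(example, {"NZ_MT230289.1", "NZ_MT230381.1"}))
```

For a real file, an open file handle can be passed as `lines` and the result written out with `writelines`. The filtered FASTA can then be used to re-create the BLAST database and Mash sketch.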

Best, Valentina

SmalJonni commented 2 years ago

The suggested inclusion check was implemented in the newest update.