bioinfo-ut / PlasmidSeeker

A k-mer based program for the identification of known plasmids from whole-genome sequencing reads
BSD 3-Clause "New" or "Revised" License

multiple hits for plasmid description in NCBI #26

Open waalkes opened 2 years ago

waalkes commented 2 years ago

I love your tool. Thanks for making it.

I am using it with a collection of WGS Shigella isolates and some of the plasmid descriptions have multiple hits on NCBI. Is there a way I can figure out which of the plasmids it is? Are all of them in your collection? The plasmids in your database are binary files.

Here are the two examples:

Shigella flexneri 1a strain 0228 plasmid, complete sequence (There are four of that name: CP012736.1, CP012734.1, CP012733.1 and CP012732.1)

Escherichia coli O104:H4 str. C227-11 plasmid, complete sequence (There are six: CP011332.1 to CP011337.1)

Also, this one doesn't appear to be in NCBI at all: Xuhuaishuia manganoxidans strain DY6-4 plasmid sequence

Thanks,

Adam Waalkes Research Scientist UWMC

bioinfo-ut commented 2 years ago

You can download the FASTA file of all plasmids used to build the latest database here: https://bioinfo.ut.ee/plasmidseeker/plasmidseeker_db_w20_fna_Nov-2021.tar.gz This may help to track down the actual identity of hits.
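As a rough illustration of how the extracted FASTA files could be used for this, the sketch below scans the header line of every .fna file in a directory and reports the accession of each record whose description contains a given text. The function name `find_plasmids_by_description` and the directory layout (one or more standard `>`-prefixed FASTA records per `.fna` file) are assumptions made for this example, not part of PlasmidSeeker.

```python
from pathlib import Path


def find_plasmids_by_description(fna_dir, description):
    """Return (file name, accession, description) for matching FASTA headers.

    Hypothetical helper: assumes each .fna file in `fna_dir` holds standard
    FASTA records whose header lines start with '>' and begin with the
    accession, followed by a space and the free-text description.
    """
    needle = description.lower()
    hits = []
    for path in sorted(Path(fna_dir).glob("*.fna")):
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                if not line.startswith(">"):
                    continue  # sequence line, not a header
                header = line[1:].strip()
                if needle in header.lower():
                    # Split "ACCESSION rest of description" at the first space.
                    accession, _, rest = header.partition(" ")
                    hits.append((path.name, accession, rest))
    return hits
```

Calling, for example, `find_plasmids_by_description("db_w20_fna", "Shigella flexneri 1a strain 0228 plasmid")` would list every file and accession carrying that description, so hits are matched by header text rather than by file number.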

waalkes commented 2 years ago

Thanks, that helps a lot. It appears, though, that the file numbering is not consistent between the .list files and the .fna files.

PlasmidSeeker output:

K-mers found Total kmers %Kmers found(F) Copy number P-value Plasmid ID Coverage List file

MinP 80

Estimated bacteria isolate median coverage 94

Number of tests: 16 Significant p-value (initial 0.05) with correction: 0.003125

# PLASMID CLUSTER 1

149428 166885 89.54% 1.12 0 Shigella flexneri 2002017 plasmid pSFxv_1, complete sequence 105 /mnt/disk4/labs/salipante/programs/PlasmidSeeker/db_w20/plasmid_3951.fna_20.list

Yet when I look at plasmid_3951.fna, it is not the same plasmid:

tron:/mnt/disk4/labs/salipante/programs/PlasmidSeeker/db_w20_fna $ head -n 1 plasmid_3951.fna

NZ_CP028268.1 Pediococcus pentosaceus strain SRCM102739 plasmid unnamed2, complete sequence

Is this by design? Both lists seem to contain the same number of plasmids so I assume I have the same db versions.

tron:/mnt/disk4/labs/salipante/programs/PlasmidSeeker $ ls -l db_w20/*.list | wc -l

19782

tron:/mnt/disk4/labs/salipante/programs/PlasmidSeeker $ ls -l db_w20_fna/*.fna | wc -l

19782