bioinfo-ut / PlasmidSeeker

A k-mer based program for the identification of known plasmids from whole-genome sequencing reads
BSD 3-Clause "New" or "Revised" License

Plasmid Seeker interpretation? #17

el-nino-007 closed 5 years ago

el-nino-007 commented 5 years ago

Hi,

I have two questions about your software.

  1. My data contains 2 FASTQ files and 2 filtered FASTQ files. Which pair should I use as input for PlasmidSeeker? I have used the filtered ones so far, but I am not sure that is the right choice.

  2. The results confuse me. I ran PlasmidSeeker on two of my bacterial genome sequencing datasets (both are filtered FASTQ files) and noticed that there are so-called "plasmid clusters" in the results file.

How can I determine the number of plasmids in my genome from results like these?

mihkelvaher commented 5 years ago

Hi!

1) The 2 FASTQ files are probably the two read files of the same sample (R1 and R2, typically the forward and reverse reads of a paired-end run) and should indeed be given together as input. As for the filtering, it depends on what kind of filtering was done. If only adapters, low-quality sequences and the like were removed, the filtered files are fine to use. If, on the other hand, only some specific sequences were kept (16S rRNA, plasmid ori, etc.), the unfiltered reads are the way to go, since PlasmidSeeker assumes that the whole sequence was in the sample (even if some of it was not captured due to low coverage).
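To make "given together" concrete: one simple way to hand both read files to a tool that takes a single read file is to concatenate them. The sketch below is only an illustration. The helper name `combine_fastq` and the file names in the usage note are hypothetical, and whether PlasmidSeeker needs one combined file should be checked against its README.

```python
import gzip
from pathlib import Path


def _open_maybe_gzip(path):
    """Open a FASTQ file for binary reading, handling .gz transparently."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rb")
    return open(path, "rb")


def combine_fastq(inputs, output):
    """Concatenate several FASTQ files into one uncompressed FASTQ file.

    Hypothetical helper: it does no validation of the FASTQ records,
    it only copies bytes and makes sure each input ends with a newline
    so that records from different files do not run together.
    """
    with open(output, "wb") as out:
        for path in inputs:
            last_byte = b"\n"
            with _open_maybe_gzip(path) as handle:
                # Copy in 1 MiB chunks to keep memory use small.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    out.write(chunk)
                    last_byte = chunk[-1:]
            if last_byte != b"\n":
                out.write(b"\n")
    return output
```

For example, `combine_fastq(["sample_R1.fastq.gz", "sample_R2.fastq.gz"], "sample_combined.fastq")` would write both read files into one file (the file names are placeholders).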

2) Plasmid clustering is used because many of the plasmids in the database are very similar to each other: they share many or almost all of their k-mers, which makes it hard to tell which of them was in the sample. Furthermore, due to biological variation, the sequenced plasmid usually differs somewhat from the database sequence. The percentage of k-mers found for each plasmid in the cluster (by which the table is sorted) might give a clue. If the first plasmid has 100% of its k-mers found (this might be lower because of filtering) and the following plasmids have lower percentages, the best guess is that the first plasmid is the one in the sample.
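A toy sketch of why similar plasmids end up in one cluster, and how the percentage of found k-mers can still rank them. This is not PlasmidSeeker's code: the function names, the tiny sequences and `k = 5` are made up for illustration, and reverse complements are ignored for simplicity.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fraction_found(plasmid_seq, sample_kmers, k):
    """Fraction of the plasmid's distinct k-mers that occur in the sample."""
    reference = kmers(plasmid_seq, k)
    if not reference:
        return 0.0
    return len(reference & sample_kmers) / len(reference)


# Two made-up "database plasmids" that differ by a single base.
plasmid_a = "ACGTTGCATGGCTAAGCTTAGGCTA"
plasmid_b = plasmid_a[:12] + "C" + plasmid_a[13:]

# Pretend the reads covered plasmid A completely.
k = 5
sample = kmers(plasmid_a, k)

# Rank the cluster members by the fraction of their k-mers found.
ranking = sorted(
    [("plasmid_a", fraction_found(plasmid_a, sample, k)),
     ("plasmid_b", fraction_found(plasmid_b, sample, k))],
    key=lambda item: item[1],
    reverse=True,
)
for name, fraction in ranking:
    print(f"{name}: {fraction:.1%} of k-mers found")
```

Both plasmids share most of their k-mers with the sample, so both are reported, but only the one actually present reaches 100%.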

The copy number column shows the number of copies of that plasmid per bacterial genome.

Note that for a single sample, PlasmidSeeker expects only 1 bacterial genome to be present and uses this to estimate the plasmid copy numbers. If there are multiple bacterial genomes in the sample (in different amounts), there is no way to tell which plasmid comes from which bacterium, and therefore we cannot know which of the plasmid copy numbers is correct (for example, when there are 2 different bacteria in the sample).
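The ambiguity can be shown with a small sketch. It assumes, purely for illustration, that copy number is estimated as the ratio of the median k-mer coverage of the plasmid to the median k-mer coverage of the bacterial chromosome. The function name and all coverage values are invented and are not taken from PlasmidSeeker.

```python
from statistics import median


def copy_number(plasmid_kmer_counts, chromosome_kmer_counts):
    """Estimate plasmid copies per genome as a ratio of median coverages.

    Assumption for this sketch: both arguments are lists of per-k-mer
    read counts, and the median is a fair summary of the coverage.
    """
    chromosome_cov = median(chromosome_kmer_counts)
    if chromosome_cov == 0:
        raise ValueError("chromosome coverage is zero, cannot estimate copy number")
    return median(plasmid_kmer_counts) / chromosome_cov


# Invented coverage values for one plasmid and two possible hosts.
plasmid_cov = [88, 90, 92]
bacterium_1_cov = [29, 30, 31]   # less abundant bacterium in the mix
bacterium_2_cov = [89, 90, 91]   # more abundant bacterium in the mix

# The same plasmid coverage gives different answers depending on the host.
print(copy_number(plasmid_cov, bacterium_1_cov))
print(copy_number(plasmid_cov, bacterium_2_cov))
```

With one bacterium in the sample there is only one denominator and the estimate is unambiguous. With two, the reads alone do not say which denominator applies.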