Nucleomics-VIB / InSilico_PCR

extract long read subsequences from a pair of primers
5 stars 1 forks source link

Questions about aims, read length and perc. unclassified #1

Closed MaestSi closed 4 years ago

MaestSi commented 4 years ago

Hi Stephane, congrats for the effort. I would like to ask you some questions: 1- Is the ultimate aim of this pipeline the benchmarking of metabarcoding taxonomic classifiers starting from metagenomics data, or would you propose this as a mean for evaluating a sample composition in an unbiased way, avoiding PCR-related and completeness of genomic databases-related issues? 2- Why are the EPI2ME average read lengths so different to the expected amplicon size? 3- Why do you think the percentage of unclassified reads is so different between the 3 different primers sets? 4- In the "Comparing the results table" legend, you write that you reported in bold the closest genus hit. Did you mean the closest species hit? In fact, I saw you reported different results for different species belonging to the same genus. It would be interesting to have a table for classification accuracy at genus level too. 5- Did you evaluate the relative abundance with respect to the total number of reads, the total number of reads passing the q-score filtering or the total number of classified reads? Thanks, Simone

splaisan commented 4 years ago

HI Simone, 1) both and the first as a self evaluation as we know what should be fond. This can also be used to extract selection of genomic regions to mimic a single PCR, a panel, exome-seq, all promoters, whatever you can think of and have the time to process. 2) the epi2me histogram is worthless and takes too much account from outlier reads, just look at the peak bar which is OK. I think this is necause BBmap does not clip both ends of all DNA fragments and sometimes, longer fragments are kept which anyway will be aligned locally or rejected during classification (I hope!) 3) each primer set likely leads to a different amount of false positive reads which are then excluded during classification. I added a plot showing classified counts which are closer between sets. 4) nice pick, table added 5) total classified, I do not in that case really care about reads that do not classify. I think that BBMap does not take into account 3' bases when matching a primer so I expect a lot of false priming with global good sequence similarity leading to extraction of non-relevant sequences. I trust the classifier to take care of these non-amplicon sequences and except for the higher rate of V4 amplicons,the other two datasets are quite convergent.

This is of course only a first case, I used a cutoff of 80% for the BBMap extraction which could be highered, I wanted to avoid loosing too much data because of the error rate of ONT. More precision can likely be obtained with a higher cutoff value there.

Thanks for your comments

MaestSi commented 4 years ago

Thank you very much for the kind and detailed explanations. Keep on with this interesting work! And sorry for the first empty comment, was just struggling typing on the phone. Simone

MaestSi commented 4 years ago

Hi Stephane, just a further comment about your calculations about point #5. I noticed that in your calculations you didn't consider the total number of classified reads, but the total number of classified reads that are assigned to a genus identified by at least 1% of reads instead. So, for example, the total in your case would be 58675 instead of 80211 for primers 27F-U1492R. Since only actually present genera are identified by >1% reads, this results in a Genera table were all columns sum up to 100%, and therefore it would make me think that reads from Escherichia (since all primers rescue little reads that are classified to Escherichia) are mis-assigned to one of the other genera in the table. Maybe you could consider specifying this and adding that 26.8% reads ((80211-58675)/80211) have unreliable classification, since supporting low-abundance genera, or evaluating the percentages with respect to the total number of classified reads. In my opinion it would provide more accurate estimates about which species/genera are easier to identify and those for whom the reads are wrongly spread across some other spurious taxa. This is just a personal opinion, of course... :)

splaisan commented 4 years ago

you have a point, I only considered the top of the table with 1% filtering. The mis-assignment is apparently something known and probably goes to one of the other top genus that is above expectation.

I am still in the process of using 16S data to use your tool but did not find the time yet. My data is not usable as it is already fastq and your tool takes fats5 right?

Thanks for your remarks and feedback, I will edit accordingly asap.

MaestSi commented 4 years ago

Actually, if you gzip them, and end up with a set of BC\<numbers>.fastq.gz files it should be possible to run it with:

source activate MetONTIIME_env
nohup ./MetONTIIME.sh <working_dir> <metadata file> <sequences qiime2 artifact> <taxonomy qiime2 artifact> <threads> &

where \<metadata file> might not exist yet, \<working_dir> is the directory containing .fastq.gz files and the two qiime2 artifacts are those generated with the Import_database.sh script. When you have time, it would be interesting to see if using the whole set of genomics reads for the EPI2ME 16S workflow leads to similar proportions of identified genera, given that the runtime is going to be much higher (probably you should consider doing some random subsampling) and the percentage of classified reads is going to be much lower. Thanks for the discussion, Simone

splaisan commented 4 years ago

I just put a support ticket on your site after failing with the command I probably did something silly or my install is not OK Best

splaisan commented 4 years ago

Hi Simone, I added data from your tool, please feel free to comment on my text and results. Thanks once more for your help. When your tool could be a bit faster, I think it make a very neat alternative to the ONT black box.

MaestSi commented 4 years ago

Thank you for the kind words. Let's hope the guys at QIIME2 are going to implement a parallelized version of Blast+, but based on this plot it seems that Blast+ is already quite fast, compared to Blast-legacy, for example. As I told you, MetONTIIME is just a wrapper of pre-existing tools, I did not make any new assumptions or efforts for optimizing what I guess has already been optimized at ONT. Still, I hope its flexibility for changing the database may be useful to someone. Moreover, it should work without any internet connection, thus enabling on-field work! My only suggestion regarding the results is that it would be nice to add a screenshot of the taxa-bar-plots.qzv, maybe at genus level, because I think that it would allow a quick glance at the estimated composition using different primers. Thanks again, Simone

splaisan commented 4 years ago

Sure Simone, excuse my ignorance, but how can I plot from taxa-bar-plots.qzv (I am not yet up with Qiime, sorry) Can you point me to a page with command examples! OK found the viewer :-)