MaestSi / MetONTIIME

A Meta-barcoding pipeline for analysing ONT data in QIIME2 framework
GNU General Public License v3.0

Why these default settings? #16

Closed langse62171 closed 4 years ago

langse62171 commented 4 years ago

Hi Simon, thanks for the update of the MetONTIIME script, it's great to get the feature_table for every taxonomic level now! I have another short question about the pipeline. Why is 1400 bp with a window of 300 bp preset in the config while the SQK-RAB204 kit is selected at the same time? I may have something wrong, but I thought the product length for the primers of the SQK-RAB204 and SQK-16S024 kits is about 1500 bp, or about 1580 bp with barcode and adapter. Or by "amplicon length" do you mean the product without adapter, barcode and primer? Even then I still get about 1470 bp. So trimmed or untrimmed, I get over 1400 bp. Best, Sebastian

MaestSi commented 4 years ago

Hi! By amplicon length I mean the PCR product after adapter and primer trimming. If you use SQK-RAB204, the PCR primers should already be part of the adapters. However, if you set primers_length <- 25 you trim 50 additional bases (25 from each end) and end up with preprocessed reads that are about 1400 bp long, as in this histogram: [image: hist_Reads]

I did not verify it, but I am pretty sure that using that kit, it would be ok to set primers_length <- 0 and, accordingly, you may set the average length to 1450. However, mind that the amplicon length may vary a little bit as shown in the attached barplot, and the aim of this step is to remove chimeric sequences, which may be twice as long, and nonspecific PCR products, which may have a completely different length.
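The interplay between primer trimming and the length filter can be sketched in a few lines of Python. This is only an illustration, not the pipeline's code: the function names are hypothetical, and the sketch assumes the 300 bp window is centred on the amplicon length (150 bp on each side), which is not stated in this thread.

```python
def trimmed_length(read_length, primers_length):
    """Length left after removing primers_length bases from each end of a read."""
    return max(read_length - 2 * primers_length, 0)


def length_window(amplicon_length, window):
    """Accepted length range, ASSUMING the window is centred on amplicon_length."""
    half = window // 2
    return amplicon_length - half, amplicon_length + half


def passes_length_filter(read_length, amplicon_length, window):
    """True if the read length falls inside the accepted range."""
    low, high = length_window(amplicon_length, window)
    return low <= read_length <= high


if __name__ == "__main__":
    # A 1450 bp read with primers_length = 25 loses 50 bases in total.
    print(trimmed_length(1450, 25))
    # Default settings discussed here: 1400 bp amplicon, 300 bp window.
    print(length_window(1400, 300))
    # A chimeric read about twice as long falls outside the window.
    print(passes_length_filter(2800, 1400, 300))
```

With these assumptions, a read trimmed to about 1400 bp passes, while a chimera of roughly double length or a much shorter nonspecific product is discarded.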

langse62171 commented 4 years ago

Hi. That was a fast answer, thanks! :) But sorry, I still don't quite get it. I use the kit SQK-16S024, but it has the same primers as the kit SQK-RAB204. The question about the default settings came up because I get an average length of 1580 bp for the untrimmed, just basecalled and q-score filtered reads in my microbiome sequencing runs. The graphic shows one of them:

[image: median_reads]

I should have mentioned it before, but with the distinction adapter, barcode and primer I meant the three parts of the "total primer" marked as below:

Forward primer: 5' - ATCGCCTACCGTGAC - barcode - AGAGTTTGATCMTGGCTCAG - 3'
Reverse primer: 5' - ATCGCCTACCGTGAC - barcode - CGGTTACCTTGTTACGACTT - 3'
                     |---adapter---|  |barcode|  |----16S primer----|

Source: Nanopore (https://community.nanoporetech.com/technical_documents/chemistry-technical-document/v/chtd_500_v1_revv_07jul2016/barcoding-kits).

If I now understand your answer correctly, amplicon length (1400 bp) means the PCR product without "foreign DNA", that is, without adapter, barcode and 16S primer. But even then I cannot understand the length of 1400 bp. If I subtract the length of the two total primers above (118 bp) from the average length of 1580 bp in my own sequencing, I get an amplicon length of 1462 bp. Even if I check the above 16S primers with Primer-BLAST, I get an average PCR product of about 1502 bp and thus, minus the 16S primers (20 bp each), an expected amplicon length of 1462 bp. Because of this difference I now wonder which window I should choose for my own sequencing.
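The arithmetic in this comment can be checked with a short Python sketch. The adapter and 16S primer sequences are the ones quoted above; the barcode length of 24 bp is an assumption inferred from the quoted 118 bp total, and the function names are only illustrative.

```python
# Sequences as quoted in this thread.
ADAPTER = "ATCGCCTACCGTGAC"
FWD_16S_PRIMER = "AGAGTTTGATCMTGGCTCAG"
REV_16S_PRIMER = "CGGTTACCTTGTTACGACTT"
# ASSUMPTION: barcode length inferred from the 118 bp total, not stated in the thread.
BARCODE_LENGTH = 24


def total_primer_length(primer_16s):
    """Adapter + barcode + 16S primer, i.e. one 'total primer'."""
    return len(ADAPTER) + BARCODE_LENGTH + len(primer_16s)


def insert_length(observed_length, removed_bases):
    """Length that remains once the given number of bases is subtracted."""
    return observed_length - removed_bases


both_total_primers = total_primer_length(FWD_16S_PRIMER) + total_primer_length(REV_16S_PRIMER)

# Route 1: untrimmed read length minus both total primers.
from_reads = insert_length(1580, both_total_primers)
# Route 2: in silico PCR product minus the two 16S primers only.
from_primer_blast = insert_length(1502, len(FWD_16S_PRIMER) + len(REV_16S_PRIMER))

print(both_total_primers, from_reads, from_primer_blast)
```

Both routes give the same insert length, so the two estimates in the comment are consistent with each other.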

Maybe it would also help if you describe the individual steps of the preprocessing in more detail one after the other. Best regards, Sebastian

MaestSi commented 4 years ago

Hi Sebastian, I did not go into the details of the preprocessing in the readme because it is similar to the referenced manuscript, apart from quality filtering, which can now be applied, and Porechop, which is now unsupported and has been removed. If I look at the length distribution of the untrimmed reads, I get one that is very similar to yours: [image: Untrimmed_reads_length]

Kits SQK-RAB204 and SQK-16S024 are a particular case, since they come with pre-defined PCR primers included in the "adapter". I am using a broad concept of adapter here, meaning anything that can be recognised and trimmed by guppy_barcoder when specifying --trim_barcodes. For example, if you used the SQK-LSK109 kit, the adapters would not contain PCR primers. Anyway, the preprocessing steps are the following:
1- basecalling by guppy_basecaller
2- demultiplexing + adapter trimming + primer trimming by guppy_barcoder
3- length + quality filtering by NanoFilt

So, using SQK-16S024 it is ok to set the primers length to 0, since the primers are already part of the adapters. I left 25 bp as the default value because I think it is better to throw away 50 bp without any valid reason than to forget to trim PCR primers. I have updated the documentation in the config file to make this clearer. I hope this answers your questions. Simone

MaestSi commented 4 years ago

So, the short answer to your question is: it is expected that your amplicon length is about 1460 bp. The point is that if you trim an additional 50 bases (2 * the default primers_length value) you end up with about 1400 bp. Therefore, feel free to set primers_length <- 0 and amplicon_length <- 1450: your primers should already be trimmed, as they are part of the adapter recognised by guppy_barcoder.

langse62171 commented 4 years ago

Ah ok, now it's clear, thank you! I know this is off-topic, but do you happen to know whether Porechop also trims the whole primer, including the 16S primer, in this kit? This is important for us to know so we can compare our old data with the current data. Thanks again! Best, Sebastian

MaestSi commented 4 years ago

No, it doesn't: the Porechop adapters file does not contain those sequences. It just trims the barcode sequence + 2 nt (the --extra_end_trim parameter defaults to 2) and all the nt preceding the barcode at the 5' end of the read, or following the barcode at the 3' end of the read. I just did a small test to confirm this: if you look for the full primer sequence after replacing M with A or C (I mean the actual PCR primer) you won't find it frequently, but if you look for "AGTTTGATC" after Porechop, you will find it in almost half of the reads, namely almost all forward reads. This confirms that if you are using Porechop you should either set --extra_end_trim to the primers length, or afterwards use another tool, such as cutadapt, to trim the PCR primer sequence from your reads. Feel free to close the issue if you don't have any other questions! Best, Simone
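A check of this kind could look like the following Python sketch. It is not the test that was actually run: the helper names are hypothetical, the example reads are made up for illustration, and exact string matching ignores sequencing errors, so real counts would be lower than the true primer content.

```python
from itertools import product

# Only the IUPAC codes needed for the forward primer quoted in this thread.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "M": "AC"}


def expand_iupac(seq):
    """List every concrete sequence encoded by a primer with IUPAC codes."""
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in seq))]


def fraction_with_motif(reads, motifs):
    """Fraction of reads containing at least one motif as an exact substring."""
    if not reads:
        return 0.0
    hits = sum(any(m in read for m in motifs) for read in reads)
    return hits / len(reads)


if __name__ == "__main__":
    full_primers = expand_iupac("AGAGTTTGATCMTGGCTCAG")
    # Made-up toy reads, NOT real data: one starts with a primer remnant, one does not.
    toy_reads = ["AGTTTGATCATGGCTCAGATTGAACGCTGG", "GGCTACCTTGTTACGACTTCACCCCAGTCA"]
    print(full_primers)
    print(fraction_with_motif(toy_reads, full_primers))
    print(fraction_with_motif(toy_reads, ["AGTTTGATC"]))
```

Searching for the short motif as well as the fully expanded primers mirrors the comparison described above: the full primer is rarely intact after trimming, while the remnant is still present.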

langse62171 commented 4 years ago

Hi Simon, thanks for all the details, they help a lot! Also thanks for the check up, that's great! All the very best, Sebastian