Closed vrbacky closed 3 years ago
Your R1 reads and R2 reads should be of consistent length (all your R1s should be of one length and all your R2s should be of one length, but R1s and R2s don't need to be of the same length). This is because FIGARO is meant to run on untrimmed data. If it is run on data that has already been trimmed, especially using a smart, quality-aware trimmer, the error modeling for the reads will be changed from what was originally intended. I saw from your profile that you are also using DADA2's trimAndFilter functionality, which should remove any need for Trimmomatic before you start the standard DADA2 pipeline. FIGARO will also help you figure out some more appropriate maxEE values to set for your own data. As for the R1 and R2 files being out of sync, I never tested against that condition. I expect it will throw an error (as it should) because you generally want to avoid letting your FASTQ files get into that state. From your read-lengths above, am I right in guessing that you're running V1V2 or V3V4 sequencing on a miSeq?
Hi, I have similar data for V4 region and ~250 bp reads. We use a protocol where primers are not sequenced (Schloss protocol: https://github.com/SchlossLab/MiSeq_WetLab_SOP/blob/master/MiSeq_WetLab_SOP.md ). The fastq files off the MiSeq have slight variation in length (see seqkit results below) . Also as we don't have primer sequences in these reads there is no way in figaro to say that primers are of length 0.
I have seen this protocol being used more often now.
Can you please suggest how can I use figaro on such data? Thanks! Hena
file format type num_seqs sum_len min_len avg_len max_len
10_S10_L001_R1_001.fastq.gz FASTQ DNA 45,087 11,315,794 249 251 251
11_S11_L001_R1_001.fastq.gz FASTQ DNA 24,190 6,071,249 248 251 251
12_S12_L001_R1_001.fastq.gz FASTQ DNA 44,538 11,178,560 247 251 251
13_S13_L001_R1_001.fastq.gz FASTQ DNA 133 33,379 250 251 251
14_S14_L001_R1_001.fastq.gz FASTQ DNA 1,944 487,869 249 251 251
15_S15_L001_R1_001.fastq.gz FASTQ DNA 11,203 2,811,851 248 251 251
16_S16_L001_R1_001.fastq.gz FASTQ DNA 57,818 14,511,642 248 251 251
17_S17_L001_R1_001.fastq.gz FASTQ DNA 444,893 111,663,646 248 251 251
18_S18_L001_R1_001.fastq.gz FASTQ DNA 99,411 24,951,543 248 251 251
19_S19_L001_R1_001.fastq.gz FASTQ DNA 233,555 58,620,611 248 251 251
1_S1_L001_R1_001.fastq.gz FASTQ DNA 354,091 88,840,649 248 250.9 251
I use Trimmomatic in Paired-end mode to trim R1 and R2 to the same length using "CROP:250 MINLEN:250" parameters. It cuts the last low-quality base of full-length products and drops all more than one base shorter than expected It keeps R1 and R2 synced. I lose about acceptable 3% of all reads in this step. Try it with your data. The first several bases are usually of lower quality in Illumina sequencing data so I cut them anyway (they are primers in our protocol). I'd check quality (e.g. by fastqc) of all sequencing files and remove those low-quality bases (using crop length as "primers" in FIGARO), if I was in your shoes.
Thanks for the suggestion! I used cutadapt usually so will check if there is a way to do it.
This has been an interesting thread and it echoes some of my questions. I really appreciate what this program does, but I think a few clarifications about workflow specifics could be useful.
A better understanding of the input parameters would be helpful for me. The necessity of the expected length of the target amplicon (biological bases only), seems clear enough to me; it should be essential for getting paired reads to merge. However, I don’t understand what Figaro uses the forward and reverse primer lengths for. Is Figaro expecting that the forward and reverse primers are present in the front of the forward and reverse reads, respectively, and that the user will trim them from the front of the forward and reverse reads, respectively, i.e. with the DADA2::filterAndTrim() trimLeft parameter?
If so, and as noted in the above comments, what would one do in the case in which the primers were not sequenced? I suppose with the current setup of the program (since 0 is not currently accepted for primer length), the best thing to do would be to set the primer lengths to the minimum allowed value (i.e. 1) , but that does not seem ideal.
Please correct me if my understanding of the program’s functioning is wrong.
I am working on having this accept a zero primer length. When I wrote it, I wasn't seeing anybody running pipelines where the primers are entirely left out of the reads.
The reason it asks for the primer lengths is it wants to model the expected error accumulation over the length of the read as DADA2 would see it, but the primers are left trimmed and need to be excluded. Your guess as to the reasoning was spot-on.
I am working on a release that will allow for primer lengths of zero, but that is holding on a project I'll be releasing later this week (I hope) that will be immediately useful for COVID-19.
Just curiosity, what does everybody here study? I'm thinking of putting together a series of microbiome informatics webinars very soon.
@michael-weinstein , thanks for the reply and your work towards a new Figaro release. I hope the covid-19-related project is successful.
As a general comment on amplicon workflows, I have used the DADA2::filterAndTrim() trimLeft parameter to trim primers before, and it works fine. However, I prefer using cutadapt to trim primers because is more powerful (in its handling of insertions/deletions, among other features). Cutadapt helps to ensure that there are no primers left on the reads. Thus, I prefer to do all primer trimming (but no other “quality-based” trimming) before working with the sequences with DADA2. So in this case, I would use zero as my primer length arguments for Figaro, regardless of whether the primers were sequenced originally. What are your thoughts on that workflow regarding the robustness of the Figaro algorithm?
I am doing my PhD in a “geomicrobiology” project on H2-consuming microbes in an aquifer where water-rock reaction produces abundant abiotic H2.
As I've found out, all R1 sequences must be of the same length. R2 too. Is it necessary to have sequences in R2 files synchonized with sequences in R1 file (same number and order)? If I trim all R1 files to e.g. 248 bases and R2 files to 245 bases using Trimmomatic, different sequnces will be removed from R1 and R2 files. Thanks.