Zymo-Research / figaro

An efficient and objective tool for optimizing microbiome rRNA gene trimming parameters
GNU General Public License v3.0
78 stars 24 forks source link

Do we have to use "synced" R1 and R2 files? #13

Closed vrbacky closed 3 years ago

vrbacky commented 4 years ago

As I've found out, all R1 sequences must be of the same length. R2 too. Is it necessary to have sequences in R2 files synchonized with sequences in R1 file (same number and order)? If I trim all R1 files to e.g. 248 bases and R2 files to 245 bases using Trimmomatic, different sequnces will be removed from R1 and R2 files. Thanks.

michael-weinstein commented 4 years ago

Your R1 reads and R2 reads should be of consistent length (all your R1s should be of one length and all your R2s should be of one length, but R1s and R2s don't need to be of the same length). This is because FIGARO is meant to run on untrimmed data. If it is run on data that has already been trimmed, especially using a smart, quality-aware trimmer, the error modeling for the reads will be changed from what was originally intended. I saw from your profile that you are also using DADA2's trimAndFilter functionality, which should remove any need for Trimmomatic before you start the standard DADA2 pipeline. FIGARO will also help you figure out some more appropriate maxEE values to set for your own data. As for the R1 and R2 files being out of sync, I never tested against that condition. I expect it will throw an error (as it should) because you generally want to avoid letting your FASTQ files get into that state. From your read-lengths above, am I right in guessing that you're running V1V2 or V3V4 sequencing on a miSeq?

ramay commented 4 years ago

Hi, I have similar data for V4 region and ~250 bp reads. We use a protocol where primers are not sequenced (Schloss protocol: https://github.com/SchlossLab/MiSeq_WetLab_SOP/blob/master/MiSeq_WetLab_SOP.md ). The fastq files off the MiSeq have slight variation in length (see seqkit results below) . Also as we don't have primer sequences in these reads there is no way in figaro to say that primers are of length 0.

I have seen this protocol being used more often now.

Can you please suggest how can I use figaro on such data? Thanks! Hena

file                         format  type  num_seqs      sum_len  min_len  avg_len  max_len
10_S10_L001_R1_001.fastq.gz  FASTQ   DNA     45,087   11,315,794      249      251      251
11_S11_L001_R1_001.fastq.gz  FASTQ   DNA     24,190    6,071,249      248      251      251
12_S12_L001_R1_001.fastq.gz  FASTQ   DNA     44,538   11,178,560      247      251      251
13_S13_L001_R1_001.fastq.gz  FASTQ   DNA        133       33,379      250      251      251
14_S14_L001_R1_001.fastq.gz  FASTQ   DNA      1,944      487,869      249      251      251
15_S15_L001_R1_001.fastq.gz  FASTQ   DNA     11,203    2,811,851      248      251      251
16_S16_L001_R1_001.fastq.gz  FASTQ   DNA     57,818   14,511,642      248      251      251
17_S17_L001_R1_001.fastq.gz  FASTQ   DNA    444,893  111,663,646      248      251      251
18_S18_L001_R1_001.fastq.gz  FASTQ   DNA     99,411   24,951,543      248      251      251
19_S19_L001_R1_001.fastq.gz  FASTQ   DNA    233,555   58,620,611      248      251      251
1_S1_L001_R1_001.fastq.gz    FASTQ   DNA    354,091   88,840,649      248    250.9      251
vrbacky commented 4 years ago

I use Trimmomatic in Paired-end mode to trim R1 and R2 to the same length using "CROP:250 MINLEN:250" parameters. It cuts the last low-quality base of full-length products and drops all more than one base shorter than expected It keeps R1 and R2 synced. I lose about acceptable 3% of all reads in this step. Try it with your data. The first several bases are usually of lower quality in Illumina sequencing data so I cut them anyway (they are primers in our protocol). I'd check quality (e.g. by fastqc) of all sequencing files and remove those low-quality bases (using crop length as "primers" in FIGARO), if I was in your shoes.

ramay commented 4 years ago

Thanks for the suggestion! I used cutadapt usually so will check if there is a way to do it.

danote commented 4 years ago

This has been an interesting thread and it echoes some of my questions. I really appreciate what this program does, but I think a few clarifications about workflow specifics could be useful.

A better understanding of the input parameters would be helpful for me. The necessity of the expected length of the target amplicon (biological bases only), seems clear enough to me; it should be essential for getting paired reads to merge. However, I don’t understand what Figaro uses the forward and reverse primer lengths for. Is Figaro expecting that the forward and reverse primers are present in the front of the forward and reverse reads, respectively, and that the user will trim them from the front of the forward and reverse reads, respectively, i.e. with the DADA2::filterAndTrim() trimLeft parameter?

If so, and as noted in the above comments, what would one do in the case in which the primers were not sequenced? I suppose with the current setup of the program (since 0 is not currently accepted for primer length), the best thing to do would be to set the primer lengths to the minimum allowed value (i.e. 1) , but that does not seem ideal.

Please correct me if my understanding of the program’s functioning is wrong.

michael-weinstein commented 4 years ago

I am working on having this accept a zero primer length. When I wrote it, I wasn't seeing anybody running pipelines where the primers are entirely left out of the reads.

The reason it asks for the primer lengths is it wants to model the expected error accumulation over the length of the read as DADA2 would see it, but the primers are left trimmed and need to be excluded. Your guess as to the reasoning was spot-on.

I am working on a release that will allow for primer lengths of zero, but that is holding on a project I'll be releasing later this week (I hope) that will be immediately useful for COVID-19.

michael-weinstein commented 4 years ago

Just curiosity, what does everybody here study? I'm thinking of putting together a series of microbiome informatics webinars very soon.

danote commented 4 years ago

@michael-weinstein , thanks for the reply and your work towards a new Figaro release. I hope the covid-19-related project is successful.

As a general comment on amplicon workflows, I have used the DADA2::filterAndTrim() trimLeft parameter to trim primers before, and it works fine. However, I prefer using cutadapt to trim primers because is more powerful (in its handling of insertions/deletions, among other features). Cutadapt helps to ensure that there are no primers left on the reads. Thus, I prefer to do all primer trimming (but no other “quality-based” trimming) before working with the sequences with DADA2. So in this case, I would use zero as my primer length arguments for Figaro, regardless of whether the primers were sequenced originally. What are your thoughts on that workflow regarding the robustness of the Figaro algorithm?

I am doing my PhD in a “geomicrobiology” project on H2-consuming microbes in an aquifer where water-rock reaction produces abundant abiotic H2.