Merging FASTQ files in paired-end sequencing?

RockefellerUniversity / Bioconductor_Introduction

Bioconductor Training

https://rockefelleruniversity.github.io/Bioconductor_Introduction/

GNU General Public License v3.0

4 stars 1 forks source link

Merging FASTQ files in paired-end sequencing? #16

Open zapaterc opened 3 years ago

zapaterc commented 3 years ago

Hi Thomas and Matt,

I'm building some code to do QC in one RNAseq project that I'm working on... I have 2 fastq.gz files per sample and I just realized this is because there's one fastq.gz file per each paired-end sequencing run.

I was wondering if there's a way (a set of functions?) to integrate both files together so that I end up with just one fastq.gz per sample (so that I can also use that one file to align my reads)? Or which way would you recommend in order to move forward?

Thanks a lot!

Carolina

P.S.: @matthew-paul-2006, I wonder if the reason why my QC looked "clean" at the ends was because I did the analysis with run number 2 of the paired-end sequencing??

ThomasCarroll commented 3 years ago

hi Caroline,

To concatenate two fastqs samples you could use the Rfastp package's catfastq function. @cafeblue wrote this so can help if you have any issues. http://www.bioconductor.org/packages/release/bioc/manuals/Rfastp/man/Rfastp.pdf

Just to be clear on when to merge,

If you have paired-end data you should typically have at least one read1 and one read2 fastq file.

These represent either end of the fragment and so are used in alignment as separate files (not merged). Typically denoted by R1 and R2 in name of file i.e. NAMEofSampleR1.fq and NAMEofSampleR2.fq

If you have multiples of R1 and R2, you can merge the R1 files together and then R2 files together prior to alignment as long as you ensure files being merged for R1 and R2 are in the same order i.e Sequencing Run1, Run2, Run3 fastqs are concatenated in that order for both the final merged R1 and R2 files.

tom

zapaterc commented 3 years ago

Hi Tom,

This was very helpful, thank you! I do have just one R1 and R2 file per sample. I just did the alignment but I'm still curious: When/How do I merge the information for the aligned data? There has to be a moment in the process where we merge those two runs, right? I might have missed something in the lecture.

Also, while using the rfastp() function in order to do some basic QC I observe the following:

If I only input read1 the % of duplication rate is about 67%.
If I input both read1 and read2 I get about 6%.

I'm not sure I understand why the duplication rate is overestimated when you make the function think that the data comes from single-end sequencing.

As a follow-up question: If i use the readFastq() function followed by srduplicated() in order to get the duplicate reads in R1 I obtain approximately a 82% of duplicate reads.

Screen Shot 2021-07-21 at 02 18 03

Sorry for the lengthy questions and thanks a lot in advance!

Carolina

matthew-paul-2006 commented 3 years ago

We talk about paired end data in subsequent sessions on technique specific analysis, so you haven't missed anything. So the question of merging all depends on the experimental approach, but for most cases you would do this at the alignment step. For RNAseq paired end, you can provide the aligner with both fastqs and it will align considering both reads. You can check the function help page to get the exact input variables and order you need to supply the file names.

As for the QC plots you showed me, it is unlikely that they would be gated in that way, based on just providing the fastq read2, instead of whole dataset. In fact it is often the case that the second read has slightly lower quality. Based on what you said the data had already been trimmed, and it fits with that explanation (especially as you said the read length varied a lot), but your best bet is to just check in with the folks who did the initial processing for you.