BIMSBbioinfo / pigx_rnaseq

Bulk RNA-seq Data Processing, Quality Control, and Downstream Analysis Pipeline
GNU General Public License v3.0

Multiple fastq input files per sample? #105

Closed (smoe closed this issue 2 years ago)

smoe commented 2 years ago

Hello, the samples in front of me have been sequenced on multiple lanes, single-end:

...
SamplesHuman/12_1/Files/12-1_S8_L003_R1_001.fastq.gz
SamplesHuman/12_3/Files/12-3_S10_L003_R1_001.fastq.gz
SamplesHuman/12_3/Files/12-3_S10_L004_R1_001.fastq.gz
SamplesHuman/12_3/Files/12-3_S10_L001_R1_001.fastq.gz
SamplesHuman/12_3/Files/12-3_S10_L002_R1_001.fastq.gz
SamplesHuman/21_2/Files/21-2_S16_L004_R1_001.fastq.gz
SamplesHuman/21_2/Files/21-2_S16_L001_R1_001.fastq.gz
SamplesHuman/21_2/Files/21-2_S16_L002_R1_001.fastq.gz
SamplesHuman/21_2/Files/21-2_S16_L003_R1_001.fastq.gz
SamplesHuman/22_1/Files/22-1_S1_L001_R1_001.fastq.gz
SamplesHuman/22_1/Files/22-1_S1_L002_R1_001.fastq.gz
SamplesHuman/22_1/Files/22-1_S1_L003_R1_001.fastq.gz
SamplesHuman/22_1/Files/22-1_S1_L004_R1_001.fastq.gz
...

Hence the question in the title, unless you have alternative suggestions. Or should the lanes be treated as technical replicates? The RNA is poly-A enriched and every file has about 18M reads. I am tempted to just zcat these lanes sample-wise into a single FASTQ file per sample. Or should I do both and then compare?
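The sample-wise concatenation mentioned here could be sketched as follows. This is only an illustration, not part of pigx_rnaseq: the helper names `group_by_sample` and `merge_lanes`, the output naming, and the assumption that every file follows the Illumina-style `<sample>_S<n>_L<lane>_R<read>_001.fastq.gz` pattern seen in the listing above are all hypothetical. It relies on the fact that gzip files can be joined byte-wise into a valid multi-member gzip stream, which is what `cat a.gz b.gz > merged.gz` does as well.

```python
import re
import shutil
from collections import defaultdict
from pathlib import Path

# Assumed Illumina-style name: <sample>_S<n>_L<lane>_R<read>_001.fastq.gz
LANE_PATTERN = re.compile(
    r"^(?P<sample>.+_S\d+)_L(?P<lane>\d{3})_(?P<read>R[12])_001\.fastq\.gz$"
)


def group_by_sample(paths):
    """Map (sample, read) to its lane files, sorted by lane number."""
    groups = defaultdict(list)
    for path in map(Path, paths):
        match = LANE_PATTERN.match(path.name)
        if match is None:
            continue  # skip anything that is not a per-lane FASTQ file
        groups[(match["sample"], match["read"])].append((match["lane"], path))
    return {key: [p for _, p in sorted(files)] for key, files in groups.items()}


def merge_lanes(paths, out_dir):
    """Concatenate the lanes of each sample into one .fastq.gz file.

    The compressed files are copied byte-wise, so nothing is decompressed;
    the result is a multi-member gzip stream.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    merged = {}
    for (sample, read), lane_files in group_by_sample(paths).items():
        target = out_dir / f"{sample}_{read}.fastq.gz"
        with open(target, "wb") as out:
            for lane_file in lane_files:
                with open(lane_file, "rb") as src:
                    shutil.copyfileobj(src, out)
        merged[(sample, read)] = target
    return merged
```

A call such as `merge_lanes(Path("SamplesHuman").glob("*/Files/*.fastq.gz"), "merged")` would then yield one file per sample and read, which can be listed in the sample sheet like any other single-file sample.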

Many thanks!

al2na commented 2 years ago

I would treat them as technical replicates and do one round of the pipeline. If they end up being very close to each other in PCA or clustering, I would concatenate them. Otherwise, there might be lane-specific effects that you need to consider as covariates.
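One way to make "very close to each other in PCA" concrete is sketched below. It is a minimal illustration under assumptions, not something the pipeline provides: it assumes a genes x libraries count matrix with one column per lane file, and the function names and the within/between distance ratio are hypothetical choices.

```python
import numpy as np


def log_cpm(counts, prior=1.0):
    """log2 counts-per-million for a genes x libraries count matrix."""
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0, keepdims=True)
    return np.log2(counts / lib_sizes * 1e6 + prior)


def pca_scores(expr, n_components=2):
    """Project the libraries (columns of expr) onto the top principal components."""
    centered = expr - expr.mean(axis=1, keepdims=True)  # center each gene
    # SVD of the libraries x genes matrix; rows of u * s are the PC scores
    u, s, _ = np.linalg.svd(centered.T, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def within_between_ratio(scores, sample_ids):
    """Mean distance between lanes of the same sample divided by the
    mean distance between lanes of different samples."""
    sample_ids = np.asarray(sample_ids)
    dist = np.linalg.norm(scores[:, None, :] - scores[None, :, :], axis=-1)
    same = sample_ids[:, None] == sample_ids[None, :]
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)  # count each pair once
    return dist[same & upper].mean() / dist[~same & upper].mean()
```

A ratio far below 1 means the lanes of a sample sit much closer together than different samples do, which would support concatenating them. What counts as "far below" is a judgment call, and a plot of the first two components remains worth a look.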

Best, Altuna

In reply to Steffen Möller's message of Tue, Oct 12, 2021, 2:17 PM.

smoe commented 2 years ago

> i would treat them as technical replicates and do one round of the pipeline, if they end up being very close to each other in PCA or clustering I would concatenate them.

I'll do that. The meta-question is whether this situation is common enough to prepare pigx-rnaseq for it.

> Otherwise, there might be lane-specific effects you might need to consider as covariates.

Typically, the RNA and its library prep are the same (physically identical) for each lane. But yes, there can be lane effects. These effects are all independent of each other, though, so lane 1 of sample 1 is completely independent of lane 1 of sample 2. The lane is hence not a regular covariate.

My hunch is that something could be learned from the lanes, such as an "abundance-gene-length-dependent" standard deviation for gene counts. We could also get that from subsampling(?) after concatenation, but that estimate would not be aware of the lane effect.
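A rough sketch of both ideas, under assumptions: `sd_by_abundance` summarises the per-gene standard deviation across the lanes of one sample by abundance bin, and `thin_counts` splits merged counts into pseudo-lanes by random subsampling. Both function names are hypothetical, the input is assumed to be a genes x lanes count matrix for a single sample, and gene length is left out (it could serve as a second binning variable). The thinning reproduces only the sampling noise, so comparing the two curves is one way to see whether real lanes vary more than subsampling alone would predict.

```python
import numpy as np


def sd_by_abundance(counts, n_bins=10):
    """Per-gene SD across lanes, averaged within abundance bins.

    counts: genes x lanes matrix for one sample.
    Returns an array with one row per bin: (mean CPM, mean SD of CPM).
    """
    counts = np.asarray(counts, dtype=float)
    cpm = counts / counts.sum(axis=0, keepdims=True) * 1e6
    mean = cpm.mean(axis=1)
    sd = cpm.std(axis=1, ddof=1)
    expressed = mean > 0  # drop genes with no reads in any lane
    mean, sd = mean[expressed], sd[expressed]
    bins = np.array_split(np.argsort(mean), n_bins)  # equal-sized bins
    return np.array([[mean[b].mean(), sd[b].mean()] for b in bins if len(b)])


def thin_counts(total, n_parts, rng):
    """Randomly split merged per-gene counts into n_parts pseudo-lanes.

    Sequential binomial draws give an equal-probability multinomial split,
    so each row of the result sums to the original count.
    """
    remaining = np.asarray(total).copy()
    parts = []
    for k in range(n_parts - 1):
        draw = rng.binomial(remaining, 1.0 / (n_parts - k))
        parts.append(draw)
        remaining = remaining - draw
    parts.append(remaining)
    return np.stack(parts, axis=1)
```

For example, `sd_by_abundance(lane_counts)` on the real lanes can be compared with `sd_by_abundance(thin_counts(lane_counts.sum(axis=1), 4, np.random.default_rng(1)))`; a clearly higher curve for the real lanes would point to a lane effect beyond sampling noise.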

borauyar commented 2 years ago

In my experience, technical replicates are much rarer than they used to be. I would think most users won't have them, and if they do, it would usually be okay to merge the technical replicates into single files.