desmodus1984 commented 1 year ago

Hi,

I want to trim EM-SEQ fastq files. I used the same code, first for a single pair, and then for a batch. The code for the first pair was:

trim_galore --2colour 20 --illumina -o trim --paired V00001_R1.fastq.gz V00001_R2.fastq.gz

and the output was: V00001_R1_val_1.fq.gz V00001_R2_val_2.fq.gz

The summary stated trimming mode - paired end:

SUMMARISING RUN PARAMETERS

Input filename: V00001_R1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.10
Cutadapt version: 1.18
Number of cores used for trimming: 1
Quality encoding type selected: ASCII+33
Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; user defined)
2-colour high quality G-trimming enabled, with quality cutoff: --nextseq-trim=20
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 20 bp
Output file will be GZIP compressed

Then, for a second pair I used the code:

trim_galore --2colour 20 --illumina --output_dir=trim -j 4 --paired V00021_R1.fastq.gz V00021_R2.fastq.gz

The output files were: V00021_R1_trimmed.fq.gz V00021_R2_trimmed.fq.gz

And the summary:

SUMMARISING RUN PARAMETERS

Input filename: V00021_R1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.10
Cutadapt version: 1.18
Python version: could not detect
Number of cores used for trimming: 4
Quality encoding type selected: ASCII+33
Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; user defined)
2-colour high quality G-trimming enabled, with quality cutoff: --nextseq-trim=20
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 20 bp
Output file will be GZIP compressed

Why do the output files for the first pair carry the val_1/val_2 suffix, while those for the second pair are only named trimmed?

Is there something about the command that I missed, or is this an effect of running in multithreaded mode?

Thanks.

FelixKrueger commented 1 year ago

If you still have files called *trimmed.fq.gz around in paired-end mode, it is likely that the run hasn't completely finished. Once the validation process is complete, both intermediate trimmed.fq.gz files will be deleted.
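A quick way to tell the two states apart is to look at which files are present in the output directory. The following is a minimal sketch, assuming the file naming seen in this thread (`<sample>_R1_val_1.fq.gz` and `<sample>_R2_val_2.fq.gz` for finished output, `<sample>_R?_trimmed.fq.gz` for intermediates). `trim_status` is a hypothetical helper written for illustration; it is not part of Trim Galore.

```shell
# Hypothetical helper (not part of Trim Galore): report the state of one
# paired-end sample in a Trim Galore output directory.
# Assumes the <sample>_R1/_R2 naming used in this thread.
trim_status() {
    local dir="$1" sample="$2"
    local val1="$dir/${sample}_R1_val_1.fq.gz"
    local val2="$dir/${sample}_R2_val_2.fq.gz"
    local tmp1="$dir/${sample}_R1_trimmed.fq.gz"
    local tmp2="$dir/${sample}_R2_trimmed.fq.gz"

    if [ -e "$tmp1" ] || [ -e "$tmp2" ]; then
        # Intermediates still present: validation has not finished
        # (the run is still in progress or was interrupted).
        echo "incomplete"
    elif [ -s "$val1" ] && [ -s "$val2" ]; then
        # Both validated files exist and are non-empty.
        echo "complete"
    else
        echo "missing"
    fi
}

# Example usage:
#   trim_status trim V00021
```

If this reports `incomplete` for a run that is no longer active, the job was most likely killed or ran out of resources before validation, and rerunning that pair is the safest option.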

As a side note, if this trimming is for methylation alignments, I would recommend the trimming setting described here: http://felixkrueger.github.io/Bismark/bismark/library_types/#em-seq-neb

tamuanand commented 1 year ago

Hi @FelixKrueger

Related questions specific to EM-Seq:

1. I assume one has to explicitly run trim_galore on the R1/R2 files first and then pass the trimmed R1/R2 files to bismark. Is that correct?
2. Based on your comment above, should I explicitly specify --clip_R1 10 --clip_R2 10 --three_prime_clip_R1 10 --three_prime_clip_R2 10 when running trim_galore, or not? The legend below the table at https://felixkrueger.github.io/Bismark/bismark/library_types/ reads "Default settings (nothing in particular is required, just use Trim Galore or Bismark default parameters)".
3. If it is OK with you, would you know what the equivalent command with bbduk.sh would be? Given that bbduk is Java-based, I would expect this step to be much faster.
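For reference, the explicit-clipping variant asked about in the second question could be written as below. This is only a sketch: the 10 bp values are the ones quoted in the question, the `--2colour 20 --illumina` options are copied from the commands at the top of this thread, and `emseq_trim_cmd` is a hypothetical helper that prints the command rather than running it. Whether these values suit a given library should be checked against the linked Bismark library types page.

```shell
# Hypothetical helper: print (do not run) a paired-end Trim Galore command
# with the explicit clipping values quoted in the question above.
# Arguments: R1 file, R2 file, optional output directory (default: trim).
# Note: paths containing spaces are not quoted in the printed command.
emseq_trim_cmd() {
    local r1="$1" r2="$2" outdir="${3:-trim}"
    printf 'trim_galore --paired --2colour 20 --illumina -o %s --clip_R1 10 --clip_R2 10 --three_prime_clip_R1 10 --three_prime_clip_R2 10 %s %s\n' \
        "$outdir" "$r1" "$r2"
}

# Example usage (prints the command so it can be inspected before running):
#   emseq_trim_cmd V00001_R1.fastq.gz V00001_R2.fastq.gz
```

Printing the command first makes it easy to review the options for a batch of samples before anything is actually trimmed.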

Thanks.

FelixKrueger commented 1 year ago

You don't necessarily have to use Trim Galore, but yes, some trimming is recommended. The nf-core/methylseq pipeline has an EM-seq switch which should work equally well:

--EM-seq

tamuanand commented 1 year ago

Regarding "the nf-core/methylseq pipeline has an EM-seq switch which should work equally": I think this still uses Trim Galore under the hood.

FelixKrueger / TrimGalore

valX files vs trimmed files? diff output same code? #162
