MikkelSchubert / adapterremoval

AdapterRemoval v2 - rapid adapter trimming, identification, and read merging
http://adapterremoval.readthedocs.io/
GNU General Public License v3.0

Input file is overwritten and cut off #53

Closed TCLamnidis closed 3 years ago

TCLamnidis commented 3 years ago

Hi @MikkelSchubert !

I am looking for a sensible way to separate the adapter clipping functionality of AR from the collapsing functionality, and have run into an odd behaviour.

I am using public data from the ENA (https://www.ebi.ac.uk/ena/browser/view/PRJEB30331); the md5sums of my downloads match those listed by the ENA. I am using version 2.3.2 from bioconda.

I started out by removing the adapters from the fastqs without any filtering or trimming.

AdapterRemoval --file1 ../ERR3003613_1.fastq.gz --file2 ../ERR3003613_2.fastq.gz --basename CS01.pe  \
--adapter1 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC' \
--adapter2 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA' --minadapteroverlap 1 

The resulting files look fine.

$ wc -l CS01.pe.pair*
      37892680 CS01.pe.pair1.truncated
      37892680 CS01.pe.pair2.truncated

I then try to collapse, trim and filter the adapter clipped files:

$ AdapterRemoval --file1 CS01.pe.pair1.truncated --file2 CS01.pe.pair2.truncated --basename CS01.pe \
--qualitymax 41 --trimns --trimqualities --minlength 30 --minquality 20 --collapse

Trimming paired end reads ...
Opening FASTQ file 'CS01.pe.pair1.truncated', line numbers start at 1
Opening FASTQ file 'CS01.pe.pair2.truncated', line numbers start at 1
Error reading FASTQ record at line 24661; aborting:
    partial FASTQ record; cut off after sequence
Aborting thread due to error.
ERROR: AdapterRemoval did not run to completion;
       do NOT make use of resulting trimmed reads!

I then checked the input files again:

$ wc -l CS01.pe.pair*
       600 CS01.pe.pair1.truncated
       600 CS01.pe.pair2.truncated

Across multiple attempts, the line at which the error is reported changes, but the input files always end up with exactly 600 lines.

TCLamnidis commented 3 years ago

A bit of extra context: I am trying to remove adapters in one step and trim and collapse in another because I want to use the demultiplexing functionality of AR to deal with internal barcodes in the dataset. It makes sense to me to do that BEFORE collapsing the reads, but it cannot be done before removing the adapters.

MikkelSchubert commented 3 years ago

The problem is that you are using the same --basename in both commands. As a result, the second command reads from CS01.pe.pair*.truncated while also writing the read pairs that were not merged to those same files.

Files are opened for writing in a lazy manner (part of the support for file handle limits needed while demultiplexing many samples), so AdapterRemoval manages to read a bit of the files before producing output that is then written back to the same files, truncating them in the process.
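The effect described above can be reproduced in miniature with plain shell redirection. This is only an illustration of reading from a file and then opening the same path for writing; it is not AdapterRemoval's actual code, and the temporary file, file descriptor number, and variable names are made up for the example.

```shell
# Illustration only: a stand-in for "read a bit, then lazily open the same
# path for writing". Not AdapterRemoval's implementation.
demo_file=$(mktemp)
seq 1 1000 > "$demo_file"            # stand-in for a large input file

exec 3< "$demo_file"                 # open the input for reading
read -r first_line <&3               # consume part of the input

# The output file is opened only now; '>' truncates the shared path.
printf '%s\n' "$first_line" > "$demo_file"

lines_left=$(grep -c '' "$demo_file")   # count lines remaining on disk
if read -r next_line <&3; then more_input=yes; else more_input=no; fi
exec 3<&-                            # close the read descriptor

echo "lines left: $lines_left, more input: $more_input"
rm -f "$demo_file"
```

The reader still holds a valid handle, but the data it expected to read has been removed from underneath it, which is why the failure shows up as a cut-off record rather than as an error when the file is opened.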

You could modify your commands as follows to avoid this problem:

$ AdapterRemoval --file1 ../ERR3003613_1.fastq.gz --file2 ../ERR3003613_2.fastq.gz --basename step1.CS01.pe  \
--adapter1 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC' \
--adapter2 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA' --minadapteroverlap 1 

$ AdapterRemoval --file1 step1.CS01.pe.pair1.truncated --file2 step1.CS01.pe.pair2.truncated --basename step2.CS01.pe \
--qualitymax 41 --trimns --trimqualities --minlength 30 --minquality 20 --collapse
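If you chain several steps in a script, a small guard can catch this kind of collision before anything is overwritten. The sketch below assumes that every output file is named `<basename>.<suffix>`, as the `.pair1.truncated` and `.pair2.truncated` files above suggest; `check_basename` is a hypothetical helper, not part of AdapterRemoval, and it compares paths as plain text only (it does not resolve `./`, symlinks, or relative versus absolute paths).

```shell
# Hypothetical helper: fail if any input path looks like a file that would be
# written for the given basename (assumed naming: "<basename>.<suffix>").
# Usage: check_basename BASENAME INPUT...
check_basename() {
    out_basename=$1
    shift
    for input in "$@"; do
        case "$input" in
            "$out_basename".*)
                echo "ERROR: input '$input' collides with --basename '$out_basename'" >&2
                return 1
                ;;
        esac
    done
    return 0
}

# Example: only report success when inputs and outputs are kept apart.
if check_basename step2.CS01.pe \
        step1.CS01.pe.pair1.truncated step1.CS01.pe.pair2.truncated; then
    echo "basename is safe to use"
fi
```

Calling the helper with the original combination (basename `CS01.pe` and inputs `CS01.pe.pair1.truncated` / `CS01.pe.pair2.truncated`) returns a non-zero status, so a script can stop before the second step runs.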

With that out of the way, I am not clear on your motivation for doing this. Why can you not demultiplex the reads before removing the adapters? If the barcodes are located at the 3' end of the reads, then you cannot use AdapterRemoval to demultiplex them the way you describe. If the barcodes are located at the 5' end, then AdapterRemoval already handles demultiplexing, adapter (and complementary barcode) trimming, and merging in what is, to my knowledge, the correct order.

TCLamnidis commented 3 years ago

Thank you for the clarification! I will retry with a different basename.

The motivation for doing this is linked to https://github.com/MikkelSchubert/adapterremoval/issues/50, dealing with sample-specific barcodes that come AFTER the adapter sequence.