MikkelSchubert / adapterremoval

AdapterRemoval v2 - rapid adapter trimming, identification, and read merging
http://adapterremoval.readthedocs.io/
GNU General Public License v3.0

--pre-trim-polyx or --post-trim-polyx do NOT work in AdapterRemoval v3.0.0-alpha2 #70

Open realzhang opened 1 month ago

realzhang commented 1 month ago

As the title says, I tested --pre-trim-polyx and --post-trim-polyx on my PE read files, but unfortunately polyA still remains in my R2 file. The command I tested:

adapterremoval3 --pre-trim-polyx --threads 40 --in-file1 test.R1.fq.gz --in-file2 test.R2.fq.gz --out-prefix trimmed --trim-ns --trim-qualities --head 1000

A read in trimmed.r2.fq.gz:

@E200025190L1C001R0010032219/1:GATGGACCTG 2:N:0:AACATA
CATCCAGGCCGTGCTGCTGCCCAAGAAGACCGAGAGCCACCACAAGGCCAAGGGAAAATAAGACCAGCCGTTCACTCACCCGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
FFGFFFFFFFFFFFFFFGFFFFFFFFFFFFFDFFFFFFFFFGFGFFFFFGFDFEFFFFGFFFFFFEFFF=FGFFFGGGCFFFFFFFFFFFFGFFFGFGFGFGFFGFFGFFGG

MikkelSchubert commented 1 month ago

Thank you for testing AdapterRemoval and sorry for the trouble.

However, I have to ask you to attach a complete pair of (untrimmed) reads, since I am unfortunately unable to identify the problem from just the one read you attached.

If you could attach or copy/paste the output from the following command then I should be able to take a closer look:

zgrep -m1 -A3 "^@E200025190L1C001R0010032219/1:GATGGACCTG" test.R1.fq.gz test.R2.fq.gz

I also have to ask you to double-check the command you ran/included in your comment, as it does not appear to be a valid AdapterRemoval command: I'm guessing that the │K3.pear.discarded.fastq.gz K9.pear.unassembled.reverse.fastq.gz part is a copy/paste mistake?

Best, Mikkel

realzhang commented 1 month ago

Sorry for miscopying the command from another pane of my terminal; I've corrected it accordingly. Thanks for the detailed instructions. The original read pair is as follows:

zgrep -m1 -A3 "^@E200025190L1C001R0010032219/1:GATGGACCTG" test.R1.fq.gz test.R2.fq.gz
test.R1.fq.gz:@E200025190L1C001R0010032219/1:GATGGACCTG 1:N:0:AACATA
test.R1.fq.gz-GGTGAGTGAACGGCTGGTCTTATTTTCCCTTGGCCTTGTGGTGGCTCTCGGTCTTCTTGGGCAGCAGCACGGCCTGGATGCTGTCTCTTATACACATCTCCG
test.R1.fq.gz-+
test.R1.fq.gz-EDDFC0FFFFFFDFFEFFCFFFFFFFFFFFFFECF?GFFFFEFFDFEEBFEEEDGDFFDF>FF3CFCFFF312FFFF3FFBFEFFFBFGFFFFFBBF?FFF9
test.R2.fq.gz:@E200025190L1C001R0010032219/1:GATGGACCTG 2:N:0:AACATA
test.R2.fq.gz-CATCCAGGCCGTGCTGCTGCCCAAGAAGACCGAGAGCCACCACAAGGCCAAGGGAAAATAAGACCAGCCGTTCACTCACCCGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTCCATCTTTGTTAAGTTGGAACAGCCGGCTGGAG
test.R2.fq.gz-+
test.R2.fq.gz-FFGFFFFFFFFFFFFFFGFFFFFFFFFFFFFDFFFFFFFFFGFGFFFFFGFDFEFFFFGFFFFFFEFFF=FGFFFGGGCFFFFFFFFFFFFGFFFGFGFGFGFFGFFGFFGG(FC)B.F8(>8*@(CF))%78@8/F*-&-/#*&57''*

It seems that the polyA originated from the oligo-dT of the PCR primer, followed by read-through into the Illumina flow cell anchors, which may not fit AR3's definition of a trailing poly-X. Could you please help me clarify this? Thank you!

MikkelSchubert commented 1 month ago

Thank you!

AdapterRemoval looks for trailing stretches of identical bases, so a stretch of identical bases inside a read probably won't get detected. This is in part due to the order of operations, where trimming of low quality bases is done after poly-X trimming:

  1. --pre-trim3p
  2. --pre-trim-polyx
  3. Alignment and adapter trimming
  4. --post-trim3p and --post-trim5p
  5. --post-trim-polyx
  6. Trimming of low quality bases

See https://adapterremoval.readthedocs.io/en/v3.0.0-alpha2/detailed_overview.html#read-processing

In this particular case, the stretch of As would probably have been trimmed if steps 5 and 6 were swapped, since the last 38 bases are being trimmed by the mott filter[1], which then exposes the higher quality poly-X tail. However, normally you'd expect the tail to be at the end of the read, so trimming tails first seemed more reasonable to me when I added it, though I haven't benchmarked that.
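
To illustrate the effect of the ordering, here is a minimal Python sketch. It is not AdapterRemoval's actual code: `trim_trailing_polyx` and its `min_length` default of 10 are hypothetical, and quality trimming is simulated by simply cutting the 38 bases mentioned above. The sequence is the R2 read from the report.

```python
def trim_trailing_polyx(seq: str, min_length: int = 10) -> str:
    """Remove a trailing run of one repeated base if it is at least min_length long.

    Hypothetical helper for illustration only; exact matching, no mismatches allowed.
    """
    if not seq:
        return seq
    last = seq[-1]
    run = len(seq) - len(seq.rstrip(last))
    return seq[:-run] if run >= min_length else seq


# R2 sequence copied from the report above.
R2 = (
    "CATCCAGGCCGTGCTGCTGCCCAAGAAGACCGAGAGCCACCACAAGGCCAAGGGAAAATAAGACCAGCCGTTCACTCACCCG"
    "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
    "GTCCATCTTTGTTAAGTTGGAACAGCCGGCTGGAG"
)
LOW_QUALITY_TAIL = 38  # number of 3' bases removed by quality trimming in this case

# Poly-X first, then quality trimming: the untrimmed read ends in ...GGCTGGAG,
# so no tail is found, and the poly-A run is only exposed afterwards.
polyx_first = trim_trailing_polyx(R2)[:-LOW_QUALITY_TAIL]

# Quality trimming first, then poly-X: the poly-A run is at the 3' end and is removed.
quality_first = trim_trailing_polyx(R2[:-LOW_QUALITY_TAIL])

print(polyx_first[-20:])    # still ends in a run of As
print(quality_first[-20:])  # ends in ...CACCCG
```

With the current order the read keeps its poly-A tail, as seen in the reported output; with the two steps swapped the tail would be removed.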

In other words, I'm not sure what the best solution would be.

Maybe instead of trying to trim reads like this, it would be better to support filtering of reads with poly-X sequences longer than some threshold? That can help eliminate "weird" reads like this. Another option I've considered is to make it possible to specify the order of operations, but you'd have to actually compare the output with different orders of operations to determine what is the optimal for a given set of data, so it wouldn't be very user-friendly.
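
As a rough sketch of what such a filter could look like (the function names and the `max_run` threshold of 20 are assumptions for illustration, not an existing AdapterRemoval option):

```python
from itertools import groupby


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of a single repeated base anywhere in seq."""
    return max((sum(1 for _ in run) for _, run in groupby(seq)), default=0)


def keep_read(seq: str, max_run: int = 20) -> bool:
    """Hypothetical filter: keep a read only if no homopolymer run exceeds max_run."""
    return longest_homopolymer(seq) <= max_run


print(keep_read("ACGTACGTACGT"))               # short runs only, read is kept
print(keep_read("ACGT" + "A" * 30 + "GTCCA"))  # internal poly-A run, read is dropped
```

For paired-end data one would presumably drop the whole pair if either mate fails the check, so that the two output files stay in sync.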

[1] I can see that you use both --trim-ns --trim-qualities, which are the "old" trimming options. AdapterRemoval3 now uses a more aggressive mott trimming by default, but you can enable the old method instead via --quality-trimming per-base.
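
For readers unfamiliar with Mott-style trimming, the general idea can be sketched as below. This is a generic version of the technique, assuming Phred+33 qualities and a hypothetical `max_error_rate` parameter; it is not taken from the AdapterRemoval source and may differ from it in details. Each base scores `max_error_rate - P(error)`, and the segment with the highest running total is kept.

```python
def mott_trim(qual: str, max_error_rate: float = 0.05, offset: int = 33) -> tuple[int, int]:
    """Return (start, end) of the highest-scoring segment of a quality string.

    Generic Mott-style sketch: bases with an error probability below
    max_error_rate add to the running score, worse bases subtract from it.
    """
    best_start = best_end = 0
    best_score = score = 0.0
    start = 0
    for i, char in enumerate(qual):
        p_error = 10 ** (-(ord(char) - offset) / 10)
        score += max_error_rate - p_error
        if score > best_score:
            best_score, best_start, best_end = score, start, i + 1
        if score < 0:
            # The running segment is no longer worth extending; restart after this base.
            score, start = 0.0, i + 1
    return best_start, best_end


# Ten high-quality bases ('I' = Q40) followed by four poor ones ('#' = Q2).
start, end = mott_trim("IIIIIIIIII####")
print(start, end)
```

Because a low-quality stretch lowers the running score, everything after it is cut unless the following bases are good enough to make up for it, which is why a noisy 3' end like the one in this read is removed as a block.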

Best, Mikkel

realzhang commented 1 month ago

Thank you for your response. Based on my understanding, a certain trim step in AdapterRemoval3 exposes the intermediate polyX at the tail and subsequently identifies and trims it off. However, determining the appropriate timing for trimming is not easy.

I would like to point out that it would be beneficial to adopt a strategy similar to what trim-galore uses, where, in addition to the Illumina adapters, we can add custom polyX-like adapters (such as A{10}). This way, the continuous polyA and any following sequences would be trimmed off.
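
A minimal sketch of the behaviour I have in mind, written in Python with a plain regular expression. The function name and the `min_length` default are made up for illustration; real adapter trimmers tolerate mismatches, which this exact-match version does not.

```python
import re


def trim_from_polya(seq: str, qual: str, min_length: int = 10) -> tuple[str, str]:
    """Cut a read at the first run of at least min_length As, dropping the run
    and everything after it. Hypothetical helper, exact matching only."""
    match = re.search("A{%d,}" % min_length, seq)
    if match is None:
        return seq, qual
    return seq[:match.start()], qual[:match.start()]


seq = "TTCACTCACCCG" + "A" * 33 + "GTCCATCTTTG"
qual = "F" * len(seq)
trimmed_seq, trimmed_qual = trim_from_polya(seq, qual)
print(trimmed_seq)  # TTCACTCACCCG
```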

I feel that specifying custom adapter sequences in AdapterRemoval3 is not particularly convenient at the moment. Ideally, in addition to automatically detected adapters, AdapterRemoval3 could also accept custom adapter sequences, so we don't have to deal with the numerous Illumina adapter sequences ourselves.

Thank you very much!

MikkelSchubert commented 1 month ago

Unfortunately AdapterRemoval is currently designed/optimized for trimming adapters that appear at fixed locations, and (optimally) involves just a single adapter sequence or pair of adapter sequences. Properly handling sequences like custom poly-X sequences that can appear anywhere in reads will require some thought, and possibly a completely different alignment algorithm to be performant. I believe that trim-galore/cutadapt uses hashing rather than pairwise alignments like AdapterRemoval. So it probably won't make it into AdapterRemoval 3.0, but I'll try to see what I can do in a future release.

I do intend to make it possible for AdapterRemoval 3.0 to automatically detect known/published adapters when trimming, so that (most of the time) you won't have to specify them manually. But this also requires some care, since detection is fallible and it could easily lead to adapter contamination in the output.

But thank you for the feedback! It's very helpful to get another perspective, so do let me know if there are other things that you think could be improved.