Combining --identify-adapters with trimming

nickjhathaway commented 6 years ago

Hi, I've really enjoyed using your tool, very well written and well documented. I was wondering if there is any way to or if there is a plan to add a way to combine --identify-adapters with what adapters are used in trimming. I'm working on a large dataset (>3,000 samples) which spans over several years and different labs and there are several different sets of adaptors used. I've written a wrapper that runs --identify-adapters first and then uses the identified adaptors if they differ from the defaults when running the trimming step. I didn't see an option for this so I wanted to check.

Thanks! Nick

MikkelSchubert commented 6 years ago

Hi Nick, I am glad to hear that you've enjoyed using AdapterRemoval.

I've been thinking a bit about your question, and I am not sure that it makes sense to (fully) automate this process.

The consensus adapter sequences constructed by AR are based on the identification of fully overlapping reads, and hence the quality of the consensus sequences greatly depends hence on the size distribution of DNA fragments. For libraries with few short DNA fragments, the consensus sequence is likely to consist largely of false positives (reads starting with palindromes). Because of this, my use of this tool has been to run it on a data set, review the output (consensus sequence and the quality of individual bases, and the most common kmers), and then determine the corresponding adapter from the literature.

I realize of course that this is not practical for more than 3000 samples, so one solution could be to aggregate the output of running --identify-adapters on all samples, identify the common sequences, and then assign the best candidate adapters to each dataset based on (for example) edit distance. I could possibly add an option to AR to output the results in a more easily parseable format, if that would help.

A couple of things to keep in mind in this regard: Firstly, you do not need to use the full consensus sequence and in fact probably should not, since quality tends to decline further from the 5'. AR reports quality scores for each base in the consensus using Phred scores (-10 * log10(1 - N_consensus / N_total)), so you could set a threshold and only consider the high quality part of the consensus. I would also not worry about identifying the exact adapter in all cases. For example, the common 13bp prefix for both forward and reverse Illumina adapters (AGATCGGAAGAGC) is likely to show up a lot, and should be sufficient for efficient trimming. Finally, for data with very few fully overlapping reads and a low quality consensus (as reported by --identify-adapters), it may not even be worthwhile to trim for adapters.

Cheers, Mikkel

nickjhathaway commented 6 years ago

Hi Mikkel, Thanks for your response! So I actually did end up creating an automation script for this. I had to add in a custom program to identify the adapters because I'm working with Plasmodium falciparum and its genome is ~25% tandem repeat which can fairly easily overwhelm the number of sequences contributing to the adapter consensus building since it's easy to have overlap alignment even though there wasn't read through when the fragment is mostly tandem repeat. So I wrote something that filters out alignments due to tandem repeat in reads and then built consensus from only similar overhangs. Next, I make sure it matches a set of supplied primers and then start AdaptorRemoval with those adapters. This worked for ~85% of my samples with the other ones mostly failing, as you pointed out, due to a low amount of fully overlapping sequences due to fragment size and therefore a good consensus couldn't be built in which in case I just ran with default adapters since I like the quality trimming and N trimming that can also be done.

Also, I was wondering about the consensus built from the --collapse option and what is done with the output quality score for a mismatch in the alignment, do you take the quality of the higher quality base. I've seen other stitchers modify the output quality so I was just wondering what AdapterRemoval does.

Lastly, I was wondering what happens when a pair has overhang but doesn't match the supplied adapters, is the pair left as it?

Thanks for your time in answering my questions!

Best, Nick

MikkelSchubert commented 6 years ago

Hi Nick, That sounds like a pretty good solution.

Regarding consensus building, the resulting quality is roughly equal to the sum of Phred encoded qualities (for matches) and roughly the higher quality minus the lower quality (for mismatches). In the case of mismatches, the higher quality base is chosen. If there the quality scores are identical, then AdapterRemoval will currently pick a base at random and that base will be assigned a low quality. This is where the --seed option is used for RNG. However, the current development version also offers a --deterministic flag, in which case the base is set to N if mismatching bases have identical quality.

Regarding overhangs, I'll have to explain how AdapterRemoval aligns reads. If we represent adapter 1 using 'X' and adapter 2 using 'Y', and if we represent mate 1 and mate 2 using 'A' and 'B', and if we represent reverse complemented reads using lowercase, then the following alignment is carried out for each pair of reads:

yyyyyyyyyyyAAAAAAAAAAAAAAA
        bbbbbbbbbbbbbbbXXXXXXXXXXXXXXX

In other words, the mate 2 read is reverse complemented, and the adapter 1 sequence is appended to this sequence, while the reverse complement of the adapter 2 sequence is prefixed to the mate 1 sequence. The best pairwise alignment between these two sequences are then determined, which allows the position of the insert sequence to be identified from the 5' of each read:

yyyyyyyyyyyAAAAAAAAAAAAAAA
        bbbbbbbbbbbbbbbXXXXXXXXXXXXXXX
           |<-------->|

For each pair of reads, AdapterRemoval will select the best alignment (if any), and then trim (or not trim) the reads based on this alignment (if any), with a maximum number of mismatches allowed based on the --mm option. The overhangs in this alignment are ignored, but parts of the reads aligning to the adapter sequences are scored as part of the overall alignment. An alignment where the adapter in the read does not match the user-supplied adapter can therefore still be considered a valid alignment, provided that (1) the score is good enough and that (b) the total number of mismatches do not exceed the max determined by --mm. If alignment is not good enough (for either reason), then no overhangs can be identified.

The --identify-adapters performs the same alignment, but without the inclusion of user-supplied adapters, and then collects the overhanging sequences to reconstruct the adapter 1 and 2 sequences:

   AAAAAAAAAAAAAAA
bbbbbbbbbbbbbbb
|y|            |X|

Best, Mikkel

MikkelSchubert commented 6 years ago

I'm closing this issue as I believe that everything has been addressed. Feel free to open another issue if you have any other questions or run any problems.

Best, Mikkel

MikkelSchubert / adapterremoval

Combining --identify-adapters with trimming #22