MikkelSchubert / adapterremoval

AdapterRemoval v2 - rapid adapter trimming, identification, and read merging
http://adapterremoval.readthedocs.io/
GNU General Public License v3.0

Feature request: demultiplexing in both forward and reverse directions #68

Open krdav opened 3 weeks ago

krdav commented 3 weeks ago

Hello and thanks for a great piece of software. I am looking forward to the v3 update; so far it looks very promising.

Currently, I am using the demultiplexing feature a lot, but in one instance I ran into the issue that the sequencing library was prepared using a ligation method which does not preserve the directionality of the barcodes. I.e. samples are barcoded with known barcode sequences, but then adapter ligation causes a reversal of the direction in ~50% of the cases.

I would like AdapterRemoval to look for barcodes in both directions and demultiplex them into the same file, either by allowing multiple barcodes for a single filename, e.g.:

cat barcodes.txt
sample_1 ATGCGGA TGAATCT
sample_1 AGATTCA TCCGCAT
sample_2 ATGGATT ATAGTGA
sample_2 TCACTAT AATCCAT
sample_7 CAAAACT TCGCTGC
sample_7 GCAGCGA AGTTTTG

Which currently raises this reasonable exception: Duplicate sample name 'sample_1'; combining different barcodes for one sample is not supported. Please ensure that all sample names are unique!

Or alternatively, by adding a --reverse option that looks for barcodes in both orientations.

Of course, all of this can be achieved by specifying these reverse barcodes with unique barcode names and post-processing to merge, but having it integrated into the demultiplexing step would be preferred.
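A minimal sketch of that workaround, assuming the barcode table convention used in the example above (the reverse entry is the reverse complement of barcode 2 followed by the reverse complement of barcode 1). The `_fwd` and `_rev` suffixes and the function names are hypothetical and not part of AdapterRemoval; merging the resulting output files is left as a separate step.

```python
# Sketch: expand a barcode table so each sample also gets a reverse-orientation
# entry under a unique (hypothetical) name, suitable for a table that requires
# unique sample names.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def expand_barcodes(rows):
    """rows: iterable of (sample_name, barcode1, barcode2) tuples."""
    expanded = []
    for name, bc1, bc2 in rows:
        expanded.append((f"{name}_fwd", bc1, bc2))
        # Reversed template: barcode2' comes first, then barcode1'.
        expanded.append(
            (f"{name}_rev", reverse_complement(bc2), reverse_complement(bc1))
        )
    return expanded


rows = [
    ("sample_1", "ATGCGGA", "TGAATCT"),
    ("sample_2", "ATGGATT", "ATAGTGA"),
    ("sample_7", "CAAAACT", "TCGCTGC"),
]

for name, bc1, bc2 in expand_barcodes(rows):
    print(name, bc1, bc2)
```

The `_rev` rows produced this way match the second row listed for each sample in the table above.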

MikkelSchubert commented 3 weeks ago

Thank you for the kind words.

I'm not opposed to adding something like an --allow-duplicate-samples option for the case where you, for whatever reason, have tagged the same sample with multiple barcodes. That will catch mistakes in the common case, while also enabling you to deal with your data in a convenient manner.

It does add a bit of complexity, since reads need to be trimmed with an adapter corresponding to the barcode that was identified (among other things), but I have some tentative ideas for how I can implement that without it becoming a huge mess. I'll give it a try in the v3 branch, since that's where I'm currently focusing most of my attention and since I can afford to make a mess there, and then back-port it to v2 once I've gotten it working.
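The opt-in behaviour described here could look roughly like the following sketch. It is an illustration only: `read_barcode_table` and its `allow_duplicates` parameter are hypothetical names, and this is not how AdapterRemoval itself parses barcode tables.

```python
def read_barcode_table(lines, allow_duplicates=False):
    """Group barcode pairs by sample name.

    Duplicate sample names are rejected unless allow_duplicates is set, so
    accidental duplicates are still caught in the default case.
    """
    samples = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name, barcodes = fields[0], tuple(fields[1:])
        if name in samples and not allow_duplicates:
            raise ValueError(f"Duplicate sample name '{name}'")
        samples.setdefault(name, []).append(barcodes)
    return samples


table = [
    "sample_1 ATGCGGA TGAATCT",
    "sample_1 AGATTCA TCCGCAT",
]

# With the opt-in flag, both barcode pairs are stored under the same sample.
print(read_barcode_table(table, allow_duplicates=True))
```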

Best, Mikkel

krdav commented 3 weeks ago

Related to https://github.com/MikkelSchubert/adapterremoval/issues/56

krdav commented 3 weeks ago

Thanks, Mikkel!

Just to be clear: mixed directionality is not that uncommon. It happens in cases where the Illumina adapters are added using blunt-end or TA ligation.

An --allow-duplicate-samples option would solve my issue for sure, but if it is primarily going to be used for this kind of mixed-direction reads, then solving that problem in particular might be more powerful. From the user's perspective it gets easier because only one set of barcodes has to be added (the reverse complement being inferred by AR). From an implementation point of view it might also be easier, as only two scenarios exist per barcode(-pair): 1) forward direction and 2) forward and reverse direction. I am also thinking that it would be easier to write a test for this case.

On the flip side, the --allow-duplicate-samples option would be more generalizable.

Regardless, I am looking forward to testing what you come up with.

MikkelSchubert commented 3 weeks ago

Hi Kristian,

Thank you for the additional context. The main difficulty is just handling multiple barcodes per sample; once that is implemented, it would be simple enough to support automatically adding reversed barcodes as well.

There is a minor complication, though, that you might have a suggestion for how to handle: in normal operation AR will generate and use adapter sequences with the barcodes included, so that the reads are trimmed/merged correctly. However, with --demultiplex-only and samples with multiple barcodes, that information is needed downstream but is not readily available. I guess I could append the barcode belonging to the mate read (which will be located at the 3' end of the read itself) to the FASTQ header as metadata, or use a tag or read group in SAM/BAM output, so that the information is available if needed. What do you think?

Best, Mikkel

krdav commented 3 weeks ago

Well, it is hard for me to provide any useful recommendations as I haven't looked much at the source code. However, I think I understand your problem, and I cannot think of any better idea than storing the barcode information as read metadata: for FASTQ, appending it to the header, and for SAM, using a suitable tag. There is actually already a barcode tag (BC-tag): http://samtools.github.io/hts-specs/SAMtags.pdf
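As a rough illustration of the FASTQ variant of this idea, the sketch below appends a SAM-style `BC:Z:` field to a read header. The function name, the choice to join two barcodes with a hyphen, and the exact header layout are assumptions made for the example; this is not the format AdapterRemoval writes.

```python
def annotate_header(header, barcode1, barcode2=None):
    """Append the identified barcode(s) to a FASTQ header as a BC:Z: field.

    Joining two barcodes with a hyphen is an assumption of this sketch.
    """
    barcodes = barcode1 if barcode2 is None else f"{barcode1}-{barcode2}"
    return f"{header.rstrip()} BC:Z:{barcodes}"


# Hypothetical read name, barcodes taken from the table earlier in the thread.
print(annotate_header("@read_1/1", "ATGCGGA", "TGAATCT"))
```

Writing the field in SAM tag syntax means it could be carried over into SAM/BAM output later without reparsing, although that depends on the downstream tools.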

MikkelSchubert commented 4 days ago

The master branch now features support for a --multiple-barcodes option to allow multiple barcodes for the same sample and a --reversible-barcodes option to match both barcode1-insert-barcode2 and barcode2'-insert-barcode1', where ' indicates reverse complement. If you have the time, then I'd appreciate it if you could give it a try.
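The two layouts can be illustrated with a small sketch that treats the whole template as a single string, which simplifies how paired reads are actually handled. All names here are hypothetical, matching is exact (no mismatches allowed), and this is not AdapterRemoval's implementation.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def classify_orientation(template, barcode1, barcode2):
    """Return 'forward' for barcode1-insert-barcode2, 'reverse' for
    barcode2'-insert-barcode1' (' = reverse complement), or None."""
    if template.startswith(barcode1) and template.endswith(barcode2):
        return "forward"
    rc1 = reverse_complement(barcode1)
    rc2 = reverse_complement(barcode2)
    if template.startswith(rc2) and template.endswith(rc1):
        return "reverse"
    return None


# Barcodes from the table earlier in the thread; the insert is made up.
bc1, bc2 = "ATGCGGA", "TGAATCT"
forward = bc1 + "GGGTTTAAACCC" + bc2
flipped = reverse_complement(forward)

print(classify_orientation(forward, bc1, bc2))
print(classify_orientation(flipped, bc1, bc2))
```

Reverse-complementing a barcode1-insert-barcode2 template gives exactly the barcode2'-insert'-barcode1' layout, which is why only one set of barcodes needs to be supplied.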

Once I've tested it a bit more, I'll try to back-port it to 2.x. That is unfortunately a bit more work than I had hoped, because lots of code operates under the assumption that samples only have one barcode (pair), but it should be doable.

The command-line argument names are also not set in stone, in case you have any suggestions for something that more clearly communicates the intent.

krdav commented 1 day ago

Wow. This sounds promising. I compiled the most recent version and will test it soon, then I will report back. Thanks for doing this!