MikkelSchubert / adapterremoval

AdapterRemoval v2 - rapid adapter trimming, identification, and read merging
http://adapterremoval.readthedocs.io/
GNU General Public License v3.0

Internal barcodes #50

Open apeltzer opened 3 years ago

apeltzer commented 3 years ago

Hi @MikkelSchubert !

Hope you're doing well. We're currently discussing whether (and how) AR2 is able to remove internal barcodes. Is this currently supported (I think not?), and would it be something that could be added in a new release at some point?

Cross-reference to the issue where we started discussing this: https://github.com/nf-core/eager/issues/632

MikkelSchubert commented 3 years ago

Hi,

I'm afraid that I'll have to ask you to clarify what you mean by internal barcodes in this context, as I am a bit rusty on the terminology.

Cheers

apeltzer commented 3 years ago

Hi Mikkel!

I've asked the requester(s) to provide some insights for this :-)

jfy133 commented 3 years ago

Hi Mikkel,

To be able to measure barcode hopping on some machines, people have started ligating very short (~6-7bp) 'barcodes' directly onto the extracted DNA molecules, prior to adapter+index ligation.


Figure 1 of https://www.biorxiv.org/content/10.1101/179028v3.full.pdf

So in principle, this request would involve 1) the initial removal of adapters, and 2) a new second pass of removal, to remove a second user-specified sequence.

As far as I know people typically only use a single barcode per sample out of a pool of maybe 12 barcodes. I guess if a user specifies these as a list (like with --adapter-list), this would be sufficient.

I guess in principle one could use the --identify-adapters functionality, but this doesn't actually do the trimming. Also, the user should already know the actual barcode, so for 'precision' it would make sense that they can define it explicitly.

Let me know if this is not clear...

Edit: to clarify, as the barcodes are sample-specific, you would have to allow the user to specify this as a list of possible barcodes in pipeline contexts (such as eager).

MikkelSchubert commented 3 years ago

Thank you for the detailed explanation!

Unless I am misunderstanding something, barcodes of this type are already supported via the demultiplexing functionality. This is enabled when the user provides a table of sample names and barcodes with the --barcode-list option, such as this one:

sample_1 ATGCGGA TGAATCT
sample_2 ATGGATT ATAGTGA
sample_7 CAAAACT TCGCTGC

The first column is used in output filenames, the second specifies the P7 barcode, and the third (optional) column specifies the P5 barcode. AdapterRemoval uses the barcodes to map reads/read pairs to samples, at which point the barcodes are removed from the 5' end of each read. After that, adapter trimming is carried out using per-sample query sequences generated by merging the opposing barcode with the adapter sequence, so that both are trimmed from the reads.
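As a rough illustration of the steps described above, here is a minimal Python sketch. It is not AdapterRemoval's implementation and rests on several assumptions: barcodes are matched exactly (no mismatches), the second column is the barcode found at the 5' end of read 1 and the third column the one at the 5' end of read 2, the opposing barcode shows up reverse-complemented directly ahead of the adapter, and `ADAPTER1`/`ADAPTER2` are placeholder sequences. All function names are hypothetical.

```python
# Illustrative sketch only; NOT AdapterRemoval's actual implementation.

# Placeholder adapter sequences (assumption); substitute the real ones
# used for your library preparation.
ADAPTER1 = "AGATCGGAAGAGC"
ADAPTER2 = "AGATCGGAAGAGC"

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def parse_barcode_table(text):
    """Parse rows of 'name barcode1 [barcode2]' into {name: (bc1, bc2)}."""
    samples = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) not in (2, 3):
            raise ValueError(f"expected 2 or 3 columns: {line!r}")
        name, barcode1 = fields[0], fields[1]
        # The second barcode is optional; an empty string matches any read.
        barcode2 = fields[2] if len(fields) == 3 else ""
        samples[name] = (barcode1, barcode2)
    return samples


def demultiplex_pair(read1, read2, samples):
    """Assign a read pair to a sample and strip the 5' barcodes.

    Uses exact prefix matching (assumption). Returns (None, read1, read2)
    unchanged if no sample matches.
    """
    for name, (barcode1, barcode2) in samples.items():
        if read1.startswith(barcode1) and read2.startswith(barcode2):
            return name, read1[len(barcode1):], read2[len(barcode2):]
    return None, read1, read2


def trimming_queries(barcode1, barcode2):
    """Build per-sample query sequences: opposing barcode merged with adapter.

    Assumes that read 1 runs into the reverse complement of the read 2
    barcode before reaching the adapter, and vice versa.
    """
    return (reverse_complement(barcode2) + ADAPTER1,
            reverse_complement(barcode1) + ADAPTER2)


TABLE = """\
sample_1 ATGCGGA TGAATCT
sample_2 ATGGATT ATAGTGA
sample_7 CAAAACT TCGCTGC
"""

samples = parse_barcode_table(TABLE)
name, trimmed1, trimmed2 = demultiplex_pair(
    "ATGCGGA" + "ACGTACGT", "TGAATCT" + "GGCCTTAA", samples
)
print(name, trimmed1, trimmed2)
print(trimming_queries(*samples[name]))
```

The query sequences returned by `trimming_queries` are what a 3' trimmer would then search for, so that barcode and adapter are removed together in a single pass.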

There's a small example in the examples folder that you can run with

AdapterRemoval --file1 demux_1.fq --file2 demux_2.fq --basename output_demux --barcode-list barcodes.txt

It is also possible to just do the demultiplexing, if you want to do adapter trimming with a different trimmer. The combined barcode+adapter sequences are listed in the resulting settings files for each sample.

If I recall correctly, then it is currently possible to demultiplex using P7 barcodes or using P7 + P5 barcodes, but not P5 barcodes by themselves.

See here for more information: https://adapterremoval.readthedocs.io/en/latest/examples.html#demultiplexing-and-adapter-trimming

jfy133 commented 3 years ago

Hi @MikkelSchubert ,

Ok, that does indeed sound like a possibility! I will investigate and see if we can get it to work as expected by the people who requested it, otherwise we will come back to you.

Cheers,