chapmanb / bcbb

Incubator for useful bioinformatics code, primarily in Python and R
http://bcbio.wordpress.com
604 stars 243 forks source link

Trailing Illumina 'A' and demultiplexing #46

Closed b97pla closed 12 years ago

b97pla commented 12 years ago

Hi Brad,

We are seeing some issues with unexpectedly many reads ending up in the 'unmatched' category after demultiplexing. After digging around a little, we think that this may be related to the trailing 'A' that the Illumina machines add after the barcode.

More specifically, we allow one mismatch and no indels for the demuxing. It seems that the reads that are unexpectedly classified as unmatched have one mismatch in the actual 6-nucleotide barcode and are, in addition, having the trailing 'A' nucleotide miscalled.

Reading the code, it does indeed seem that for Illumina reads, the last 7 nucleotides, including the trailing 'A', of each read are matched when demultiplexing. Can you confirm that this is the case?

Our preference is to match just the 6-mer index sequence, excluding the last nucleotide in the read and it would be nice to have this done by default for Illumina reads, or at least be able to influence this behavior with a configuration option. What do you think?

Thanks /Pontus

chapmanb commented 12 years ago

Pontus; Thanks for bringing this up. The extra A base on Illumina barcodes is the source of so many problems. Your understanding of how this works is exactly right; we consider the entire barcode including this additional A.

It seems like the best way to handle this would be to have an option to skip bases at the end when generating the barcode ends for comparison. This would at least keep the barcode trimming code general, and then we could ignore the last base for Illumina in the pipeline code.

Does that sound like it will work? Brad

b97pla commented 12 years ago

Yes, this is what I had in mind. I have extended your code so that the barcode_sort_trim.py script accepts an offset parameter that specifies the number of bases to skip at the end (or beginning if 5'-located barcodes). I also modified demultiplex.py to be able to trim off extra A nucleotides from the barcodes. I'd be happy to send you a patchfile if you're interested, I just need to do a little bit of testing first to make sure that it works as intended.

Cheers /Pontus

chapmanb commented 12 years ago

Pontus; That sounds perfect. Thank you for taking a look at this. I'll be happy to get that patch in; send a pull request when you are ready. Thanks again, Brad

b97pla commented 12 years ago

Hi Brad,

Our forks are quite a lot out of sync at the moment so I created diff files for demultiplex.py and barcode_sort_trim.py instead. I hope that is alright with you. The diffs are available here: git://gist.github.com/1393934.git, and they patched fine using "patch -p1 < diff_file". Please let me know if you have any problems or comments.

Thanks /Pontus

chapmanb commented 12 years ago

Pontus; Thanks very much for this. Everything looks great with the barcode_sort_trim.py patch and I checked that in. For demultiplex.py, I'm not sure I understand the intention of the changes in adjust_illumina_tags and add_multiplex_across_lanes. Shouldn't we just be able to set bc_offset to 1 if bc_illumina_no_trailing is True and we need to add the additional A? Could you explain all the other changes that are required? These seem to adjust the logic quite a bit. Thanks again.

b97pla commented 12 years ago

Hi Brad,

Thanks for this! Regarding your question about adjust_illumina_tags, I agree that your proposal seems the most straightforward solution. The reason for the more complex logic is for the case where the barcodes specified in run_info.yaml already have a trailing A that should be trimmed before writing to the tag file in order for it to be ignored. This is relevant for us since we add that trailing A to the barcodes upstream of the pipeline but would still like to ignore it when demultiplexing. However, I understand and agree with your objection since this situation is perhaps not the typical use case but a complication that we've created for ourselves and should address in our upstream processing rather than cluttering the logic of the pipeline. It's also a good point that the bc_offset should only be set once it has been verified that it can be applied.

Regarding the change in add_multiplex_across_lanes, I think that was a piece of v0.2 code that shouldn't actually be there.

Apologies for the confusion, I'll clean up demultiplex.py according to your suggestion and send you a new patch.

Thanks!

b97pla commented 12 years ago

There's an updated patch file for demultiplex.py here: git://gist.github.com/1408680.git. Please let me know if you have any issues with it.

Thanks! /Pontus

chapmanb commented 12 years ago

Pontus; Perfect, all merged in. Thanks very much for this.

chapmanb commented 12 years ago

@b97pla -- we came up with another option to solve this; allow mismatch bases in the barcode specification:

https://github.com/chapmanb/bcbb/commit/e33e5f0dfe980cd532e94df6fd5725fee8050f6f

This might be cleaner to support allowed mismatching on the final A in Illumina barcodes. We could add an N instead of the explicit A. What do you think?

b97pla commented 12 years ago

Yes, I quite like this approach. It is cleaner and it's nice because it also allows for masking arbitrary positions in the barcode that should be disregarded, e.g. because they correspond to a bad cycle. I would be happy to use this.

chapmanb commented 12 years ago

Perfect, I checked this in:

https://github.com/chapmanb/bcbb/commit/26dea3cf741b2b90312147f6896e2068337f5d5c

thanks for all the suggestions and feedback. This is a really nice improvement.