asrivathsan / ONTbarcoder


get original read ids for demultiplexed reads #7

Closed chrishah closed 1 year ago

chrishah commented 1 year ago

Hi,

First of all, I want to say: Thanks for developing ONTbarcoder! It works really well for most of our applications!

This post is more of a feature request than an actual issue. Is it possible to retrieve the original sequence IDs for the reads that end up in the demultiplexed bins? Or is there a system behind how the sequence IDs in the demultiplexed files are created that I could use to get back to the original IDs?

The thing is that the demultiplexing of ONTbarcoder works really well, but I have some datasets where the amplicon is not part of a coding sequence, so ONTbarcoder's correction doesn't work, or at least it doesn't accept any of the final consensus sequences as reference level barcodes. I understand that this is expected, since it was developed for protein-coding sequences. It would just be great to get the binned reads, including the quality scores, so I can forward them to another tool.

Thanks!

Best wishes, Christoph

asrivathsan commented 1 year ago

Hi Christoph

Thanks for your interest and feedback. Yes, the pipeline loses the quality score information. As for how to get it back: the demultiplexed sequence IDs are numbered in the same order as the reads in the input FASTQ file, as follows:

>1_X_X_X
ATGC...
>2_X_X_X
ATGC...

Therefore lines 1, 2, 3 and 4 of the FASTQ file correspond to sequence ID 1, and lines 5, 6, 7 and 8 correspond to sequence ID 2.

Likewise, sequence ID >2279145_1_0_1 corresponds to lines 2279145*4-3, 2279145*4-2, 2279145*4-1 and 2279145*4.

The tricky bit is ligated products: if there is "p2" or "p1" in the sequence ID, it's a ligated product, demultiplexed from the same sequence as another product. In that case one would need to find the range corresponding to each product to get its quality scores.
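
For versions that do not keep the original IDs, the line arithmetic described above could be sketched in Python roughly as follows. This is only an illustration, not ONTbarcoder code: the function names are made up, it assumes the number before the first underscore is the 1-based position of the read in the input FASTQ, and it assumes every FASTQ record spans exactly four lines (no wrapped sequences). For ligated products it only flags the ID and returns the whole read, since the thread does not say how to find the range of each product.

```python
import io

def read_index(demux_id):
    # Assumption: the integer before the first "_" is the 1-based read position.
    return int(demux_id.lstrip(">").split("_")[0])

def fastq_lines(n):
    # First and last 1-based line of read n, i.e. n*4-3 .. n*4.
    return 4 * n - 3, 4 * n

def is_ligated(demux_id):
    # Per the thread, "p1" or "p2" in the ID marks a ligated product.
    return "p1" in demux_id or "p2" in demux_id

def fetch_records(fastq_handle, demux_ids):
    """Map each demultiplexed ID to its (header, seq, plus, qual) lines."""
    wanted = {}
    for demux_id in demux_ids:
        wanted.setdefault(read_index(demux_id), []).append(demux_id)
    found = {}
    n = 0
    while True:
        record = [fastq_handle.readline().rstrip("\n") for _ in range(4)]
        if not record[0]:  # end of file
            break
        n += 1
        for demux_id in wanted.get(n, []):
            found[demux_id] = tuple(record)
    return found

# Tiny in-memory FASTQ standing in for the real input file.
demo = io.StringIO(
    "@readA\nACGT\n+\nIIII\n"
    "@readB\nTTGA\n+\n!!!!\n"
)
hits = fetch_records(demo, [">2_1_0_1"])
print(hits[">2_1_0_1"][0])  # original header of the second read
```

With a real file, `demo` would be replaced by an open handle on the input FASTQ, and the IDs would come from the headers of one demultiplexed bin. IDs for which `is_ligated` returns true still need their quality string trimmed to the product's range before use.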

I can give you code for it, or compile a version that keeps the original IDs, but my schedule is a little packed atm, so it may need a bit of time.

cheers Amrita

asrivathsan commented 1 year ago

The latest version retains the original read IDs