DavidsonGroup / flexiplex

The Flexible Demultiplexer
https://davidsongroup.github.io/flexiplex/
MIT License

reads with two barcode and one umi #41

Closed ashokpatowary closed 3 months ago

ashokpatowary commented 3 months ago

Would it be possible to use flexiplex to demultiplex reads with the following pattern, where we have two different barcode files? We want the output to be compatible with FLAMES.

-x [flank1] -b ????????? -x [flank2] -u ???????? -b ?????????? -x TTTTTTTTTTTTT -f 8 -e 2

Thanks

nadiadavidson commented 3 months ago

Hi, for these instances I usually run flexiplex twice, piping the output from one barcode search into the input of another, e.g.:

flexiplex -x [flank1] -b ????????? -x [half of flank2] | flexiplex -x [other half of flank2] -u ???????? -b ?????????? -x TTTTTTTTTTTTT > result.fastq

Then the read IDs will look like "barcode1_UMI#barcode2_UMI#original_read_ID", which may not be compatible with FLAMES, but could be converted to something that is, using standard bash tools like sed and cut. @ChangqingW may be able to comment on what would be compatible with FLAMES.

Cheers, Nadia.

ChangqingW commented 3 months ago

Yeah, I think Nadia's suggestion is probably the most practical solution now with flexiplex. A significant number of chimeric reads might complicate things though: you could get more than 2 reads at the end from 1 chimeric read. Your protocol looks very interesting, is this published yet?

nadiadavidson commented 3 months ago

@ChangqingW What is the read ID format that FLAMES expects? Is that documented anywhere?

ChangqingW commented 3 months ago

It expects @BC_UMI#Anything, as output by flexiplex; everything after the first # is ignored. I'll add this to FLAMES' documentation.
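As a rough sketch of the conversion discussed above, the two ID prefixes produced by a two-pass run could be merged into a single BC_UMI prefix with awk. This assumes 4-line FASTQ records and the ID layout described earlier in the thread ("barcode1_UMI#barcode2_UMI#original_read_ID"); `merge_ids` is a hypothetical helper name, and the field order should be checked against real flexiplex output before relying on it.

```shell
# Hypothetical helper: merge "@bc1_umi1#bc2_umi2#rest" into "@bc1bc2_umi1umi2#rest".
# Assumes 4-line FASTQ records; headers with fewer than three '#' fields pass through.
merge_ids() {
  awk '
    NR % 4 == 1 {
      n = split(substr($0, 2), part, "#")      # drop the leading "@", split on "#"
      if (n >= 3) {
        split(part[1], a, "_")                 # a[1] = barcode 1, a[2] = UMI 1 (may be empty)
        split(part[2], b, "_")                 # b[1] = barcode 2, b[2] = UMI 2 (may be empty)
        rest = part[3]
        for (i = 4; i <= n; i++) rest = rest "#" part[i]
        print "@" a[1] b[1] "_" a[2] b[2] "#" rest
        next
      }
    }
    { print }                                  # sequence, "+", and quality lines unchanged
  '
}
```

It reads FASTQ on stdin and writes to stdout, so it can be appended to the end of the two-pass pipe, e.g. `... | merge_ids > result.fastq`.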

ashokpatowary commented 3 months ago

Hi @nadiadavidson and @ChangqingW

Thanks for the suggestion. It's ScaleBio technology, which I am trying to adapt for long-read sequencing. Identifying the barcode is easy on the PacBio platform, but with ONT it's a little tricky; I think flexiplex can handle it, though. Thanks for the FLAMES suggestion; I think I can modify the read IDs with awk.

@nadiadavidson I have another follow-up question. If I run flexiplex with -u ???????? -b ?????????? and -k with a barcode file (10 bp barcode sequences), it throws the following error; but if I reduce the "-b" wildcard length to 7 "?", still with barcode files containing 10 bp barcodes, it works. Any suggestions as to what is going on?

Setting max barcode edit distance to 2
Setting number of threads to 24
For usage information type: flexiplex -h
No filename given... getting reads from stdin...
Searching for barcodes...
terminate called after throwing an instance of 'std::out_of_range'
what():  basic_string::substr
Aborted

nadiadavidson commented 3 months ago

Thanks @ashokpatowary, that's interesting to know. If you get it working, it would be great to add this use case to our documentation and/or presets.

I've had a similar error in the past when processing ParseBio data, and I believe it was due to truncated reads where the barcode was partially cut off. I got around this by adding some "buffer" sequence to the end of each read and then adding that back into the flanking search sequence (so it was trimmed). This was not ideal, and I had meant to post a GitHub issue here about it. If you have a small toy dataset which reproduces the issue, we can take a look in more detail.
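A minimal sketch of that padding workaround, assuming 4-line FASTQ records: the same buffer is appended to every sequence line, and the quality line is extended by the same length so the record stays valid. `pad_reads` and the buffer string are hypothetical, not part of flexiplex.

```shell
# Hypothetical helper: append a buffer sequence to the end of every read.
# Usage: pad_reads BUFFER < in.fastq > padded.fastq
pad_reads() {
  awk -v pad="$1" '
    NR % 4 == 2 { print $0 pad; next }          # sequence line: append the buffer
    NR % 4 == 0 {                               # quality line: append matching placeholder qualities
      q = pad
      gsub(/./, "I", q)
      print $0 q
      next
    }
    { print }                                   # header and "+" lines unchanged
  '
}
```

The buffer would then be added to the end of the last -x flanking sequence in the flexiplex call, so that it is trimmed off again along with the flank.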

Cheers, Nadia.

Also pinging @olliecheng about this.

yxsee commented 3 months ago

Hi, I've tried flexiplex on ParseBio datasets, which have a similar barcoding structure. It would be great if flexiplex -i true could split chimeric reads and add the barcode sequence to the read ID without trimming sequences, so that the full flanking sequence can be used in both flexiplex runs. Perhaps sequence removal could be added as a separate option, so that we can invoke it only in the last flexiplex run:

flexiplex --trim FALSE -x [flank1] -b ????????? -x [full flank2] | flexiplex --trim TRUE -x [full flank2] -u ???????? -b ?????????? -x TTTTTTTTTTTTT

ashokpatowary commented 3 months ago

Thanks @nadiadavidson and @yxsee.

I think I can get it done using the following; however, since the upstream flanking sequence is not specified, there is a chance of false positives:

flexiplex-linux -k lig_seq.txt -n Lig -x CTACACGACGCTCTTCCGATCT -b ????????? -x TCAGAGC -u ???????? -f 2 -e 2 test.fastq -p 24 | flexiplex-linux -n rt -x '' -b ?????????? -x TTTTTTTTTT -f 2 -e 2 -k rt_seq.txt

I then tried

sed "/[@,+]/! s/^/START/g" | flexiplex-linux -n test_rt -x START

and it threw the same error. After that I tried

sed "/[@,+]/! s/^/START/g" | flexiplex-linux -n rt -x '' -b ?????????? -x TTTTTTTTTT -f 2 -e 2 -k rt_seq.txt

which threw the same error, because I introduced 4 characters at the start of the sequence. To check that the 2nd barcode is not causing any trouble, I ran

flexiplex-linux -n test -x TCAGAGC -u ???????? -b ?????????? -x TTTTTTTTTT -f 2 -e 2 -k rt_seq.txt test.fastq

which works fine and identifies the 2nd barcodes. I am not sure what is causing the issue when I run it two times. I will be happy to share a test file.
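One caveat with the sed filter above: Phred+33 quality strings can themselves contain "@", "+" and ",", so selecting lines by those characters can skip or mis-tag some records. A possible alternative, assuming 4-line FASTQ records, is to select lines by their position within the record. `prefix_reads` is a hypothetical helper, and this only makes the tagging step more robust; it is not a confirmed fix for the out_of_range error.

```shell
# Hypothetical helper: prepend a tag to every read, choosing lines by position, not content.
# Usage: prefix_reads TAG < in.fastq > tagged.fastq
prefix_reads() {
  awk -v tag="$1" '
    NR % 4 == 2 { print tag $0; next }          # sequence line: prepend the tag
    NR % 4 == 0 {                               # quality line: prepend matching placeholder qualities
      q = tag
      gsub(/./, "I", q)
      print q $0
      next
    }
    { print }                                   # header and "+" lines unchanged
  '
}
```

Because the quality line grows by the same length as the sequence line, the output remains a valid FASTQ record.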

Thanks

ChangqingW commented 3 months ago

@ashokpatowary Could you check if my branch fixes the out_of_range error? The handling of truncated UMIs is still inconsistent at the moment, but the branch should resolve the error during UMI extraction.

ashokpatowary commented 3 months ago

Hi @ChangqingW; unfortunately the branch threw the same error:

Searching for barcodes...
0.1 million reads processed..
0.2 million reads processed..
0.3 million reads processed..
0.4 million reads processed..
0.5 million reads processed..
terminate called after throwing an instance of 'std::out_of_range'
 what():  basic_string::substr
Aborted

ChangqingW commented 3 months ago

Can you use the flexiplex-linux binary from my branch, do ulimit -c unlimited before running flexiplex, and share the core.xxx dump file? It should be pretty small and can be uploaded to the issue comment.
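The request above could be wrapped roughly as follows. `run_with_core` is a hypothetical wrapper; it raises the core size limit only inside a subshell, so the setting does not leak into the rest of the session. Where the dump is written, and what it is called, depends on the system's core pattern configuration, so the core.xxx name is not guaranteed.

```shell
# Hypothetical wrapper: run a command with core dumps enabled for that command only.
run_with_core() {
  (
    # Raising the limit can fail if the hard limit is lower; warn and carry on.
    ulimit -c unlimited 2>/dev/null || echo "warning: could not raise core size limit" >&2
    "$@"
  )
}

# Example (same options as in the failing run):
# run_with_core ./flexiplex-linux <usual options> test.fastq
```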

olliecheng commented 3 months ago

@ChangqingW @ashokpatowary Thanks for your feedback and bug report. I’ve moved the discussion to a separate issue (#43) so it’s easier to find and track; I’ll close this issue now. If you have any more discussion relevant to the original issue, feel free to reopen. Cheers!