bcgsc / btllib

Bioinformatics Technology Lab common code library

btllib/indexlr fails to read barcode information #71

Closed aafshinfard closed 1 year ago

aafshinfard commented 1 year ago

Linked-read barcode information is usually present in the read header, in one of several formats. I have seen three cases:

1. @V10002828L1C001R013000000#543_288_92/1 1
   The barcode is placed between # and /.
2. @V10002828L1C001R013000000_1 BX:Z:543_288_92
   The barcode follows the BX:Z: tag.
3. @A00428:24:H5327DSXX:2:1101:1253:1000 1:N:0:CTGTAACT
   The barcode follows 1:N:0: (this may be more complex; for example, the 0 may be any number, so the example file needs a closer look).

The current btllib code supports the first and the second one, but not the last one (seen in 10x data from T2T). Would be nice to have this supported.
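To make the three header layouts concrete, here is a minimal Python sketch that pulls the barcode-like field out of each example header from this issue. It is not btllib's implementation: the function name `extract_barcode` and the regular expressions are hypothetical, and the third pattern assumes the comment has the form number:Y-or-N:number:sequence, which is only inferred from the single example header.

```python
import re
from typing import Optional

# Hypothetical patterns, one per header layout listed in this issue.
# They are illustrative assumptions, not btllib's actual parsing rules.
_BX_TAG = re.compile(r"\bBX:Z:(\S+)")            # ... BX:Z:543_288_92
_HASH_SLASH = re.compile(r"#([^/\s]+)/")         # ...#543_288_92/1
_COLON_COMMENT = re.compile(r"\s\d+:[YN]:\d+:(\S+)")  # ... 1:N:0:CTGTAACT


def extract_barcode(header: str) -> Optional[str]:
    """Return the barcode-like field of a FASTQ header, or None if no layout matches."""
    # Try the explicit BX:Z: tag first, then the #.../ layout,
    # then the colon-separated comment seen in the third example.
    for pattern in (_BX_TAG, _HASH_SLASH, _COLON_COMMENT):
        match = pattern.search(header)
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    for h in (
        "@V10002828L1C001R013000000#543_288_92/1 1",
        "@V10002828L1C001R013000000_1 BX:Z:543_288_92",
        "@A00428:24:H5327DSXX:2:1101:1253:1000 1:N:0:CTGTAACT",
    ):
        print(h, "->", extract_barcode(h))
```

The patterns are tried from most to least explicit, so a header carrying a BX:Z: tag is never misread through one of the looser layouts.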

An example dataset with that last format:


$ pigz -dc /projects/btl/datasets/hsapiens/CHM13/T2T/10x/CHM13_interleved_all.fq.gz | head -n1
@A00428:24:H5327DSXX:2:1101:1253:1000 1:N:0:CTGTAACT

aafshinfard commented 1 year ago

A colleague pointed out that the third format is the raw format, as it appears before running Longranger, so I am closing the issue.