Closed JohnMMa closed 3 months ago
Correct, BUS files cannot store barcodes longer than 32 bp.
For barcodes longer than 32 bp, you'll either have to extract a smaller region of the barcode or add a preprocessing step that maps your longer barcodes to shorter ones.
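The second option (mapping long barcodes to shorter ones) could be sketched as below. This is only an illustration under stated assumptions, not how splitcode or any other tool does it: the helper names `index_to_barcode` and `build_barcode_map`, the 16 nt surrogate length, and the example barcodes are all made up. The idea is to give every distinct long barcode a unique short surrogate that fits in 32 bp, and to keep the mapping so results can be translated back later.

```python
def index_to_barcode(index: int, length: int = 16) -> str:
    """Encode a non-negative integer as a fixed-length base-4 nucleotide string."""
    if index < 0 or index >= 4 ** length:
        raise ValueError("index does not fit in the requested barcode length")
    bases = "ACGT"
    out = []
    for _ in range(length):
        index, digit = divmod(index, 4)
        out.append(bases[digit])
    return "".join(reversed(out))


def build_barcode_map(long_barcodes, length: int = 16) -> dict:
    """Assign each distinct long barcode a unique short surrogate (hypothetical scheme)."""
    unique = sorted(set(long_barcodes))  # sorted so the mapping is reproducible
    return {bc: index_to_barcode(i, length) for i, bc in enumerate(unique)}


# Made-up 38 nt barcodes, purely for illustration.
observed = ["ACGT" * 9 + "AC", "TTGA" * 9 + "GG", "ACGT" * 9 + "AC"]
mapping = build_barcode_map(observed)
for long_bc, short_bc in mapping.items():
    print(long_bc, "->", short_bc)
```

The surrogate barcodes would have to replace the long ones in the reads before quantification, and the mapping table would need to be stored alongside the output.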
As I mentioned above, some of the key `bustools` subcommands already seem to work normally with long barcodes, like `bustools sort`, `bustools inspect`, and `bustools count`, and I can confirm that `bustools text` and `bustools fromtext` also seem to work. I think I should test whether the results of `sort`, `inspect`, and `count` are actually correct.
Still, with the rise of combinatorial sequencing, which necessitates long barcodes, some kind of solution is needed here.
Those commands will work, but that doesn't mean they will produce the correct output. (I can tell you already that if you're trying to work with more than 32 nucleotides in the barcode, those commands will not give you the correct output.) The BUS file only allows up to 32 nucleotides to be stored in the barcode field. The reason is that the barcode is stored as an integer (which is 64 bits on a 64-bit machine) and each nucleotide occupies 2 bits, so the maximum is 64 / 2 = 32 nucleotides.
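To make the 2-bits-per-nucleotide arithmetic concrete, here is a minimal sketch of packing a barcode into a 64-bit integer. The A/C/G/T to 0/1/2/3 ordering and the function names are assumptions for illustration; this emulates a naive 64-bit packing and is not taken from the bustools source, so the exact way bustools mishandles longer barcodes may differ.

```python
# 2-bit code per nucleotide; this particular ordering is an assumption.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
MASK64 = (1 << 64) - 1  # emulate a 64-bit unsigned integer


def pack_barcode(seq: str) -> int:
    """Pack a nucleotide string into 64 bits, 2 bits per base."""
    value = 0
    for base in seq:
        # Bits shifted past position 63 are discarded by the mask.
        value = ((value << 2) | CODE[base]) & MASK64
    return value


def unpack_barcode(value: int, length: int) -> str:
    """Recover a barcode of the given length (at most 32) from its packed form."""
    bases = "ACGT"
    return "".join(
        bases[(value >> (2 * (length - 1 - i))) & 3] for i in range(length)
    )


short = "ACGT" * 8        # 32 nt: uses all 64 bits
long_a = "A" * 6 + short  # 38 nt
long_b = "T" * 6 + short  # 38 nt, differs only in the first 6 bases

# A 32 nt barcode round-trips without loss.
assert unpack_barcode(pack_barcode(short), 32) == short
# Two different 38 nt barcodes collapse to the same 64-bit value.
assert pack_barcode(long_a) == pack_barcode(long_b)
```

Under this packing, distinct 38 nt barcodes can become indistinguishable, which is why a command can run without error and still give incorrect results.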
It is true that combinatorial sequencing will result in long barcodes in the near future (not just longer barcodes, but more complex barcodes) -- which is why an additional barcode preprocessing step should be done before running kallisto. In fact, this was the primary reason I developed (and am still actively developing) splitcode: https://www.biorxiv.org/content/10.1101/2023.03.20.533521v2.full (just yesterday, I used this on a split-pool sequencing assay with 4 rounds of split-pooling with barcodes >10bp each round).
Because this is one of those wontfix issues, I'm closing it. I already built pipelines powered by `splitcode`, as @Yenaled already noticed.
I am testing our usual quantification pipelines with `kallisto` and `bustools`, and noticed that `bustools capture` seems to have a problem with its barcode mode when the barcodes are long (in this case, 38 bp).
(All the files in the code segments are in the attached tarball, or are generated by the code itself.)
`kallisto bus` behaves normally:
So do `bustools sort` and `bustools inspect`:
`bustools count` also behaves normally:
However, `bustools capture` does not capture anything with `-b`, even when using some of the barcodes seen above; I would expect at least some BUS records to be output by `bustools capture`.
`kallisto` and `bustools` are both the latest version:
I'm not so familiar with C++, but it seems many things in `bustools capture` have a hard-coded limit for 32-byte strings. Does that have anything to do with it?
debug_files.tar.gz