BUStools / bustools

Tools for working with BUS files
https://bustools.github.io/
BSD 2-Clause "Simplified" License
91 stars 23 forks source link

bustools capture does not work on long barcodes? #98

Closed JohnMMa closed 1 month ago

JohnMMa commented 10 months ago

I am testing our usual quantification pipelines with kallisto and bustools, and noticed bustools capture seems to have problem with its barcode mode, if the barcodes are long (in this case, 38bp).

(All the files in the code segments are in the attached tarball, or generated by the code themselves.)

kallisto bus behaves normally:

[jma@prodhpc01 debug]$ kallisto bus -i FeatMismat.idx -o . -x 0,0,38:0,38,46:1,0,0 debug_data.1.fastq debug_data.2.fastq

[bus] Note: Strand option was not specified; setting it to --unstranded for specified technology
[index] k-mer length: 15
[index] number of targets: 344,908
[index] number of k-mers: 161,521
[quant] will process sample 1: debug_data.1.fastq
                               debug_data.2.fastq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 325,682 reads, 315,129 reads pseudoaligned

So were bustools sort and bustools inspect:

[jma@prodhpc01 debug]$ bustools sort -o output_sorted.bus output.bus
 all fits in buffer
Read in 315129 BUS records
reading time 0.01s
sorting time 0.03s
writing time 0.01s
[jma@prodhpc01 debug]$ bustools inspect output_sorted.bus
Read in 314812 BUS records
Total number of reads: 315129

Number of distinct barcodes: 106285
Median number of reads per barcode: 1.000000
Mean number of reads per barcode: 2.964943

Number of distinct UMIs: 52296
Number of distinct barcode-UMI pairs: 314539
Median number of UMIs per barcode: 1.000000
Mean number of UMIs per barcode: 2.959392

Estimated number of new records at 2x sequencing depth: 314178

Also, bustools count also behaved normally:

[jma@prodhpc01 debug]$ bustools count -o proteincnt -g kall_idx.t2g.txt -e matrix.ec -t transcripts.txt --genecounts output_sorted.bus
[jma@prodhpc01 debug]$ cat proteincnt.barcodes.txt | head -n 5
TCAGCGAATGTATGAACAGAGGTAAACCTACCTCAGCG
TCCTGCAATGTATGAACAGAGGTAAACCTACCTCCTGC
TCTATAAATGTATGAACAGAGGTAAACCTACCTCTATA
TCTCTCAATGTATGAACAGAGGTAAACCTACCTCTCTC
GGTGAGAATGTATGAACAGAGGTAAACGTCTAGGTGAG

However, bustools capture does not capture anything with -b even with we're using some of the barcodes seen above; I would expect there're at least some BUS records output by bustools capture.

[jma@prodhpc01 debug]$ cat proteincnt.barcodes.txt | head -n 5 > selected_barcodes.txt
[jma@prodhpc01 debug]$ cat selected_barcodes.txt
TCAGCGAATGTATGAACAGAGGTAAACCTACCTCAGCG
TCCTGCAATGTATGAACAGAGGTAAACCTACCTCCTGC
TCTATAAATGTATGAACAGAGGTAAACCTACCTCTATA
TCTCTCAATGTATGAACAGAGGTAAACCTACCTCTCTC
GGTGAGAATGTATGAACAGAGGTAAACGTCTAGGTGAG

[jma@prodhpc01 debug]$ bustools capture -b -c selected_barcodes.txt -o output_captured.bus output_sorted.bus
Read in 314812 BUS records, wrote 0 BUS records
[debug_files.tar.gz](https://github.com/BUStools/bustools/files/13342508/debug_files.tar.gz)
[jma@prodhpc01 debug]$ bustools inspect output_captured.bus
Read in 0 BUS records
Total number of reads: 0

Number of distinct barcodes: 0
Median number of reads per barcode: 0.000000
Mean number of reads per barcode: -nan

Number of distinct UMIs: 0
Number of distinct barcode-UMI pairs: 0
Median number of UMIs per barcode: 0.000000
Mean number of UMIs per barcode: -nan

Estimated number of new records at 2x sequencing depth: 0

kallisto and bustools are of the latest version:

 [jma@prodhpc01 debug]$ kallisto
kallisto 0.50.1

Usage: kallisto <CMD> [arguments] ..

Where <CMD> can be one of:

    index         Builds a kallisto index
    quant         Runs the quantification algorithm
    quant-tcc     Runs quantification on transcript-compatibility counts
    bus           Generate BUS files for single-cell data
    h5dump        Converts HDF5-formatted results to plaintext
    inspect       Inspects and gives information about an index
    version       Prints version information
    cite          Prints citation information

Running kallisto <CMD> without arguments prints usage information for <CMD>

[jma@prodhpc01 debug]$ bustools
bustools 0.43.1

Usage: bustools <CMD> [arguments] ..

Where <CMD> can be one of:

sort            Sort a BUS file by barcodes and UMIs
correct         Error correct a BUS file
umicorrect      Error correct the UMIs in a BUS file
count           Generate count matrices from a BUS file
inspect         Produce a report summarizing a BUS file
allowlist       Generate an on-list from a BUS file
project         Project a BUS file to gene sets
capture         Capture records from a BUS file
merge           Merge bus files from same experiment
text            Convert a binary BUS file to a tab-delimited text file
extract         Extract FASTQ reads correspnding to reads in BUS file
predict         Correct the count matrix using prediction of unseen species
collapse        Turn BUS files into a BUG file
clusterhist     Create UMI histograms per cluster
linker          Remove section of barcodes in BUS files
compress        Compress a BUS file
inflate         Decompress a BUSZ (compressed BUS) file
version         Prints version number
cite            Prints citation information

Running bustools <CMD> without arguments prints usage information for <CMD>

I'm not so familiar with C++, but it seems many things in the bustools capture has a hard-coded limit for 32-byte strings. Does it have anything to do with it?

debug_files.tar.gz

Yenaled commented 10 months ago

Correct, BUS files cannot store barcodes longer than 32 bp's.

For barcodes longer than 32 bp's, you'll either have to extract out a smaller region of the barcode or you have to do an additional preprocessing step to map your longer barcodes to a shorter version of the barcodes.

JohnMMa commented 10 months ago

As I mentioned above, some of key the bustools subcommands already seem to work normally with long barcodes, like bustools sort, bustools inspect, and bustools count, and I confirm bustools text and bustools fromtext also seem to work. I think I should try to test whether sort, inspect, and count's results are actually correct.

But, still, with the rise of combinatorial sequencing which necessitates long barcodes, some kind of solution is needed here.

Yenaled commented 10 months ago

Those commands will work -- but that doesn't mean they will produce the correct output (I can tell you already that if you're trying to work with more than 32 nucleotides in the barcode, those commands will not give you the correct output). The BUS file only allows up to 32 nucleotides to be stored in the barcodes field. The reason for this is that the barcode is stored as an integer (which is 64-bits on a 64-bit machine) -- each nucleotide occupies 2 bits (therefore max = 32 nucleotides).

It is true that combinatorial sequencing will result in long barcodes in the near future (not just longer barcodes, but more complex barcodes) -- which is why an additional barcode preprocessing step should be done before running kallisto. In fact, this was the primary reason I developed (and am still actively developing) splitcode: https://www.biorxiv.org/content/10.1101/2023.03.20.533521v2.full (just yesterday, I used this on a split-pool sequencing assay with 4 rounds of split-pooling with barcodes >10bp each round).

JohnMMa commented 1 month ago

Because this is one of those wontfix issues, I'm closing it. I already built pipelines powered by splitcode, as @Yenaled already noticed.