dnbaker / dashing

Fast and accurate genomic distances using HyperLogLog
GNU General Public License v3.0
160 stars 11 forks source link

Add support for 10x bam files #12

Closed olgabot closed 5 years ago

olgabot commented 5 years ago

Hello, For computing signatures of single-cell RNA-seq data, it's very convenient to use the 10x bam file directly. This was implemented in sourmash using the Python package bamnostic. For C/C++, I presume the htslib library would be used directly. It's been a while since I stretched my C muscles but I could give it a shot. Warmest, Olga

dnbaker commented 5 years ago

htslib could be used, or we could parse BAM files directly to avoid portability issues accompanying linking against htslib. The simplest thing would be a preprocessing script or executable which took a BAM and wrote a concatenation of all barcodes to individual fastx records before being provided, though this would be rather slow and just a stopgap measure.

After looking over the sourmash modifications, I think something like the following would likely work: (Untested)

import pysam
from collections import defaultdict
import subprocess
d = defaultdict(str)
for read in pysam.AlignmentFile(input_path):
    d[read.opt("RB")] += read.seq + 'N'
fns = []
for k, v in d.items():
    fn = f"{k}.fa"
    fns.append(fn)
    open(fn, "w").write(f">{k}\n{v}\n")
open("tmpfnames.txt", "w").write("\n".join(fns) + "\n")
subprocess.check_call("dashing dist -Ftmpfnames.txt <other args>", shell=True)

We do think this would be a great use case, and will be considering working on native support, as it would be much faster and avoid a lot of temporary files.

You'll probably want to use a rather small sketch size (p = 8 or less (use flag -S8 for 256 bytes per sketch), as the cardinality in question is quite small, and we designed this expecting primarily to work with relatively large sketches.

olgabot commented 5 years ago

That was my initial solution, as well, but creating a dictionary for each cell uses 300GiB+ of memory for bam files with >1000 cells, which is unreasonable. Also, there are Ns in the bam file so separating reads by N wouldn't be unambiguous. Or does this dashing/bonsai ignore all k-mers containing an N? I've been trying to dig into bonsai/bonsai/include/encoder.h to figure out how k-mers are derived but so far have gotten lost in the C++.

There's another PR which sorts the alignment file by barcode first, which is has a much smaller memory footprint.

You'll probably want to use a rather small sketch size (p = 8 or less (use flag -S8 for 256 bytes per sketch), as the cardinality in question is quite small, and we designed this expecting primarily to work with relatively large sketches.

Yes, the RNA-seq profiles probably have ~100k-500k unique k-mers so smaller sketches should be fine.

dnbaker commented 5 years ago

A barcode-based project of mine for ctDNA from years ago used barcode binning by prefix into separate files, followed by hashmap-based collapsing. Sorting by barcode first was my initial method for this project, but I found that I could go from a week or more to demultiplex large experiments to a half hour with that full rewrite. Sorting gets very expensive for billions of reads.

I'm not sure I understand -- will you be comparing pairwise distances between linked reads or between sequencing experiments?

If you want to compare whole RNAseq experiments, you should just use a bam2fq conversion tool (or avoid aligning to begin with) and then use dashing sketch in fastq mode.

If you want to do all comparisons between sets of reads containing individual barcodes, the simple answer is a hashmap from barcode to a HyperLogLog sketch. The barcodes would take 4 bytes each (since 16-base pair barcodes can be encoded in 32 bits). std::unordered_map is pretty slow compared to khash or one of the ska:: maps, but it would work for a proof of concept before a better hash table replaced it.

A draft of this section would look like something like:

// Initialize k, in samFile *, header, and bam1_t *
void *data;
std::unordered_map<u32, hll::hll_t> map;
Encoder enc(k);
while(sam_read1(in, hdr, b) >= 0) {
    const char *tag(((data = bam_aux_get(b, "RB")) != nullptr) ? bam_aux2Z(data): nullptr);
    if(!tag) throw std::runtime_error("Missing RB tag");
    u32 bc = cstr_lut[tag[0]]; // Convert the first character into 2 bits
    for(unsigned i = 1; i < 16; ++i) {
        bc <<= 2; // Shift left by two
        bc |= cstr_lut[tag[i]];  // Encode the next two
    }
    auto it = map.find(bc);
    if(it == map.end()) it = map.emplace(bc, hll::hll_t(log2sz)).first;
    enc.binary_for_each(bam_get_seq(b), b->core.l_qseq, [&](u64 val) {it->second.addh(va);});
    // This function doesn't exist, but it's worth adding code for working from 4-bit encoded DNA instead of ASCII strings; I will work on this soon. (As htslib uses 4 bits to work with Ns)
}

This barcode encoding is similar to bonsai, but there's no reverse complementing because the barcode is the barcode.

It's not parallelized; if the bam was already indexed, then a map/reduce strategy by seeking to separate portions of the file would be useful, but something like this wouldn't hurt for a first pass of the parse, and most time would probably be in the quadratic comparison phase.

Yes, bonsai ignores all kmers containing Ns. Most tools do (e.g., Kraken{1,2}, their competitors). This allows us to use 2 bits per character and therefore fit 31 nucleotides in a 64-bit int, using UINT64_C(-1) as an error code. (You can still use 32, but you won't accurately identify any sequences consisting of 32 Ts correctly. Hopefully that doesn't actually matter.)

I'd have more time to work on this, but we're in the final stages of preparation before submission, so things requiring time will need to wait a bit.

olgabot commented 5 years ago

Huh wow, I didn't realize sorting could be so bad!

I'm not sure I understand -- will you be comparing pairwise distances between linked reads or between sequencing experiments?

Sorry I was unclear. The goal is to compare barcodes both across and within RNA-seq experiments, e.g. cellX from experimentA vs cellY from experimentB, and cellX_1, cellX_2, ... cellX_n from experimentA. So the first option, all pairwise distances between reads associated to a barcode. Using the 10x bam file instead of starting from the raw fastq enables apples:apples comparison of Jaccard similarities vs pairwise distances of the processed count matrix.

Forgot to say before: The conversion of the bam to a bunch of concatenated reads separated by an N is quite clever! I think I'll try that, now that I know sorting sucks so much. Would that mean creating a separate fast{a,q} file for each cell? Or can dashing separate samples on a per-fasta-record basis?

That C++ code is a little over my head with the bit shifting, and I don't think I could write the binary_for_each function, so I'll stick to simple methods for now.

I'd have more time to work on this, but we're in the final stages of preparation before submission, so things requiring time will need to wait a bit.

No worries. I think I have enough to go off of here. Good luck with the submission!

dnbaker commented 5 years ago

dashing can process samples on a per-record basis, but it does this against a set of references rather than all-pairs. You can compare experiments currently with the fastq interface; if the raw fastq has the barcodes, they'd be easier to process directly without adding the htslib requirement. Are the RB tags available in some form there, or are they produced after using alignment information to correct errors? I don't have experience with 10x barcodes, just Loew Lab-style inline barcodes.

It's not just sorting, it's also more efficient encoding of sequence and a more easily parallelized algorithm.

We do plan to implement an efficient method for working on this soon. I think it's ideally suited to the problem, as the memory requirement ends up just being on the order of (4 + sketch size) * nbarcodes, rather than sorting or concatenating at all.

dnbaker commented 5 years ago

A draft implementation (untested) is now available at https://github.com/dnbaker/10xdash.

There’s likely some debugging to do; if you run into problems, if you provide some sample files, I’ll be better able to fix them.

olgabot commented 5 years ago

Thank you! I'll check this out next week.

Olga Botvinnik, PhD olgabotvinnik.com http://www.olgabotvinnik.com

On Sat, Jan 26, 2019 at 7:04 AM Daniel Baker notifications@github.com wrote:

A draft implementation (untested) is now available at https://github.com/dnbaker/10xdash.

There’s likely some debugging to do; if you run into problems, if you provide some sample files, I’ll be better able to fix them.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dnbaker/dashing/issues/12#issuecomment-457838093, or mute the thread https://github.com/notifications/unsubscribe-auth/AAxNcEw-2bbWG6aPY-KPuOq0YhqOu4yQks5vHG56gaJpZM4aPJ3Z .

dnbaker commented 5 years ago

Great!

No rush on this, but also, are these bams aligned/sorted? If they are, there are additional opportunities for parallelism, though currently it just goes through the file linearly and parallizes JI calculations.

Enjoy your weekend!

dnbaker commented 5 years ago

We're continuing to develop this work at https://github.com/dnbaker/10xdash. Feel free to ask further questions here or there, but for now, I'm closing this issue.