TileDB-Inc / TileDB

The Universal Storage Engine
https://tiledb.com
MIT License

genomic data modeling #1703

Open ghost opened 4 years ago

ghost commented 4 years ago

Hi awesome project,

I have been playing with the TileDB-VCF library, and I am wondering if there are any design documents for BAM and FASTQ files. Any chance people have been thinking about a data model for this other data? Slice support would be very useful for FASTQ files, since it's pretty difficult to get random access to a gzipped FASTQ file. Currently, I'm sharding gzipped FASTQ files on ingestion into my system so that I can process them across the cluster.

stavrospapadopoulos commented 4 years ago

Hi there! Yes on FastQ! See here: https://github.com/TileDB-Inc/TileDB-FastQ

We've been waiting for someone to try it out and give us feedback. We thought it would be extremely useful for the use case you are describing (especially on the cloud), but we are seeking more validation before we put more cycles into it.

Another benefit to note with this approach is that in the future we can make our filters pluggable (i.e., user-defined via lambdas), which means that anyone out there with a brilliant custom DNA data compressor could implement a single piece of code (compression/decompression) without having to reinvent the wheel around L1 cache-aware parallelism (which we do via tile chunking), parallel IO, numerous optimizations for cloud object stores, versioning, metadata, etc.
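To make that concrete, here is a rough and purely hypothetical sketch of the surface such a plugin might expose; nothing like this exists in TileDB today, and the class and method names are made up:

```python
import bz2

# Purely hypothetical plugin surface: TileDB would call compress()/decompress()
# once per tile chunk and handle the cache-aware chunking, parallel IO,
# cloud-store optimizations, versioning, and metadata around it.
class CustomQualityCodec:
    def compress(self, chunk: bytes) -> bytes:
        # A purpose-built DNA/quality compressor would go here; bz2 is a stand-in.
        return bz2.compress(chunk, 9)

    def decompress(self, chunk: bytes) -> bytes:
        return bz2.decompress(chunk)
```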

We have also experimented in a private repo with BAM files. That is another potential project for us to focus on in the future, but it is a bit more complex than FastQ.

We are eager to hear some feedback.

ghost commented 4 years ago

Oh! Awesome.

So the reads and the quality scores are each their own column? Seems like this could seriously drop the size of a FastQ file. Are there any benchmarks size-wise?

We’re finishing the base of some of our infra over the next couple of months, and we will be doing a more serious evaluation for some infra on our R&D side. I’ve tinkered a bit with the VCF package and like it.

I think the goal is going to be something like a narrow-waisted stack where Arrow schemas are the narrow waist and, below that, TileDB would be the disk format. Above Arrow would just be some classes around the usual pandas/numpy/dask as an evidence-access layer. Having all of our ad hoc queries on a stack like this would help us with a clear query dialect and auditing.

Right now, querying over the hts formats really sucks, and moving the files around makes auditing ad hoc queries almost impossible. I think if we can grok how hard it would be to have an isomorphism between the hts files and the TileDB representation, then we would be able to pitch this more concretely. The isomorphism basically has to be there, since some people won't move off of old lab tooling.
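As a sketch of what I mean (the array URI and attribute names here are placeholders, not a real schema), the waist would look something like this:

```python
import pyarrow as pa
import tiledb

# Placeholder URI and attribute names, just to illustrate the layering:
# TileDB below as the disk format, an Arrow schema as the narrow waist,
# and pandas/dask-style objects above it as the evidence-access layer.
with tiledb.open("s3://bucket/some_fastq_array", mode="r") as A:
    res = A[100_000:100_100]          # dict of numpy arrays, one per attribute

table = pa.table({
    "sequence": pa.array(res["sequence"]),
    "quality": pa.array(res["quality"]),
})

df = table.to_pandas()                # ad hoc queries happen up here
```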

I would like to eventually try to get a distributed BWA job going from an array. How would you recommend going about this? I think in order to map individual reads, all I would need is a way to read a slice of FastQ records from a buffer into a struct. From there I can use the BWA C API to feed the struct into the mapper. Would you recommend just getting the buffers from the TileDB API and then trying to use Arrow to get a struct?

Here’s what the BWA API looks like: https://github.com/10XGenomics/rust-bwa
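Roughly what I'm imagining on the slicing side (the array URI and attribute names are made up, and the actual BWA call is left out):

```python
import tiledb

def read_batch(uri: str, start: int, count: int):
    """Pull a contiguous slice of FASTQ records out of a (hypothetical) TileDB array."""
    with tiledb.open(uri, mode="r") as A:
        res = A[start:start + count]  # one numpy array per attribute
    # Zip the columns back into per-read records for a mapper that wants structs.
    return list(zip(res["header"], res["sequence"], res["quality"]))

# Each (header, sequence, quality) tuple would then be handed to the BWA C API
# (e.g. through the rust-bwa bindings); that step is not shown here.
batch = read_batch("some_fastq_array", start=1_000_000, count=50_000)
```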

stavrospapadopoulos commented 4 years ago

So the reads and the quality scores are each their own column?

Yes.

Seems like this could seriously drop the size of a FastQ file. Are there any benchmarks size-wise?

This is what I had originally thought, but unfortunately the savings were minimal. The problem, as always, is the qualities. You can use bzip2 with a high compression level and get about a 5-10% reduction. We should do some more benchmarking, but our initial tests showed that the savings are not substantial.
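For reference, here is a minimal sketch of a columnar FASTQ layout in the TileDB Python API; the attribute names and the dense 1-D layout are illustrative, not the actual TileDB-FastQ schema, and the qualities attribute is where the heavier bzip2 filter goes:

```python
import numpy as np
import tiledb

# Illustrative only: a dense 1-D array over read index, one attribute per
# FASTQ field. The real TileDB-FastQ schema may differ.
dom = tiledb.Domain(
    tiledb.Dim(name="read_idx", domain=(0, 999_999), tile=10_000, dtype=np.uint64)
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=False,
    attrs=[
        tiledb.Attr(name="header", dtype=str,
                    filters=tiledb.FilterList([tiledb.GzipFilter(level=6)])),
        tiledb.Attr(name="sequence", dtype=str,
                    filters=tiledb.FilterList([tiledb.GzipFilter(level=6)])),
        # The qualities column is where the (modest) savings come from.
        tiledb.Attr(name="quality", dtype=str,
                    filters=tiledb.FilterList([tiledb.Bzip2Filter(level=9)])),
    ],
)
tiledb.Array.create("fastq_demo", schema)
```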

I think the goal is going to be something like a narrow-waisted stack where Arrow schemas are the narrow waist and, below that, TileDB would be the disk format.

We are thinking about integrating with Arrow in the various TileDB APIs, so that we can directly expose the results as Arrow buffers. That may save you some time implementing your own Arrow layer.

I would like to eventually try to get a distributed BWA job going from an array.

That's exactly the use case I had in mind when I was pitching this project to our team internally. Assuming that BWA is (or could become) embarrassingly parallelizable, I would grab a batch of contiguous reads per process and execute BWA on them separately (thus the fast random slicing of TileDB-FastQ would be of enormous help). Then I would somehow stitch the results into the final BAM file (unless that last step turns out to be unnecessary). I don't know the BWA C API well, although I'll try to find the time to dig in. The goal should always be to work on zero-copy buffers, instead of copying read by read into separate objects. If that's possible with the current BWA API, then it will really be as fast as it can be at the IO level for the TileDB-stored FastQ data.
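Sketching the partitioning I have in mind (the URI is a placeholder and run_bwa is a stand-in for whatever BWA binding ends up being used):

```python
from multiprocessing import Pool

import tiledb

URI = "some_fastq_array"   # placeholder
BATCH = 500_000            # contiguous reads per task

def run_bwa(sequences, qualities):
    # Stand-in for the actual BWA call; ideally it would consume these buffers
    # zero-copy rather than copying read by read into separate objects.
    return len(sequences)

def map_batch(start: int):
    with tiledb.open(URI, mode="r") as A:
        end = min(start + BATCH, A.shape[0])
        res = A[start:end]
    return run_bwa(res["sequence"], res["quality"])

if __name__ == "__main__":
    with tiledb.open(URI, mode="r") as A:
        n_reads = A.shape[0]   # assumes a dense 1-D array over read index
    with Pool(8) as pool:
        results = pool.map(map_batch, range(0, n_reads, BATCH))
    # The per-batch results would then be stitched into the final BAM.
```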

I am happy to discuss further. You can also reach out to me (stavros@tiledb.com) or the team (hello@tiledb.com) if you are up for scheduling a call. This use case is of extreme interest to us, and we believe we can make a difference by enabling fast random slicing of FastQ data.

ghost commented 4 years ago

This is what I had originally thought, but unfortunately the savings were minimal.

Ah, that's surprising to me. Is it that the qualities column is really high-entropy?

Assuming that BWA is (or could become) embarrassingly parallelizable

It is embarrassingly parallel. There's one caveat: each process has to have access to an exact copy of the reference genome that you're mapping to. There are a couple of ways I have experimented with doing this, but the easiest seems to be to pull the data to a path on the host like /ref when the node pool bootstraps, and then have all the k8s workers that are bin-packed onto the node pool use a hostPath mount. If you're processing several sequencer runs, the cost is amortized and not a big deal. I have seen some people use a FUSE layer over an object store with a RWX mount on each worker pod, but that seems sketchy to me.

If that's possible with the current BWA API, then it will really be as fast as it can be at the IO level for the TileDB-stored FastQ data.

Yeah, that is possible, and I think you would get some insane benchmarks. They could be even more impressive with the SIMD version of BWA, given your columnar storage and cache-aware IO.

Genome assemblers would get a lot of use out of this slicing as well.

I am happy to discuss further.

Sure, I'll shoot you guys an email. I am head underwater for a while, so I won't be able to do any serious evaluation for probably a couple of months.

ghost commented 4 years ago

If you're curious, I found this fork of BWA that uses Arrow RecordBatches and a Plasma cache. It looks like it might be pretty straightforward to do a distributed mapper on top of the TileDB-FastQ package.

https://github.com/tahashmi/bwa/blob/master/README.md