amplab / snap

Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
https://www.microsoft.com/en-us/research/project/snap/
Apache License 2.0

Original fastq order #127

Open jkbonfield opened 3 years ago

jkbonfield commented 3 years ago

Hello,

I see you are now working on a dev branch of this (affineGap), so please consider this as a feature request.

SNAP is astoundingly fast, especially when driven to the limit with e.g. -f -D 0. It doesn't produce ideal alignments at those settings, but my use case is perhaps a little different.

I'd like to leverage it for rapid but still effective fastq compression. It aligns over a million reads/sec on my 8-core system. The output can then be piped into samtools to produce compressed CRAM, giving around a four-fold reduction over gzipped fastq while still running fast (>500k seq/sec). That's actually faster than pigz compresses my test data(!), although not as fast as bgzip, which uses libdeflate and is 3x quicker than pigz. (You may wish to consider libdeflate over zlib yourselves.)
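
For concreteness, the sort of pipeline I have in mind is below. The index and reference paths are placeholders, and I'm assuming "-" works as the output name to get SAM on stdout; the -f -D 0 settings are the ones mentioned above.

snap single /path/to/snap-index in.fastq -f -D 0 -o -sam - \
    | samtools view -C -T ref.fa -o out.cram -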

However, the output order is not quite the same as the input order. I believe this is because each thread writes out a block of data as its buffer fills, but the blocks aren't allocated and written in input order, so there is randomness in the output ordering. If they were instead placed on an output queue and written in order, it would be possible to use snap+cram as a fastq compressor, with "samtools fastq" as the decompressor. This would probably need to be optional, as it may carry a slight performance hit.
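
(The decompression direction already exists, of course - something like

samtools fastq out.cram > recovered.fastq

would serve as the decompressor - so getting the reads back out in their original order is the only missing piece.)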

A related, trivial request would be an option not to emit the SAM header (or to replace it with my own). That would let me supply a pre-filled header with M5 tags, so that samtools can do the CRAM conversion much more quickly by mmaping md5-named sequence files rather than parsing and loading a ref.fa.
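
For background, the md5 route on our side is a local reference cache keyed by the M5 values, roughly along these lines (paths are placeholders; seq_cache_populate.pl ships with samtools):

seq_cache_populate.pl -root /path/to/ref_cache ref.fa
export REF_PATH=/path/to/ref_cache/%2s/%2s/%s
samtools view -C -o out.cram in.sam

With M5 tags present in the @SQ lines and REF_PATH set, the conversion looks sequences up by md5 rather than loading ref.fa.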

The obvious next improvement would be linking against htslib so SNAP can emit the correct format directly. That would also permit considerably faster SAM and BAM encoding.

bolosky commented 3 years ago

We’re days away from releasing a 1.0 version of SNAP with a ton of improvements over the old version (that’s what’s in the affineGap branch now). It’s too late to get more features in that release, but we can think about doing it later.

You’re right that the output order is randomized because of multi-threading. It would be a little tricky to get it to come out the same way that it went in, though I’m pretty sure it’s possible at some performance cost.

There’s also some support in SNAP to not align at all, which would of course be faster (if for no other reason than not loading the index). I don’t think this is exposed in the current code, but it would be easy enough to do.

There’s also an undocumented option to leave the @SQ lines out of the SAM header, though not the rest of the header. We put that in for SURPI because they had so many contigs that it was taking forever to write out a header they just ignored anyway. It’s a filetype called -samNoSQ, so you’d do something like

snap single indexDirectory input.fastq -o -samNoSQ output.sam

It wouldn’t be very hard to add another filetype that’s SAM with no header at all.

And, of course, we really should just do CRAM input and output. Probably not by using htslib, since that’s not going to work on Windows (and SNAP development is all at Microsoft now). And, TBH, I suspect that we could write a faster version anyway, like we were able to do with the other formats. But that’s going to be work that I’m not sure when we’ll get time to do.

--Bill

jkbonfield commented 3 years ago

Thanks for the reply. Some useful hints there.

NoSQ is sufficient, I think - I can just prepend my output with a bunch of amended @SQ lines before piping it into the SAM->CRAM conversion tool.
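
That is, something along these lines (an untested sketch: sq_header.sam is a hypothetical file holding my amended @SQ lines with M5 tags, and I'm assuming both that -samNoSQ can write to stdout via "-" and that samtools tolerates the @SQ lines arriving ahead of the rest of snap's header):

snap single /path/to/snap-index in.fastq -f -D 0 -o -samNoSQ - \
    | cat sq_header.sam - \
    | samtools view -C -o out.cram -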

I do need alignments, albeit basic ones, as CRAM uses the alignment between sequence and reference for compression. However, with Illumina data the number of reads with indels tends to be pretty small, so a basic match gets you most of the way there, I'd think.

As for htslib - it does support Windows and is part of our continuous integration testing, though not with the MSVC compiler. We build using mingw, but it ought to be linkable from a program built with MSVC, I think. It's possible CRAM could be implemented faster, but probably not hugely so, and sadly it's a very complex beast. I'd also doubt your SAM parser is more efficient than ours, and BAM definitely isn't, as you use zlib. BAM is dominated by the Deflate algorithm, and htslib's use of libdeflate trounces zlib. htslib's memory management lets it down a bit, though. (Of course you could, and probably should, use libdeflate yourselves, even if it's just for fastq.gz decoding.)

Looking forward to 1.0. :-)