amplab / snap

Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
https://www.microsoft.com/en-us/research/project/snap/
Apache License 2.0

Snap takes 3 hours to align two reads #83

Closed taltman closed 3 years ago

taltman commented 7 years ago

Using the machine and database profile from my previous issue, I did a test run with a FASTQ file containing two reads. It took over three hours to align them. As a reminder, the large SNAP DB was residing on a RAM disk:

taltman$ time snap-aligner single snap_db ~/repos/snap-express/tmp/test.fq
Welcome to SNAP version 1.0beta.23.

Loading index from directory... 11470s.  3818550256 bases, seed size 20
Aligning.
Total Reads    Aligned, MAPQ >= 10    Aligned, MAPQ < 10     Unaligned              Too Short/Too Many Ns     Reads/s   Time in Aligner (s)
2              1 (50.00%)             1 (50.00%)             0 (0.00%)              0 (0.00%)                 74        0

real    191m16.876s
user    0m0.047s   
sys     111m49.779s
sfederman commented 7 years ago

From the log, it's showing that the database took 11470s just to load into RAM (presumably from the RAMdisk into a second copy of working RAM) - the alignment time was negligible (0s). How big is the database relative to your total RAM?
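The breakdown described here can be checked directly from the pasted output. The following is a minimal sketch that parses the index load time and the `time` summary from the log in the original report; the helper names (`index_load_seconds`, `time_seconds`) are hypothetical, and the regular expressions assume the log format shown above.

```python
import re

# Relevant lines copied from the log in the original report.
LOG = """Loading index from directory... 11470s.  3818550256 bases, seed size 20
real    191m16.876s
user    0m0.047s
sys     111m49.779s"""


def index_load_seconds(log):
    """Extract the index load time reported by SNAP (assumes this log format)."""
    match = re.search(r"Loading index from directory\.\.\. (\d+)s", log)
    return int(match.group(1))


def time_seconds(log, label):
    """Convert a `time` line such as 'real 191m16.876s' into seconds."""
    match = re.search(rf"^{label}\s+(\d+)m([\d.]+)s", log, re.MULTILINE)
    return int(match.group(1)) * 60 + float(match.group(2))


load = index_load_seconds(LOG)
real = time_seconds(LOG, "real")
sys_time = time_seconds(LOG, "sys")

print(f"index load: {load} s of {real:.0f} s wall time ({load / real:.1%})")
print(f"kernel (sys) time: {sys_time:.0f} s ({sys_time / real:.1%} of wall time)")
```

On these numbers the index load accounts for essentially all of the wall time, and more than half of the wall time is kernel time while user time is close to zero, which suggests the run was dominated by I/O and memory handling rather than alignment.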

taltman commented 7 years ago

128 GB. See #81 for a full description of the setup.

taltman commented 7 years ago

Machine has 1 TB of RAM.

sfederman commented 7 years ago

Hard to know what's happening, but it's true you should have much faster loading. I haven't looked at RAMDisk load times in a while for SNAP; we're using SSDs with our setup in Ubuntu 14/16. I'd do some benchmarking to see if this is a hardware setup issue...

bolosky commented 7 years ago

The Linux IO system leaves something to be desired when working with very large files like this.

There are a couple of things you can try to mitigate the problem. One is to try messing around with the -map and -pre flags. -map will memory map the index rather than reading it. Without -map, with your RAMDisk solution you wind up with three copies of the index in memory: one in the RAMdisk, one in system cache and one in SNAP’s memory. -map will get rid of one of these copies (the one in SNAP’s memory), which not only saves memory but also saves the time to copy the index. -pre is there because some versions of Linux have very bad performance when faulting in memory mapped files from the disk (RAMDisk in this case) to memory. It will prefetch the file into system cache before mapping it. In your case, you can experiment with them to see what works, keeping in mind that once the file is in the system cache, -pre will most likely just slow things down.
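One way to run the experiment suggested above is a small timing harness. This is only a sketch: `build_command`, `time_command` and `FLAG_SETS` are hypothetical names, the index and input paths are placeholders, and the only SNAP options used are the -map and -pre flags described in this comment. -pre is tried only together with -map, on the assumption that prefetching is meant to precede mapping.

```python
import subprocess
import time

# Flag combinations to compare: plain read, memory map, prefetch then map.
FLAG_SETS = [[], ["-map"], ["-map", "-pre"]]


def build_command(index_dir, fastq, flags):
    """Build a single-end snap-aligner invocation with the given extra flags."""
    return ["snap-aligner", "single", index_dir, fastq, *flags]


def time_command(cmd, run=subprocess.run):
    """Run a command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    run(cmd, check=True)
    return time.perf_counter() - start


# Example use (requires snap-aligner on PATH; paths are placeholders):
# for flags in FLAG_SETS:
#     cmd = build_command("snap_db", "test.fq", flags)
#     print(" ".join(cmd), f"{time_command(cmd):.1f} s")
```

Keep in mind that the first run may leave the index in the system cache, so the order of the runs affects the timings; repeating each combination more than once gives a fairer comparison.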

There are two other things that you can do to mitigate large index load times. One is to run multiple alignments with the same index on the same command line. If you do that, it will only load the index once. This works by separating the alignments with a comma on the command line, for example:

snap-aligner single my-giant-index input.fq -o output.bam -map , paired my-giant-index input2a.fq input2b.fq -o output2.bam , single different-index input3.fq -o output3.bam -map

This runs three alignments but only does two index loads, for the first and third alignments, because the second one uses the index already loaded for the first one. Note that the comma needs to have spaces on either side of it for this to work properly.
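If the list of alignments is generated by a script, the comma-separated command line can be assembled programmatically. The sketch below uses hypothetical helper names (`build_multi_alignment`, `count_index_loads`). Passing the comma as its own argument guarantees the surrounding spaces. The load-counting rule (a load happens only when the index differs from the one used by the previous alignment) is an assumption inferred from the example in this comment, not a statement about SNAP's internals.

```python
def build_multi_alignment(jobs):
    """Join several alignment argument lists into one snap-aligner argv,
    separating them with a standalone ',' argument."""
    argv = ["snap-aligner"]
    for i, job in enumerate(jobs):
        if i > 0:
            argv.append(",")
        argv.extend(job)
    return argv


def count_index_loads(jobs):
    """Count index loads, assuming a reload only when the index changes.
    Each job is [mode, index, inputs..., options...]."""
    loads = 0
    current = None
    for job in jobs:
        index = job[1]
        if index != current:
            loads += 1
            current = index
    return loads


jobs = [
    ["single", "my-giant-index", "input.fq", "-o", "output.bam", "-map"],
    ["paired", "my-giant-index", "input2a.fq", "input2b.fq", "-o", "output2.bam"],
    ["single", "different-index", "input3.fq", "-o", "output3.bam", "-map"],
]

print(" ".join(build_multi_alignment(jobs)))
print("index loads:", count_index_loads(jobs))
```

The resulting list can be handed to `subprocess.run` without a shell, which avoids any quoting issues around the comma.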

And the final thing that you can do is to use daemon mode. This will cause snap to fire up and wait for external commands. Essentially, this is like doing multiple alignments separated by commas, except that you don’t need to know what they are ahead of time. Do:

snap-aligner daemon

and then send commands using the SNAPCommand app:

SNAPCommand single my-giant-index input.fq -o output.bam -map

SNAPCommand paired my-giant-index input2a.fq input2b.fq -o output2.bam

SNAPCommand single different-index input3.fq -o output3.bam -map

This does the same three alignment runs with the same two index loads as the example with the commas, but in this case you don’t need to know up front what they will be, so you can leave SNAP running with the index loaded as new work comes in. The output that would ordinarily come from the SNAP application instead comes from SNAPCommand. It’ll only do one alignment at a time. “SNAPCommand exit” makes the daemon exit.
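A script that feeds work to an already-running daemon could look roughly like the following. This is a sketch under assumptions: `snap_command` and `run_jobs` are hypothetical helper names, the script assumes `snap-aligner daemon` was started separately, and it assumes each SNAPCommand invocation returns only after its alignment has finished. Jobs are sent strictly one after another because the daemon handles one alignment at a time.

```python
import subprocess


def snap_command(job):
    """Prefix a job's arguments with the SNAPCommand client name."""
    return ["SNAPCommand", *job]


def run_jobs(jobs, shutdown=False, run=subprocess.run):
    """Send jobs to a running daemon one at a time.

    With shutdown=True, 'SNAPCommand exit' is sent afterwards; by default
    the daemon is left running with its index loaded for later work.
    """
    sent = []
    for job in jobs:
        cmd = snap_command(job)
        run(cmd, check=True)  # assumed to block until this alignment is done
        sent.append(cmd)
    if shutdown:
        cmd = snap_command(["exit"])
        run(cmd, check=True)
        sent.append(cmd)
    return sent


# Example use (requires a running daemon and SNAPCommand on PATH):
# run_jobs([["single", "my-giant-index", "input.fq", "-o", "output.bam", "-map"]])
```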

It all does work much better on Windows. The index load time using -map for an index that’s already in cache is ~30s for a human genome sized index, so probably about two minutes for your larger one. -pre isn’t necessary because Windows’ virtual memory system does a much better job of efficiently faulting in files, too. Getting a new index loaded is still limited by the IO system bandwidth, though, which for most machines is < 200 MB/s, so 128GB would be > 10 minutes.
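The bandwidth estimate can be worked through in a few lines and compared with the load time in the original report. This sketch assumes decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes) and uses the 11470 s figure from the posted log; the variable names are illustrative only.

```python
INDEX_BYTES = 128e9        # 128 GB index, assuming decimal gigabytes
BANDWIDTH_BPS = 200e6      # 200 MB/s, the upper bound quoted above
OBSERVED_LOAD_S = 11470    # index load time from the original log

# Best-case time to read a cold index at the quoted bandwidth.
min_load_s = INDEX_BYTES / BANDWIDTH_BPS

# Throughput actually achieved in the reported run.
effective_mb_per_s = INDEX_BYTES / OBSERVED_LOAD_S / 1e6

print(f"minimum load time at 200 MB/s: {min_load_s:.0f} s ({min_load_s / 60:.1f} min)")
print(f"effective throughput of the reported run: {effective_mb_per_s:.1f} MB/s")
print(f"slowdown versus the 200 MB/s bound: {OBSERVED_LOAD_S / min_load_s:.1f}x")
```

Under these assumptions the reported load is well over ten times slower than the bandwidth bound would predict, even though the index sat on a RAM disk.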

--Bill
