Illumina / Isaac3

Aligner for sequencing data

isaac: reduce I/O on himem machines #7

Closed sklages closed 8 years ago

sklages commented 8 years ago

Hi, I want to distribute different isaac mappings (human/mouse) across several servers, all writing to the same (high-performance) storage system during alignment. That is our computing setup. We have a lot of servers with more than 500G of RAM.

How do I choose the concurrency parameters to produce as little I/O as possible? I see that multiple jobs on multiple servers dramatically degrade I/O performance on the shared filesystem.

I also see a lot of warnings like this in isaac's output:

2016-09-16 00:17:48  [7f866ebe5700] WARNING: Holding up processing of bin: BinMetadata(200id ReferencePosition(15:40562784:0f)bs 10140696bl 708874476ds 0do 0se 1277975rs 1250093f /path/to/temp/bin-00000016-00000200.dat) until std::bad_alloc clears. Error data: 982017348

2016-09-16 00:19:39  [7f88621da700] WARNING: Holding up processing of bin: BinMetadata(217id ReferencePosition(16:60844176:0f)bs 10140696bl 796554434ds 0do 0se 1434240rs 1408124f /path/to/temp/bin-00000017-00000217.dat) until std::bad_alloc clears. Error data: 1103634210

2016-09-16 00:25:44  [7f86703e8700] WARNING: Holding up processing of bin: /path/to/temp/bin-00000000-00000000.dat until a load slot is available

2016-09-16 00:26:03  [7f86713ea700] WARNING: Holding up processing of bin: /path/to/temp/bin-00000000-00000000.dat until a load slot is available

and runtimes of more than 10h (usually 2h at most).

This happened with:

--input-concurrent-load 10 
--temp-concurrent-load 10 
--output-concurrent-save 4 
--temp-concurrent-save 4

I tried different concurrency settings, with similar results.

So when I run isaac with one or two jobs, everything is fine. When I try to run 20 jobs on 10 servers writing to the same filesystem, performance decreases dramatically. So I'd like to optimise the concurrency parameters to reduce file I/O if possible.

Any idea what to do? I could use /dev/shm, but I would really like to avoid that.

rpetrovski commented 8 years ago

Having iSAAC temp on the network is a really bad idea. Even if the throughput of the network is sufficient (and it looks like in your case the network can only sustain two concurrent iSAAC runs), you will suffer from the high latency of network operations. We normally use locally attached SSDs to avoid the temp file IO bottleneck.

There is a certain amount of data per sample (about 2-3x the input size) that has to go into temp and be read from temp, so there is nothing you can do by playing with IO concurrency settings.

If you don't have local storage, I suppose with 500G you can try using a RAM disk. Then it depends on how big your sample is. With iSAAC-03 you should be able to limit isaac-align with -m60; then you have at least 400G of RAM left, which might be just enough to store the temp files for some applications.
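As a rough illustration of that budget, here is a minimal Python sketch. It only encodes the numbers mentioned in this thread (temp data of about 2-3x the input size, a 500G machine, -m60). The function name `ramdisk_budget` and the 40G headroom for the OS and other processes are assumptions made for the example, not anything iSAAC provides or requires.

```python
def ramdisk_budget(input_gb, total_ram_gb=500.0, aligner_limit_gb=60.0,
                   headroom_gb=40.0, temp_factor_range=(2.0, 3.0)):
    """Estimate whether iSAAC temp files for one sample could fit on a RAM disk.

    Hypothetical helper: headroom_gb is an assumed reserve for the OS and
    other processes; temp_factor_range is the "about 2-3x the input size"
    rule of thumb from this thread.
    """
    # RAM left for the RAM disk after the aligner limit (-m) and the reserve.
    available_gb = total_ram_gb - aligner_limit_gb - headroom_gb
    low_factor, high_factor = temp_factor_range
    temp_low_gb = input_gb * low_factor
    temp_high_gb = input_gb * high_factor
    return {
        "available_gb": available_gb,
        "temp_low_gb": temp_low_gb,
        "temp_high_gb": temp_high_gb,
        # True if even the pessimistic (3x) estimate fits.
        "fits_worst_case": temp_high_gb <= available_gb,
        # True if at least the optimistic (2x) estimate fits.
        "fits_best_case": temp_low_gb <= available_gb,
    }


if __name__ == "__main__":
    # Example: a sample with 100G of input on a 500G machine run with -m60.
    for key, value in ramdisk_budget(100).items():
        print(f"{key}: {value}")
```

If `fits_worst_case` is false but `fits_best_case` is true, the run might or might not fit, so it would be worth testing with a single sample first. Note that this covers one job per server; two concurrent jobs on the same machine would need both the aligner limit and the temp estimate doubled.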

On a separate note: "until std::bad_alloc clears" warnings indicate that you could theoretically make bam generation faster by running isaac-align with a higher -m. Since each bin can be processed independently in bam generation, iSAAC will try to load as many of them as -m allows. If the memory limit prevents bins from being loaded, they will be loaded at a later stage, after the earlier bins have been processed and written out into the output files. This, however, is very secondary to avoiding temp on the network.

Roman.

sklages commented 8 years ago

Thanks Roman, that's bad news .. so our setup is kind of "incompatible" with isaac's mode of operation :-( I'll think about using a RAM disk instead, but this brings other issues for me ..