RWilton / Arioc

Arioc: GPU-accelerated DNA short-read alignment
BSD 3-Clause "New" or "Revised" License

Cuda pinned memory error when switching to 2 GPUs #8

Closed: boconnell89 closed this issue 4 years ago

boconnell89 commented 4 years ago

I've been benchmarking AriocP with one GPU (GTX 1070) and things were going fine. However, when I added a second GPU (also a 1070) to the system, I get the following error:

ApplicationException ([0x00006008] C:\Projects VS120\Arioc\CudaCommon\CudaPinnedPtr.h 106): Unable to allocate page-locked system memory (CUDA "pinned" memory). Please ensure that there is sufficient system memory to execute this program.

I get this even when only using one or the other GPU. The system has 32GB of RAM, but the reference is fairly small (hg38 chr10 only), so I thought it would work. I've also tried with the S. cerevisiae example data, with the same error. I've also played around with the batchSize. Note this is on Windows at the moment.

RWilton commented 4 years ago

That out-of-memory error occurs when Arioc is allocating page-locked system RAM and mapping it into the CUDA address space.
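For context, the failing call is essentially the CUDA runtime's page-locked host allocation. A minimal sketch of that kind of allocation (generic CUDA runtime calls, not the actual Arioc code) looks like this:

```cpp
// Minimal sketch of a mapped, page-locked ("pinned") host allocation.
// Generic CUDA runtime calls only; this is not the actual Arioc code.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);     // allow pinned memory to be mapped into the device address space

    const size_t cb = 512ull << 20;            // hypothetical 512MB buffer
    void* pHost = nullptr;
    cudaError_t rv = cudaHostAlloc(&pHost, cb, cudaHostAllocMapped | cudaHostAllocPortable);
    if (rv != cudaSuccess)
    {
        // a failure here is the kind of condition that surfaces as an
        // "Unable to allocate page-locked system memory" error
        std::fprintf(stderr, "cudaHostAlloc(%zu bytes): %s\n", cb, cudaGetErrorString(rv));
        return 1;
    }

    void* pDev = nullptr;                      // device-side view of the same pinned buffer
    cudaHostGetDevicePointer(&pDev, pHost, 0);
    std::printf("pinned host buffer %p mapped to device address %p\n", pHost, pDev);

    cudaFreeHost(pHost);
    return 0;
}
```

A failure at that point can occur even when plenty of ordinary pageable RAM is still free, because the operating system has to be able to lock the entire buffer in physical memory.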

Assuming you have sufficient system RAM for this operation, one thing you can try is to serialize the memory allocations for the Arioc lookup tables. Would you please re-run Arioc with the following added to its config file, just after the <AriocU> or <AriocP> element and before the <R> element:

<X serialLUTinit="1" />

This will cause Arioc to load the LUTs one after another instead of concurrently on different threads. I realize this is "voodoo" but if there is some non-thread-safe memory-allocation code somewhere in your system that is precipitating the exception you are seeing, this could work around the problem with only a small decrease in overall speed.
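Illustratively, the effect of the switch is something like the following sketch (made-up names, not the Arioc implementation):

```cpp
// Sketch of the idea behind serialLUTinit: serialize otherwise-concurrent
// lookup-table loads behind a mutex.  All names here are illustrative.
#include <mutex>

static std::mutex lutInitLock;          // shared by all LUT-loader threads
static bool serialLUTinit = true;       // corresponds to <X serialLUTinit="1" />

static void allocateAndLoadLUT(int lutId)
{
    // pinned-memory allocation and file I/O for one lookup table would go here
    (void)lutId;
}

void loadLUT(int lutId)
{
    if (serialLUTinit)
    {
        // with the switch set, only one loader thread allocates and loads at a time
        std::lock_guard<std::mutex> lock(lutInitLock);
        allocateAndLoadLUT(lutId);
    }
    else
    {
        // default behavior: loader threads allocate and load concurrently
        allocateAndLoadLUT(lutId);
    }
}
```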

If the problem persists, then there might indeed be insufficient system RAM to complete the operation. What does a memory-consumption monitor like top (in Linux) or Performance Monitor (in Windows) tell you about the Arioc process?

boconnell89 commented 4 years ago

Well, the serial LUT loading didn't work, so I tried physically removing the second card, and now it works. Looking at Task Manager, it takes about 10.4GB of memory (out of 32GB system total), which seems to agree with the debugging output (for S. cerevisiae):

```
205159468 [00004128] tuGpu::loadR: initialized R buffer (10409584 bytes) in CUDA global memory
205159678 [00004128] CudaGlobalAllocator::CudaGlobalAllocator: CudaGlobalAllocator uses 7103237324 bytes
```
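For what it's worth, the device-side figure can be cross-checked with a quick generic CUDA query like this (nothing Arioc-specific):

```cpp
// Generic CUDA query to report free vs. total global memory per device,
// e.g. to see how much of a GTX 1070's 8GB an allocator has claimed.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);
    for (int d = 0; d < nDevices; ++d)
    {
        cudaSetDevice(d);
        size_t cbFree = 0, cbTotal = 0;
        cudaMemGetInfo(&cbFree, &cbTotal);
        std::printf("GPU %d: %zu bytes free of %zu bytes total\n", d, cbFree, cbTotal);
    }
    return 0;
}
```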

It's not a huge deal, since I'm primarily benchmarking on my home system to get a sense of the performance vs. bwa-mem.

RWilton commented 4 years ago

Ok, thanks for the info.

A couple of questions:

In any event, if Arioc currently runs on one GPU, you can certainly "get a sense" of its performance. You should expect an increase in throughput of 1.8-1.9x by adding a second GPU (assuming, of course, that you have a big enough number of reads for it to matter) if and when you get that configuration to work.

If you would like me to help you troubleshoot the two-GPU configuration, please send me Arioc's output for a successful run on one GPU so I'll have some concrete numbers in regard to memory-buffer sizes.

boconnell89 commented 4 years ago

I've uploaded the log file from a successful run. Looks like the bigger reference was 5.0 GB. I'm at least interested in getting 2 GPUs working, as with one GPU my desktop is rivaling a dual-socket Xeon.

1gpu_log.txt

With a 3.5x10^7 read dataset mapping to a human chromosome, I seem to be maxing out at about a 1.7x speedup over a single GPU, and a bit slower if you count having to merge the resulting SAM files (am I perhaps missing an option for outputting the results to a single file?).

RWilton commented 4 years ago

Thank you for the log of your test run with the S. cerevisiae genome. It all looks pretty nominal.

I would hesitate to say anything about speed with this data, of course, since the software spends almost 4 times as much time initializing the GPUs and loading the lookup tables as it does computing alignments:

```
220102601 [000035ec] Elapsed:
220102602 [000035ec] total             : 7397
220102603 [000035ec] initialize GPUs   : 161
220102604 [000035ec] partition Q files : 2
220102605 [000035ec] load/unload R,H,J : 5666
220102606 [000035ec] aligners          : 1568 (127551 Q/second)
```

The aligner throughput (reads per second) seems reasonable for a commodity GPU with only 30,000 threads, although again I think 1 1/2 seconds is a pretty short interval from which to draw any conclusions about throughput.
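To put rough numbers on that, here is a back-of-the-envelope using the breakdown above, under the purely illustrative assumption that a second GPU would halve only the aligner phase:

```cpp
// Back-of-the-envelope only: what the run above would look like if a second
// GPU halved just the aligner phase (an illustrative assumption, not a
// statement about how AriocP actually scales).
#include <cstdio>

int main()
{
    const double msInit    = 161;    // initialize GPUs
    const double msQfiles  = 2;      // partition Q files
    const double msLUT     = 5666;   // load/unload R,H,J
    const double msAligner = 1568;   // aligners

    const double msFixed  = msInit + msQfiles + msLUT;
    const double msOneGpu = msFixed + msAligner;       // 7397 ms, matching the log
    for (int nGPU = 1; nGPU <= 2; ++nGPU)
    {
        const double msTotal = msFixed + msAligner / nGPU;
        std::printf("%d GPU(s): ~%.0f ms total (%.2fx overall)\n", nGPU, msTotal, msOneGpu / msTotal);
    }
    return 0;
}
```

With the fixed setup cost dominating a run this small, even perfect scaling of the aligner phase barely moves the total, which is another reason not to read much into multi-GPU throughput at this scale.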

I don't understand your comments in regard to mapping to a single human chromosome, but in any case even 35 million reads are probably not a sufficient test of throughput. After all, Arioc was designed to align upward of a billion reads to the entire human reference genome, so it seems to me you're still working on a scale that's a couple of orders of magnitude too small to be confident about having obtained a reliable speed estimate.

Of course, if your data doesn't scale to that size anyway, then you probably don't need Arioc. In that case you should probably choose BWA or Bowtie, both of which are fast, reliable CPU-only implementations that should perform well with 16 CPU threads and 32GB of memory.

In any case, please let me know if I can help you troubleshoot the 2-GPU problem. I am still curious what happens when both GPUs are installed and you specify a 1-bit GPU mask in the Arioc config file.
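For reference, a 1-bit mask simply restricts the run to a single device ordinal; decoded in the conventional way (a generic sketch, not a quote of the Arioc source), it amounts to:

```cpp
// Generic sketch of decoding a GPU bitmask into CUDA device ordinals
// (conventional interpretation of such a mask, not the Arioc source):
//   0x00000001 -> device 0 only, 0x00000002 -> device 1 only, 0x00000003 -> both
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const unsigned int gpuMask = 0x00000001;    // a 1-bit mask: use device 0 only

    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);
    for (int d = 0; d < nDevices; ++d)
    {
        if (gpuMask & (1u << d))
        {
            cudaSetDevice(d);
            cudaDeviceProp props;
            cudaGetDeviceProperties(&props, d);
            std::printf("using GPU %d: %s\n", d, props.name);
        }
    }
    return 0;
}
```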