RWilton / Arioc

Arioc: GPU-accelerated DNA short-read alignment
BSD 3-Clause "New" or "Revised" License

Allocation error #7

Closed: standage closed this issue 5 years ago

standage commented 5 years ago

Greetings!

I'm testing Arioc on several single-end read sets. Encoding the reference and the FASTQ files completed with no apparent errors, but the AriocU invocation terminated within seconds with the following error message.

134823012 [0000bdac] AriocU v1.30.2427.18290 (release)
134823013 [0000bdac] Copyright (c) 2015-2018 Johns Hopkins University.  All rights reserved.
134823013 [0000bdac]  data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5
134823013 [0000bdac]  executable file      : 
134823013 [0000bdac]  configuration file   : /scratch/standage/temp/config/align-unpaired.xml
134823013 [0000bdac] ApplicationException ([0x48556] CppCommon/WinGlobalPtr.h 134): realloc failed to allocate 0 bytes
134823013 [0000bdac] AriocU ends (1)

I tried fiddling with the batchSize configuration a bit, but this doesn't seem to make a difference. I wasn't hopeful anyway since the problem looks like it's related to attempting to allocate zero bytes as opposed to attempting to allocate too many bytes.

Is this an error with which you're familiar? Are there any obvious mistakes in the configuration files?

Config files

RWilton commented 5 years ago

Hello, Daniel --

I'll have a look tomorrow (Thursday).

At first glance, the only anomaly I see in the .cfg files is the GPU mask. Am I mistaken to assume that you have four GPUs mapped as device 0, 1, 2, and 3, and that you wish to use all of them? If that is what you want, then the mask should be 0x000F (bits 0-3 set).
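
To illustrate the mask arithmetic: each GPU device number corresponds to one bit, so devices 0 through 3 give 0x1 | 0x2 | 0x4 | 0x8 = 0x000F. A minimal C++ sketch of that calculation follows. Note that makeGpuMask is a hypothetical helper, not part of Arioc, and the 32-bit mask width is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Hypothetical helper (not part of Arioc): build a bitmask in which
// bit i is set when GPU device i should be used.
// Assumption: the mask is 32 bits wide, so device numbers 0..31 are valid.
inline std::uint32_t makeGpuMask(std::initializer_list<unsigned> deviceIds)
{
    std::uint32_t mask = 0;
    for (unsigned id : deviceIds)
    {
        assert(id < 32);                    // stay within the 32-bit mask
        mask |= (std::uint32_t{1} << id);   // set the bit for this device
    }
    return mask;
}
```

For example, makeGpuMask({0, 1, 2, 3}) yields 0x000F, while makeGpuMask({1, 3}) yields 0x000A.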

In the meantime, could you please tell me about the hardware you're using (especially system RAM, GPUs, and disk hardware [HDDs, SSDs, NVMe, whatever])? And could you send along the output from the AriocE genome-encoding runs as well?

Also, you might re-run AriocU with verboseMask="0xE000031F" in the AriocU element. That enables a detailed function-call and memory-management trace that might give us an idea where things are going wrong. If you do this, please send me the output.

Thanks.

standage commented 5 years ago

The machine I'm using has 4 NVIDIA Tesla P100 GPUs (3584 CUDA cores each, 14336 cores total), 1TB of RAM, and 56 CPUs. The disk partition I'm running on is a spinning hard drive with about 3TB capacity. The machine runs CentOS 7.

RWilton commented 5 years ago

I was unable to reproduce this problem, and after four weeks I am going to assume that "no news is good news" at your end.

If the problem recurs, let us know and please send along some troubleshooting output (see above).

Thanks!

standage commented 5 years ago

Thanks for the note. I actually didn't notice all the edits you made to your original response, even though many (if not all) came before my reply. Sorry about that.

> Am I mistaken to assume that you have four GPUs mapped as device 0, 1, 2, and 3, and that you wish to use all of them? If that is what you want, then the mask should be 0x000F (bits 0-3 set).

That is correct, thanks for the pointer. The documentation for gpuMask simply says "bits corresponding to GPU devices" and I mistakenly assumed the mask was a bit vector similar to the SAM format's flag field.

> Also, you might re-run AriocU with verboseMask="0xE000031F" in the AriocU element.

This is helpful. There were only a handful (about 5) of additional function/memory calls before it failed, so the failure happens very early in program execution. I've made a few changes to the data since creating this ticket (for one, I now have the data properly split into paired files). So I'm going to re-run things from a clean slate and re-capture all my logs before I come back here for troubleshooting.

Cheers!

standage commented 5 years ago

I've re-run all commands from scratch and captured the debugging output. All the config files and logs are posted here. The error message in question is in align-reads.log, and although I'm now using AriocP this is precisely the error I was getting originally with AriocU.

Also included below for convenience.

155528478 [00005daa] AriocP v1.30.2486.18361 (release)
155528478 [00005daa] Copyright (c) 2015-2018 Johns Hopkins University.  All rights reserved.
155528478 [00005daa]  data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5
155528478 [00005daa]  executable file      : /scratch/standage/arioc-human/bin/AriocP
155528478 [00005daa]  configuration file   : /scratch/standage/arioc-human/config-align-reads.xml
155528479 [00005daa] WinGlobalPtr allocated 16408 bytes at 0x00000000015d5060
155528479 [00005daa] WinGlobalPtr allocated 0 bytes at 0x00000000015d00f0
155528479 [00005daa] WinGlobalPtr allocated 640 bytes at 0x00000000015d9080
155528479 [00005daa] WinGlobalPtr freed 640 bytes at 0x00000000015d9080
155528479 [00005daa] WinGlobalPtr freed 16408 bytes at 0x00000000015d5060
155528479 [00005daa] ApplicationException ([0x00023978] CppCommon/WinGlobalPtr.h 134): realloc failed to allocate 0 bytes
155528479 [00005daa] AriocP ends (1)

RWilton commented 5 years ago

Thanks for taking the time to share your output log files.

Unfortunately, as you can infer from the log, the fault you are seeing occurs during program initialization, that is, somewhere in the first few milliseconds of execution during which initialization of static C++ variables is occurring and the program is "sniffing" its environment for hardware resources.

I tried and failed to reproduce the error using the same AriocP build under GNU/Linux, so the next step will be to try to hunt down a local machine with the requisite GPU hardware and with the same OS (CentOS) and C++ compiler that you're using.

Alternatively, if you can spare a few CPU cycles and are willing to let me SSH over to your test machine, I'll put the thing into a debugger and see where it fails.

RWilton commented 5 years ago

I am still unable to reproduce the error you are seeing with AriocP using a machine with CentOS Linux release 7.5.1804.

If you want to organize a way for me to debug on your computer, then please let me know. Alternatively, we might just be able to figure out what's failing if you run AriocP under gdb and send me a stack trace when it fails.

standage commented 5 years ago

Thanks for your responses.

Unfortunately, I'm unable to provide you access to the machine. I will fire up gdb and send a stack trace ASAP, although the machine is down for maintenance at the moment. More soon!

UPDATE: Unfortunately the gdb stack trace didn't reveal anything new. It simply reported "No stack."

standage commented 5 years ago

It appears to be an issue with parsing the config file, particularly the <A> block. Digging a bit deeper and liberally applying printf statements, I've isolated the problematic line in src/CppCommon/RaiiDirectory.cpp. My line numbers are a bit off due to all the debugging statements I've added, but it looks like this.

// shrink the buffer that contains the filenames
Buffer.Realloc( ofsNextInBuf, false );

I'm not familiar with the object types used here, so I've been unable to print out the contents of Buffer or ofsNextInBuf to help with debugging. :-(

RWilton commented 5 years ago

Well, the fact that the error is reproducible only by you poses a conundrum, since I can't run a debugger in your runtime environment nor elicit the error on any of the Windows or Linux systems to which I have access.

So let's step back for a moment.

Can you please encode and align the S. cerevisiae sample in the Arioc distribution? This should take only about 5 minutes start-to-finish, and you need not change any code or data in order to try it. Whether or not the same memory-allocation error occurs will help me identify where best to look for a possible cause for the problem.

Thanks again.

standage commented 5 years ago

Good news. The S. cerevisiae demo data worked just fine. I revisited my AriocP config file after that and was able to isolate a culprit. When overwrite="True" is included in the <A> tag, things run fine. When that attribute is missing, the original allocation error occurs, regardless of the contents (or lack of contents) in the output directory.

So there appears to be a minor bug to fix there, but for my purposes I'll be able to proceed without any issues.

Thanks!

RWilton commented 5 years ago

I finally found a machine on which this allocation error is reproducible. The error is related to the way in which some Linux implementations of malloc() and realloc() behave when called with a byte count of zero.
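
For context, the result of calling realloc() with a byte count of zero differs between C library implementations. On some (glibc is one), realloc(p, 0) frees the block and returns a null pointer, which a wrapper can mistake for an out-of-memory failure. The workaround actually used in Arioc is not shown in this thread. The following is only a minimal sketch of one common guard, in which resizeBlock is a hypothetical function and not Arioc's WinGlobalPtr code.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical wrapper (not Arioc's WinGlobalPtr): resize a malloc-family
// block without ever passing a byte count of zero to realloc().
inline void* resizeBlock(void* p, std::size_t cb)
{
    if (cb == 0)
    {
        // A zero-byte request is not an error: release the block and
        // represent the empty buffer with a null pointer.
        std::free(p);
        return nullptr;
    }

    void* q = std::realloc(p, cb);
    if (q == nullptr)
        throw std::bad_alloc();   // genuine failure; p is still valid here

    return q;
}
```

With this guard, shrinking a buffer to zero bytes returns a null pointer without raising an exception, and a later call with a nonzero size allocates a fresh block.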

The current Arioc release (v1.30.2497) contains a workaround that eliminates the error.

Thank you for your patience and for pointing me in the right direction by experimenting with the sample configuration files in the Arioc distribution.