RWilton / Arioc

Arioc: GPU-accelerated DNA short-read alignment
BSD 3-Clause "New" or "Revised" License
55 stars 8 forks source link

AriocP hangs when aligning WGBS data #30

Closed RWilton closed 1 year ago

RWilton commented 1 year ago

[@aedera This is a followup to issue #29.]

I am experiencing problems when using AriocP with the encoding generated by the new build. It runs without throwing errors but it gets stuck during the mapping.

This is the conf file used by AriocP

<?xml version="1.0" encoding="utf-8"?>
<AriocP batchSize="48k">
  <R>../ba-encode.m</R>
  <nongapped seed="ssi84_2_30_CT" />
  <gapped seed="hsi25_0_30_CT" Wmxgs="2_6_5_3" Vt="L,-25,2" />
  <Q filePath="../ba-encode.m">
    <paired subId="1">
      <file>SRR770598_1</file>
      <file>SRR770598_2</file>
    </paired>
  </Q>
  <A overwrite="true" cigarFormat="MIDS">
    <sam report="cdru">.</sam>
  </A>
</AriocP>

And this is the output before stopping it after being there for a couple of hours.

120428160 [00064cd1] AriocP v1.51.3129.22314 (release) +2022-11-10T17:27:27
120428164 [00064cd1] Copyright (c) 2015-2022 Johns Hopkins University.  All rights reserved.
120428164 [00064cd1]  compiled             : 2022-11-11T11:14:17 (GNU g++ v9.4.0)
120428164 [00064cd1]  data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian)
120428164 [00064cd1]  computer name        : camus
120428164 [00064cd1]  operating system     : Ubuntu 20.04.4 LTS
120428164 [00064cd1]  executable file      : /mnt/ssd1/aedera/ba-nvbs/ba-hg38.m/ba-arioc.m/testing_new_version_of_arioc.m/ca-mapping.m/AriocP
120428164 [00064cd1]  configuration file   : /mnt/ssd1/aedera/ba-nvbs/ba-hg38.m/ba-arioc.m/testing_new_version_of_arioc.m/ca-mapping.m/file.cfg
120438233 [00064cd1] AriocBase::loadR: load R from disk...
120438233 [00064cd1] AriocBase::loadR: reference file (subId 0) ignored: ../ba-encode.m/SRR770598_1$a21.sbf
120438233 [00064cd1] AriocBase::loadR: reference file (subId 0) ignored: ../ba-encode.m/SRR770598_2$a21.sbf
120439568 [00064cd1] AriocBase::loadR: loaded 24 R sequences: 2353273816 bytes (2.2GB) from 48 files in 1335ms (1681.1MB/s)
120439568 [00064cd1] AriocBase::loadHJ: load H and J LUTs from disk...
120439568 [00064cf1] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5368709120 bytes, ... ) on thread 412913
120442406 [00064cf1] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
120444310 [00064d03] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 29387864448 bytes, ... ) on thread 412931
120455980 [00064d03] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
120506723 [00064d73] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5368709120 bytes, ... ) on thread 413043
120509885 [00064d73] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
120512550 [00064d87] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 29407356656 bytes, ... ) on thread 413063
120526951 [00064d87] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
120554069 [00064cd1] AriocBase::loadHJ: loaded H and J LUTs from disk in 74501ms (890.1MB/s)
120554069 [00064dfc] void tuWatchdog::main: watchdog interval 0s
120555059 [00064dfd] AriocBase::UpdateProgress:   0%: 0 pairs (0 mates) aligned
Command terminated by signal 2
21419.89user 82.51system 5:59:50elapsed 99%CPU (0avgtext+0avgdata 70373596maxresident)k
7560inputs+14464outputs (33major+21161134minor)pagefaults 0swaps

It seems that two files are being skipped, but not sure if this is expected. Is my conf file okay?

RWilton commented 1 year ago

[@aedera This is a followup to issue https://github.com/RWilton/Arioc/issues/29.]

Ok, here is what I think is happening:

First, a minor problem: when you place the encoded read-sequence files (SRR770598_*.sbf) in the same subdirectory as the encoded reference genome files (chr*.sbf), AriocP thinks that those encoded read sequences are part of the reference genome. But nothing unexpected happened when you ran AriocP because the associated reference-genome metadata file contained only information about the reference genome files, so AriocP ignored the encoded reads and emitted the warning message you noticed.

A better arrangement is to place the encoded reference-genome files generated by AriocE in a location where they can be referenced for all subsequent invocations of the aligner (AriocP). For example, in a compute cluster, those files should reside on a shared network device that is "visible" from any node in the cluster. In contrast, the encoded read sequences are "temporary". You can delete them as soon as you are satisfied that alignment has been completed successfully.

The "hang" while aligning is a separate problem.

When I downloaded SRR770598 and encoded it, I noticed that this WGBS sample contains several million highly-repetitive reads, that is, reads that contain long, repeating nucleotide patterns. In AriocE's output, these are reads with the lowest triplet complexity scores:

134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: Triplet complexity score (c3) distribution for ./raw/SRR770598_2.fastq:
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution:  c3 nReads
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution:  1   634889
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution:  2    95792
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution:  3    44607
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution:  4    38370
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution:  5    32723
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution:  6    31476
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution:  7    29227
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution:  8    25664
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution:  9    31652
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 10    31817
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 11    40243
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 12    36308
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 13    48386
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 14    53414
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 15    59236
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 16    68848
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 17    82644
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 18   100039
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 19   118654
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 20   147862
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 21   184935
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 22   240704
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 23   305551
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 24   393810
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 25   509490
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 26   629161
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 27   829299
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 28  1048950
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 29  1317881
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 30  1738504
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 31  2215426
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 32  2945025
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 33  3686959
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 34  5176483
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 35  6411811
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 36  8371741
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 37 11786896
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 38 13863166
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 39 18286149
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 40 21264246
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 41 23203187
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 42 23719140
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 43 25681215
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 44 23445197
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 45 22330489
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 46 16697866
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 47 13238876
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 48  9424200
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 49  8523448
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 50  6692108
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 51  6285061
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 52  5600022
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 53  3614699
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 54  2496991
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 55  1257269
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 56   363107
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 57    45776
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 58     6764
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 59      607
134845573 [0011e786] tuEncodeFASTQ::summarizeC3Distribution: 60       24

Now, imagine what happens when the aligner tries to use seed-and-extend to look for alignments for those very repetitive reads. Those reads contain little or no information and hence are impossible to align, but the aligner doesn't know that! Instead, it proceeds to extract seeds from those sequences, and each of those seeds is associated with many possible reference genome locations -- and all that seed-and-extend alignment computation uses GPU memory.

With this in mind, I tried to align SRR770598 with your configuration file. AriocP promptly threw a CUDA memory exception:

190222166 [0037e766] ApplicationException ([0x03663718] CudaCommon/CudaGlobalAllocator.cpp 104): unable to allocate 11482009600 bytes (2638435712 bytes available)

So now we had a "nondeterministic failure mode" (hangs on one machine, CUDA memory error on a different machine) that suggested some kind of data-dependent resource exhaustion. Again, I had to consider whether those repetitive reads might be the cause of the problem.

Read aligners mitigate this problem by limiting the number of seed locations at which they compute alignments for a read. In Arioc, this limit is specified as the maxJ= parameter of the <gapped> and <nongapped> elements in the configuration file. So I tried to enforce a reasonable limit by adding maxJ= to your configuration:

<?xml version="1.0" encoding="utf-8"?>

<AriocP batchSize="16k" gpuMask="0x000F">
  <R>/datascope/rwilton/scratch/Arioc/R/GRCh38.p14/enc.CT</R>

  <nongapped seed="ssi84_2_30_CT" maxJ="200" />
  <gapped seed="hsi25_0_30_CT" Wmxgs="2_6_5_3" Vt="L,-25,2" maxJ="200" />

  <Q filePath="/datascope/rwilton/scratch/Arioc/Q/SRR770598/enc">
    <paired srcId="1" subId="0">
      <file>SRR770598_1</file>
      <file>SRR770598_2</file>
    </paired>
  </Q>
  <A overwrite="true" cigarFormat="MIDS">
    <sam report="cdru">./sam</sam>
  </A>
</AriocP>

And AriocP completed without errors:

191148283 [0037e8b5] AriocP v1.51.3129.22314 (release) +2022-11-10T17:27:27
191148289 [0037e8b5] Copyright (c) 2015-2022 Johns Hopkins University.  All rights reserved.
191148289 [0037e8b5]  compiled             : 2022-11-11T17:04:35 (GNU g++ v8.5.0)
191148289 [0037e8b5]  data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian)
191148289 [0037e8b5]  computer name        : volta04.idies.jhu.edu
191148289 [0037e8b5]  operating system     : Red Hat Enterprise Linux release 8.5 (Ootpa)
191148289 [0037e8b5]  executable file      : /home/rwilton/Arioc/bin/AriocP
191148289 [0037e8b5]  configuration file   : /srv/local/Arioc/A/SRR770598/AriocP.cfg
191148766 [0037e8b5] AriocBase::loadR: load R from disk...
191149486 [0037e8b5] AriocBase::loadR: loaded 27 R sequences: 2353369096 bytes (2.2GB) from 54 files in 720ms (3117.1MB/s)
191149486 [0037e8b5] AriocBase::loadHJ: load H and J LUTs from disk...
191149486 [0037e8e6] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5368709120 bytes, ... ) on thread 3664102
191151557 [0037e8e6] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
191153411 [0037e8e7] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 29388619640 bytes, ... ) on thread 3664103
191204584 [0037e8e7] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
191214719 [0037e8eb] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5368709120 bytes, ... ) on thread 3664107
191216733 [0037e8eb] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
191218580 [0037e8ec] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 29408114488 bytes, ... ) on thread 3664108
191230547 [0037e8ec] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
191241608 [0037e8b5] AriocBase::loadHJ: loaded H and J LUTs from disk in 52122ms (1272.3MB/s)
191241608 [0037e8ee] void tuWatchdog::main: watchdog interval 0s
191242149 [0037e8ef] AriocBase::UpdateProgress:   0%: 0 pairs (0 mates) aligned
192106439 [0037e8f0] AriocBase::UpdateProgress:  10%: 29563270 pairs (59126540 mates) aligned (ETA 83 minutes)
192957341 [0037e8f2] AriocBase::UpdateProgress:  20%: 59123391 pairs (118246782 mates) aligned (ETA 72 minutes)
193822397 [0037e8f0] AriocBase::UpdateProgress:  30%: 88681831 pairs (177363662 mates) aligned (ETA 61 minutes)
194646960 [0037e8f0] AriocBase::UpdateProgress:  40%: 118238590 pairs (236477180 mates) aligned (ETA 52 minutes)
195553678 [0037e8f1] AriocBase::UpdateProgress:  50%: 147795349 pairs (295590698 mates) aligned (ETA 44 minutes)
200437485 [0037e8f0] AriocBase::UpdateProgress:  60%: 177353789 pairs (354707578 mates) aligned (ETA 35 minutes)
201347375 [0037e8f0] AriocBase::UpdateProgress:  70%: 206912229 pairs (413824458 mates) aligned (ETA 26 minutes)
202209575 [0037e8ef] AriocBase::UpdateProgress:  80%: 236472350 pairs (472944700 mates) aligned (ETA 17 minutes)
203042169 [0037e8f2] AriocBase::UpdateProgress:  90%: 266030790 pairs (532061580 mates) aligned (ETA 8 minutes)
204021012 [0037e8f2] AriocBase::UpdateProgress: 100%: 295584084 pairs (591168168 mates) aligned
204021067 [0037e8b5] AriocBase::releaseGpuResources: GPU LUT unload starts...
204029434 [0037e8b5] AriocBase::releaseGpuResources: GPU LUT unload complete in 8367ms
204029434 [0037e8b5]
204029434 [0037e8b5] ----------------------------
204029434 [0037e8b5]
204029451 [0037e8b5]   SAM output:
204029451 [0037e8b5]    pairs                          : 295584084 (591168168 SAM records)
204029451 [0037e8b5]     concordant pairs              : 107781045 (215562090 SAM records) (36.46%)
204029451 [0037e8b5]      with 1 mapping               :  97870919 (195741838 SAM records)
204029451 [0037e8b5]      with 2 or more mappings      :   9910126 ( 19820252 SAM records)
204029451 [0037e8b5]     discordant pairs              :   6311776 ( 12623552 SAM records)
204029451 [0037e8b5]     unmapped pairs                : 181491263 (362982526 SAM records)
204029451 [0037e8b5]      two mates mapped ("rejected"):    764391 (  1528782 SAM records)
204029451 [0037e8b5]      0 or 1 mates mapped          : 180726872 (361453744 SAM records)
204029451 [0037e8b5]    mates not in paired mappings   : 362982526
204029451 [0037e8b5]     with no mappings              : 329652836
204029451 [0037e8b5]     with 1 mapping                :  33329690
204029451 [0037e8b5]     with 2 or more mappings       :         0
204029451 [0037e8b5]    total mapped mates             : 261515332 (44.24%)
204029451 [0037e8b5]    duplicate mappings (unreported): 851684811
204029451 [0037e8b5]    estimated TLEN distribution    :
204029451 [0037e8b5]     TLEN mode (mean, sd, skewness):       100 (145.4, 50.8, 3.662)
204029451 [0037e8b5]     TLEN-discordant pairs         :      7373 (0.01%)
204029523 [0037e8b5] AriocP ends (0)

A performance-related comment on these results: the sensitivity, as measured by the fraction of reads that have proper (concordant) mappings, is unacceptably low. This means that the aligner worked hard (and ultimately failed) to find reportable mappings for a majority of the reads in the sample, and that the usual aligner heuristic of prioritizing perfect or near-perfect mappings was ineffective. This would have killed throughput and no doubt contributed to excessive consumption of CPU and GPU memory.

Given the number of unusable read sequences and the low percentage of reported proper mappings, my guess is that your threshold alignment score (Vt="L,-25,2", i.e., 175, or fewer than 4 mismatches) is too stringent for this WGBS sample. It is also possible that these reads need some preprocessing (particularly adapter and quality-score trimming) before you align them. Once you are sure about using an appropriate Vt and about having done adequate preprocessing, you can do some performance tuning with Arioc to try to get a better tradeoff between speed and sensitivity by using an even smaller value for maxJ= and/or by specifying fewer seed iterations with seedDepth=. (There is more about this in the Arioc User Guide.)

Finally: my own takeaway from this episode is that the maxJ= parameter ought to be required rather than optional -- a change that will be incorporated into the next distribution build.

--rw

aedera commented 1 year ago

Thanks for the explanations.

Now it is working. However, AriocP still hangs when 90% of the SRR770598 reads were aligned. I left AriocP running for several hours but it can not pass the 90%. Other thing is that AriocP stays there but some computation seems to be going on since the GPU is working. For all the experiments, I am using adapters-free reads.

This is the output that I am getting when using the conf file that you shared above.

AriocP file.cfg
201232790 [0036cb95] AriocP v1.51.3129.22314 (release) +2022-11-10T17:27:27
201232795 [0036cb95] Copyright (c) 2015-2022 Johns Hopkins University.  All rights reserved.
201232795 [0036cb95]  compiled             : 2022-11-11T11:14:17 (GNU g++ v9.4.0)
201232795 [0036cb95]  data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian)
201232795 [0036cb95]  computer name        : camus
201232795 [0036cb95]  operating system     : Ubuntu 20.04.4 LTS
201232795 [0036cb95]  executable file      : /mnt/ssd1/aedera/ba-nvbs/ba-hg38.m/ba-arioc.m/testing_new_version_of_arioc.m/ca-mapping.m/AriocP
201232795 [0036cb95]  configuration file   : /mnt/ssd1/aedera/ba-nvbs/ba-hg38.m/ba-arioc.m/testing_new_version_of_arioc.m/ca-mapping.m/file.cfg
201233197 [0036cb95] AriocBase::loadR: load R from disk...
201235245 [0036cb95] AriocBase::loadR: loaded 24 R sequences: 2353273816 bytes (2.2GB) from 48 files in 2048ms (1095.8MB/s)
201235245 [0036cb95] AriocBase::loadHJ: load H and J LUTs from disk...
201235245 [0036cba3] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5368709120 bytes, ... ) on thread 3591075
201238247 [0036cba3] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
201242857 [0036cba4] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 29387864440 bytes, ... ) on thread 3591076
201254634 [0036cba4] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
201314370 [0036cba5] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5368709120 bytes, ... ) on thread 3591077
201316333 [0036cba5] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
201320094 [0036cba6] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 29407356656 bytes, ... ) on thread 3591078
201332528 [0036cba6] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
201352846 [0036cb95] AriocBase::loadHJ: loaded H and J LUTs from disk in 77601ms (854.5MB/s)
201352846 [0036cba7] void tuWatchdog::main: watchdog interval 0s
201354659 [0036cba8] AriocBase::UpdateProgress:   0%: 0 pairs (0 mates) aligned
202026556 [0036cba8] AriocBase::UpdateProgress:  10%: 18592920 pairs (37185840 mates) aligned (ETA 71 minutes)
202713663 [0036cba8] AriocBase::UpdateProgress:  20%: 37185840 pairs (74371680 mates) aligned (ETA 58 minutes)
203345481 [0036cba8] AriocBase::UpdateProgress:  30%: 55778760 pairs (111557520 mates) aligned (ETA 49 minutes)
204035319 [0036cba8] AriocBase::UpdateProgress:  40%: 74371680 pairs (148743360 mates) aligned (ETA 42 minutes)
204715992 [0036cba8] AriocBase::UpdateProgress:  50%: 92964600 pairs (185929200 mates) aligned (ETA 34 minutes)
205409601 [0036cba8] AriocBase::UpdateProgress:  60%: 111557520 pairs (223115040 mates) aligned (ETA 27 minutes)
210121821 [0036cba8] AriocBase::UpdateProgress:  70%: 130150440 pairs (260300880 mates) aligned (ETA 20 minutes)
210756510 [0036cba8] AriocBase::UpdateProgress:  80%: 148743360 pairs (297486720 mates) aligned (ETA 13 minutes)
211440744 [0036cba8] AriocBase::UpdateProgress:  90%: 167336280 pairs (334672560 mates) aligned (ETA 6 minutes)
RWilton commented 1 year ago

I would troubleshoot this as follows:

1 - Please download and compile build v1.51.3130. That won't fix anything, but it will ensure that we are both using the same code base.

2 - When you compile the Arioc source code, do a debug build (not a release build). This basically means including the debug parameter in the make command line.

3 - Add verboseMask="0xE0004007" to the <AriocP> element in the configuration file. (Look in the Arioc User Guide for an explanation.)

4 - Specify a watchdog timeout interval in the AriocP configuration file, e.g., <X watchdogInterval="60" />. This <X> element should be the first child element of the <AriocP> element, that is, it should appear after <AriocP> and before <R>.

5a - When AriocP is running and before it hangs, use the Linux top or htop utility to observe AriocP's CPU utilization, and use nvidia-smi to look at GPU activity.

5b - Do the same after the program hangs.

6 - At this point, the watchdog should tell us where the program is stuck. Wait a few minutes to be sure it stays stuck, and then kill the program.

7 - Check that you have not run out of disk space.

Please send me copies of everything (config files, output logs, screen snapshots of top and nvidia-smi)!

There is a section on troubleshooting in the User Guide that might also be helpful.

Finally, if you have the time, you might run the same WGBS sample through Bismark just to get an idea of how many proper mappings it can find and how long it takes to find them.

My instinct is that the reads at the tail end of the input data might still be consuming just enough compute resources to cause AriocP to hang. If so, the "fix" would be to specify even smaller values for batchSize and/or maxJ as well as to incorporate some kind of error-handling code in Arioc that would cause it to fail informatively instead of hang. But I would first try to understand better what is going on here -- particularly since you are using Ubuntu which has a long-standing problem with multithreaded high-performance computing applications.

aedera commented 1 year ago

I followed the instructions.

Here, it is the version of AriocP used (after compiled it with the debug mode)

114733136 [0000a1f2] AriocP v1.51.3130.22316 (debug) +2022-11-12T15:42:47
114733140 [0000a1f2] Copyright (c) 2015-2022 Johns Hopkins University.  All rights reserved.
114733140 [0000a1f2]  compiled             : 2022-11-15T10:14:24 (GNU g++ v9.4.0)
114733140 [0000a1f2]  data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian)
114733140 [0000a1f2]  computer name        : camus
114733140 [0000a1f2]  operating system     : Ubuntu 20.04.4 LTS
114733140 [0000a1f2]  executable file      : /mnt/ssd2/aedera/ba-nvbs/ba-hg38.m/ba-arioc.m/testing_new_version_of_arioc.m/ca-mapping.m/AriocP
114733140 [0000a1f2] ApplicationException ([0x00041458] AriocCommon/AppGlobalCommon.cpp 175): cannot open configuration file './AriocP.cfg'
114733140 [0000a1f2] AriocP ends (1)

This is the conf file

<?xml version="1.0" encoding="utf-8"?>

<AriocP gpuMask="0x00000003" batchSize="16k" verboseMask="0xE0004007">
  <X watchdogInterval="60" />
  <R>../../../aa-data.m/ref/1_encoded</R>

  <nongapped seed="ssi84_2_30_CT" maxJ="200"/>
  <gapped seed="hsi25_0_30_CT" Wmxgs="2_6_5_3" Vt="L,-25,2" maxJ="200" seedDepth="3"/>

  <Q filePath="../ba-encode.m">
    <paired srcId="0" subId="0">
      <file>SRR770598_1</file>
      <file>SRR770598_2</file>
    </paired>
  </Q>
  <A overwrite="true" cigarFormat="MIDS">
    <sam report="cdru">.</sam>
  </A>
</AriocP>

Here the AriocP's output

104143347 [000e3b52] AriocP v1.51.3130.22316 (debug) +2022-11-12T15:42:47
104143351 [000e3b52] Copyright (c) 2015-2022 Johns Hopkins University.  All rights reserved.
104143351 [000e3b52]  compiled             : 2022-11-15T10:14:24 (GNU g++ v9.4.0)
104143351 [000e3b52]  data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian)
104143351 [000e3b52]  computer name        : camus
104143351 [000e3b52]  operating system     : Ubuntu 20.04.4 LTS
104143351 [000e3b52]  executable file      : /mnt/ssd2/aedera/ba-nvbs/ba-hg38.m/ba-arioc.m/testing_new_version_of_arioc.m/ca-mapping.m/AriocP
104143351 [000e3b52]  configuration file   : /mnt/ssd2/aedera/ba-nvbs/ba-hg38.m/ba-arioc.m/testing_new_version_of_arioc.m/ca-mapping.m/file.cfg
104149392 [000e3b52]  max CPU threads      : *
104149392 [000e3b52]  number of GPUs       : 2 (mask: 0x00000003)
104149392 [000e3b52]  batch size           : 16k (16384)
104149392 [000e3b52]  R sequence path      : ../../../aa-data.m/ref/1_encoded
104149392 [000e3b52]   DNA base conversion : CT
104149392 [000e3b52]  nongapped aligner:
104149393 [000e3b52]   seed                : ssi84_2_30_CT
104149393 [000e3b52]    hashSeedWidth      : 84
104149393 [000e3b52]    spacedSeedWeight   : 48
104149393 [000e3b52]    maxMismatches      : 2
104149393 [000e3b52]    hashKeyWidth       : 30
104149393 [000e3b52]   big-bucket threshold: 200
104149393 [000e3b52]   maxJ                : 200
104149393 [000e3b52]   seedCoverage        : *
104149393 [000e3b52]  gapped aligner:
104149393 [000e3b52]   seed                : hsi25_0_30_CT
104149393 [000e3b52]    hashSeedWidth      : 25
104149393 [000e3b52]    hashKeyWidth       : 30
104149393 [000e3b52]   W scores/penalties  : 2_6_5_3
104149393 [000e3b52]    Wm (match)         : 2
104149393 [000e3b52]    Wx (mismatch)      : 6
104149393 [000e3b52]    Wg (gap)           : 5
104149393 [000e3b52]    Ws (gap space)     : 3
104149393 [000e3b52]    Wn (N)             : 6
104149393 [000e3b52]    Wc (clip penalty)  : 0
104149393 [000e3b52]    Vt (threshold)     : 2*N - 25
104149393 [000e3b52]   big-bucket threshold: 200
104149393 [000e3b52]   maxJ                : 200
104149393 [000e3b52]   AtO                 : 1
104149393 [000e3b52]   seedDepth           : 3
104149393 [000e3b52]   minPosSep           : 5
104149393 [000e3b52]   mapqVersion         : 0 (Arioc default)
104149393 [000e3b52]   XM soft clip context: excluded
104149393 [000e3b52]   XM BQS threshold    : 0
104149393 [000e3b52]  paired-end reads
104149393 [000e3b52]   orientation         : convergent
104149393 [000e3b52]   collisions          : overlap,cover
104149393 [000e3b52]   fragment length     : 0-1000
104149393 [000e3b52]   reporting           : unmapped,concordant,discordant,rejected
104149393 [000e3b52]   maxAperRead         : 1
104149393 [000e3b52]   seed rev complement : true
104149393 [000e3b52]  Q sequences (reads)  : (2 files, paired)
104149393 [000e3b52]   nominal sub rate    : 0.25%
104149393 [000e3b52]   00000.000 mate 1    : ../ba-encode.m/SRR770598_1$a21.sbf
104149393 [000e3b52]   00000.000 mate 2    : ../ba-encode.m/SRR770598_2$a21.sbf
104149393 [000e3b52]  A (alignment output) : (1 path, 1073741824 mappings per output file, output file base name: AriocP)
104149393 [000e3b52]   SAM                 : . (unmapped,concordant,discordant,rejected)
104149393 [000e3b52]   cigarFormat         : MIDS
104149393 [000e3b52]   mdFormat            : standard
104149393 [000e3b52]  X parameters
104149393 [000e3b52]   watchdogInterval    : 60 (0x000000000000003c)
104149615 [000e3b52] AriocBase::loadR: load R from disk...
104154114 [000e3b52] AriocBase::loadR: loaded 24 R sequences: 2353273816 bytes (2.2GB) from 48 files in 4498ms (498.9MB/s)
104154114 [000e3b52] AriocBase::loadHJ: load H and J LUTs from disk...
104154114 [000e3b62] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5368709120 bytes, ... ) on thread 932706
104157258 [000e3b62] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
104203519 [000e3b62] tuLoadLUT<T>::main: loaded 5368709120 bytes (5.0GB) in 9405ms (544.4MB/s) from ../../../aa-data.m/ref/1_encoded/ssi84_2_30_CT/H230.sbf
104203519 [000e3b66] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 29387864440 bytes, ... ) on thread 932710
104215718 [000e3b66] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
104242438 [000e3b66] tuLoadLUT<T>::main: loaded 29387864440 bytes (27.4GB) in 38919ms (720.1MB/s) from ../../../aa-data.m/ref/1_encoded/ssi84_2_30_CT/J230.sbf
104242438 [000e3b7b] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5368709120 bytes, ... ) on thread 932731
104246528 [000e3b7b] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
104253417 [000e3b7b] tuLoadLUT<T>::main: loaded 5368709120 bytes (5.0GB) in 10978ms (466.4MB/s) from ../../../aa-data.m/ref/1_encoded/hsi25_0_30_CT/H030.sbf
104253417 [000e3b7c] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 29407356656 bytes, ... ) on thread 932732
104311383 [000e3b7c] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
104341603 [000e3b7c] tuLoadLUT<T>::main: loaded 29407356655 bytes (27.4GB) in 48186ms (582.0MB/s) from ../../../aa-data.m/ref/1_encoded/hsi25_0_30_CT/J030.sbf
104341604 [000e3b52] AriocBase::loadHJ: loaded H and J LUTs from disk in 107490ms (616.9MB/s)
104341604 [000e3b52] AriocP::Main: BuffersPerBinding=10 ThreadsPerGPU=10
104341604 [000e3bcd] void tuWatchdog::main: watchdog interval 60s
104341642 [000e3bce] CudaDeviceBinding::CudaDeviceBinding: CUDA device 0 (NVIDIA RTX A6000): CC 3.7/8.6; CPU thread 0x000E3BCE
104341648 [000e3bcf] CudaDeviceBinding::CudaDeviceBinding: CUDA device 1 (NVIDIA RTX A6000): CC 3.7/8.6; CPU thread 0x000E3BCF
104343450 [000e3bcf] tuGpu::loadR: initialized R buffer (2353273816 bytes (2.2GB)) at 0x00007fd618000000 in CUDA global memory
104343450 [000e3bce] tuGpu::loadR: initialized R buffer (2353273816 bytes (2.2GB)) at 0x00007fd6a6000000 in CUDA global memory
104343450 [000e3bcf] tuGpu::loadLUT: initialized Hn buffer (5368709120 bytes (5.0GB)) at 0x00007fe634000000 for device 1 (ordinal 1) in CUDA pinned memory
104343450 [000e3bce] tuGpu::loadLUT: initialized Hn buffer (5368709120 bytes (5.0GB)) at 0x00007fe634000000 for device 0 (ordinal 0) in CUDA pinned memory
104343450 [000e3bcf] tuGpu::loadLUT: initialized Jn buffer (29387864440 bytes (27.4GB)) at 0x00007fdf5c000000 for device 1 (ordinal 1) in CUDA pinned memory
104343450 [000e3bce] tuGpu::loadLUT: initialized Jn buffer (29387864440 bytes (27.4GB)) at 0x00007fdf5c000000 for device 0 (ordinal 0) in CUDA pinned memory
104343450 [000e3bcf] tuGpu::loadLUT: initialized Hg buffer (5368709120 bytes (5.0GB)) at 0x00007fde1c000000 for device 1 (ordinal 1) in CUDA pinned memory
104343450 [000e3bce] tuGpu::loadLUT: initialized Hg buffer (5368709120 bytes (5.0GB)) at 0x00007fde1c000000 for device 0 (ordinal 0) in CUDA pinned memory
104343450 [000e3bcf] tuGpu::loadLUT: initialized Jg buffer (29407356656 bytes (27.4GB)) at 0x00007fd742000000 for device 1 (ordinal 1) in CUDA pinned memory
104343450 [000e3bce] tuGpu::loadLUT: initialized Jg buffer (29407356656 bytes (27.4GB)) at 0x00007fd742000000 for device 0 (ordinal 0) in CUDA pinned memory
104343450 [000e3bce] AriocBase::UpdateProgress:   0%: 0 pairs (0 mates) aligned
104532616 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
104635624 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
104730630 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
104835638 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
104939658 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
105037665 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
105100024 [000e3bce] AriocBase::UpdateProgress:  10%: 18592920 pairs (37185840 mates) aligned (ETA 83 minutes)
105143673 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
105248679 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
105353686 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
105457693 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
105555699 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
105658705 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
105703227 [000e3bce] AriocBase::UpdateProgress:  20%: 37185840 pairs (74371680 mates) aligned (ETA 61 minutes)
105804712 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
105908718 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
110011725 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
110105593 [000e3bce] AriocBase::UpdateProgress:  30%: 55778760 pairs (111557520 mates) aligned (ETA 45 minutes)
110114733 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
110209739 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
110308746 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
110410751 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
110450746 [000e3bce] AriocBase::UpdateProgress:  40%: 74371680 pairs (148743360 mates) aligned (ETA 34 minutes)
110514762 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
110609765 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
110704771 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
110810777 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
110905782 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
110908136 [000e3bce] AriocBase::UpdateProgress:  50%: 92964600 pairs (185929200 mates) aligned (ETA 27 minutes)
111005793 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
111108799 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
111211806 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
111314814 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
111409819 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
111442006 [000e3bce] AriocBase::UpdateProgress:  60%: 111557520 pairs (223115040 mates) aligned (ETA 21 minutes)
111512824 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
111617831 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
111721837 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
111826845 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
111928851 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
111958791 [000e3bce] AriocBase::UpdateProgress:  70%: 130150440 pairs (260300880 mates) aligned (ETA 16 minutes)
112032858 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
112136865 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
112239872 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
112340878 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
112435884 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
112539892 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
112546499 [000e3bce] AriocBase::UpdateProgress:  80%: 148743360 pairs (297486720 mates) aligned (ETA 11 minutes)
112639899 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
112742907 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
112846913 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
112942918 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
113040926 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
113138933 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
113244941 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
113347949 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
113449956 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
113452866 [000e3bce] AriocBase::UpdateProgress:  90%: 167336280 pairs (334672560 mates) aligned (ETA 5 minutes)
113547963 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
113644969 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
113750977 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
113853985 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
113948990 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
114051997 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
114147004 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
114253014 [000e3bcd] watchdog stuck: 000e3bcf(932815):1@100
114349023 [000e3bcd] watchdog stuck: 000e3bce(932814):0@420 000e3bcf(932815):1@100
114451033 [000e3bcd] watchdog stuck: 000e3bce(932814):0@420 000e3bcf(932815):1@100
114552044 [000e3bcd] watchdog stuck: 000e3bce(932814):0@420 000e3bcf(932815):1@100
114654056 [000e3bcd] watchdog stuck: 000e3bce(932814):0@420 000e3bcf(932815):1@100
114752066 [000e3bcd] watchdog stuck: 000e3bce(932814):0@420 000e3bcf(932815):1@100
114858078 [000e3bcd] watchdog stuck: 000e3bce(932814):0@420 000e3bcf(932815):1@100
114957088 [000e3bcd] watchdog stuck: 000e3bce(932814):0@420 000e3bcf(932815):1@100

Now AriocP is hanged and consuming the GPU

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    Off  | 00000000:17:00.0 Off |                  Off |
| 30%   42C    P8    22W / 300W |  48676MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000    Off  | 00000000:73:00.0 Off |                  Off |
| 30%   43C    P8    25W / 300W |  48666MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2512      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      7331      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A    932690      C   AriocP                          48533MiB |
|    1   N/A  N/A      2512      G   /usr/lib/xorg/Xorg                 39MiB |
|    1   N/A  N/A      7331      G   /usr/lib/xorg/Xorg                142MiB |
|    1   N/A  N/A      7462      G   /usr/bin/gnome-shell               43MiB |
|    1   N/A  N/A      7747      G   ...bexec/gnome-initial-setup       28MiB |
|    1   N/A  N/A      8355      G   gnome-control-center               28MiB |
|    1   N/A  N/A      8838      G   /usr/lib/firefox/firefox          103MiB |
|    1   N/A  N/A    932690      C   AriocP                          48131MiB |
+-----------------------------------------------------------------------------+

I double checked that there is sufficient disk space. Also I am seeing that when AriocP reaches the 90% of progress it is not updated anymore the output SAM file.

RWilton commented 1 year ago

Thank you for taking the time to provide this troubleshooting information.

The SMI report shows a combination of graphical and compute applications running on that machine. The AriocP output log shows that the "hang" is actually occurring on GPU 1 almost immediately. This implies that there is a resource conflict between AriocP and one of the other applications executing concurrently on that GPU. GPU 0 actually does all of the alignment work except for the batch of reads that is stuck in GPU 1, but GPU 1 is obviously going nowhere so Arioc never terminates.

I have no direct experience with the particular combination of GPU applications you are running, but I have an easy-to-try suggestion before we go any further.

AriocP supports a runtime configuration parameter that limits usage of on-device memory. Since X, gnome, and firefox are consuming 383 megabytes of GPU memory on device 1, why don't you try something like the following:

<X watchdogTimeout="60" cgaReserved="512M" />

The cgaReserved= parameter tells AriocP to allocate 512 megabytes less than the total amount of available free memory on each GPU device. (Please leave the watchdogTimeout= parameter unchanged so we can see whether the device remains stuck.)

My assumption here, of course, is that the limiting resource here is GPU memory and that there is some additional consumption of GPU memory that is not visible in the SMI report. There is no way to prove this, of course, but we have seen similar CUDA memory-allocation misbehavior in the past. (That's why Arioc implements this configuration parameter.)

Nevertheless, I am not confident that this will be a "fix" here.

Our experience with Arioc is that this kind of problem is best avoided by running the GPUs in TCC mode and using another device for video output. Apart from eliminating inter-application resource conflicts, using TCC mode should give you better compute performance overall. (And you should be able to obtain excellent performance with two 48G RTX A6000 devices!)

But let's save performance tuning for later. First, let us determine whether limiting Arioc's aggressive use of GPU memory will resolve this problem quickly.

aedera commented 1 year ago

Thanks for all details and suggestions. I could not make AriocP work even when reserving up to 2GB.

202356694 [001d0c52] AriocP v1.51.3130.22316 (debug) +2022-11-12T15:42:47
202356698 [001d0c52] Copyright (c) 2015-2022 Johns Hopkins University.  All rights reserved.
202356698 [001d0c52]  compiled             : 2022-11-15T10:14:24 (GNU g++ v9.4.0)
202356698 [001d0c52]  data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian)
202356698 [001d0c52]  computer name        : camus
202356698 [001d0c52]  operating system     : Ubuntu 20.04.4 LTS
202356698 [001d0c52]  executable file      : /mnt/ssd2/aedera/ba-nvbs/ba-hg38.m/ba-arioc.m/testing_new_version_of_arioc.m/ca-mapping.m/AriocP
202356698 [001d0c52]  configuration file   : /mnt/ssd2/aedera/ba-nvbs/ba-hg38.m/ba-arioc.m/testing_new_version_of_arioc.m/ca-mapping.m/file.cfg
202357093 [001d0c52]  max CPU threads      : *
202357093 [001d0c52]  number of GPUs       : 2 (mask: 0x00000003)
202357093 [001d0c52]  batch size           : 16k (16384)
202357093 [001d0c52]  R sequence path      : ../../../aa-data.m/ref/1_encoded
202357093 [001d0c52]   DNA base conversion : CT
202357093 [001d0c52]  nongapped aligner:
202357093 [001d0c52]   seed                : ssi84_2_30_CT
202357093 [001d0c52]    hashSeedWidth      : 84
202357093 [001d0c52]    spacedSeedWeight   : 48
202357093 [001d0c52]    maxMismatches      : 2
202357093 [001d0c52]    hashKeyWidth       : 30
202357093 [001d0c52]   big-bucket threshold: 200
202357093 [001d0c52]   maxJ                : 200
202357093 [001d0c52]   seedCoverage        : *
202357093 [001d0c52]  gapped aligner:
202357093 [001d0c52]   seed                : hsi25_0_30_CT
202357093 [001d0c52]    hashSeedWidth      : 25
202357093 [001d0c52]    hashKeyWidth       : 30
202357093 [001d0c52]   W scores/penalties  : 2_6_5_3
202357093 [001d0c52]    Wm (match)         : 2
202357093 [001d0c52]    Wx (mismatch)      : 6
202357093 [001d0c52]    Wg (gap)           : 5
202357093 [001d0c52]    Ws (gap space)     : 3
202357093 [001d0c52]    Wn (N)             : 6
202357093 [001d0c52]    Wc (clip penalty)  : 0
202357093 [001d0c52]    Vt (threshold)     : 2*N - 25
202357093 [001d0c52]   big-bucket threshold: 200
202357093 [001d0c52]   maxJ                : 200
202357093 [001d0c52]   AtO                 : 1
202357093 [001d0c52]   seedDepth           : 3
202357093 [001d0c52]   minPosSep           : 5
202357093 [001d0c52]   mapqVersion         : 0 (Arioc default)
202357093 [001d0c52]   XM soft clip context: excluded
202357093 [001d0c52]   XM BQS threshold    : 0
202357093 [001d0c52]  paired-end reads
202357093 [001d0c52]   orientation         : convergent
202357093 [001d0c52]   collisions          : overlap,cover
202357093 [001d0c52]   fragment length     : 0-1000
202357093 [001d0c52]   reporting           : unmapped,concordant,discordant,rejected
202357093 [001d0c52]   maxAperRead         : 1
202357093 [001d0c52]   seed rev complement : true
202357093 [001d0c52]  Q sequences (reads)  : (2 files, paired)
202357093 [001d0c52]   nominal sub rate    : 0.25%
202357093 [001d0c52]   00000.000 mate 1    : ../ba-encode.m/SRR770598_1$a21.sbf
202357093 [001d0c52]   00000.000 mate 2    : ../ba-encode.m/SRR770598_2$a21.sbf
202357093 [001d0c52]  A (alignment output) : (1 path, 1073741824 mappings per output file, output file base name: AriocP)
202357093 [001d0c52]   SAM                 : . (unmapped,concordant,discordant,rejected)
202357093 [001d0c52]   cigarFormat         : MIDS
202357093 [001d0c52]   mdFormat            : standard
202357093 [001d0c52]  X parameters
202357093 [001d0c52]   cgaReserved         : 2147483648 (0x0000000080000000)
202357093 [001d0c52]   watchdogInterval    : 60 (0x000000000000003c)
202357141 [001d0c52] AriocBase::loadR: load R from disk...
202359684 [001d0c52] AriocBase::loadR: loaded 24 R sequences: 2353273816 bytes (2.2GB) from 48 files in 2543ms (882.5MB/s)
202359684 [001d0c52] AriocBase::loadHJ: load H and J LUTs from disk...
202359684 [001d0c61] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5368709120 bytes, ... ) on thread 1903713
202402816 [001d0c61] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
202404830 [001d0c61] tuLoadLUT<T>::main: loaded 5368709120 bytes (5.0GB) in 5145ms (995.1MB/s) from ../../../aa-data.m/ref/1_encoded/ssi84_2_30_CT/H230.sbf
202404830 [001d0c62] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 29387864440 bytes, ... ) on thread 1903714
202415739 [001d0c62] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
202425903 [001d0c62] tuLoadLUT<T>::main: loaded 29387864440 bytes (27.4GB) in 21073ms (1330.0MB/s) from ../../../aa-data.m/ref/1_encoded/ssi84_2_30_CT/J230.sbf
202425903 [001d0c63] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5368709120 bytes, ... ) on thread 1903715
202428925 [001d0c63] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
202431006 [001d0c63] tuLoadLUT<T>::main: loaded 5368709120 bytes (5.0GB) in 5103ms (1003.3MB/s) from ../../../aa-data.m/ref/1_encoded/hsi25_0_30_CT/H030.sbf
202431006 [001d0c64] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 29407356656 bytes, ... ) on thread 1903716
202444127 [001d0c64] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
202456309 [001d0c64] tuLoadLUT<T>::main: loaded 29407356655 bytes (27.4GB) in 25302ms (1108.4MB/s) from ../../../aa-data.m/ref/1_encoded/hsi25_0_30_CT/J030.sbf
202456309 [001d0c52] AriocBase::loadHJ: loaded H and J LUTs from disk in 56625ms (1171.1MB/s)
202456309 [001d0c52] AriocP::Main: BuffersPerBinding=10 ThreadsPerGPU=10
202456309 [001d0c65] void tuWatchdog::main: watchdog interval 60s
202456353 [001d0c66] CudaDeviceBinding::CudaDeviceBinding: CUDA device 0 (NVIDIA RTX A6000): CC 3.7/8.6; CPU thread 0x001D0C66
202456353 [001d0c67] CudaDeviceBinding::CudaDeviceBinding: CUDA device 1 (NVIDIA RTX A6000): CC 3.7/8.6; CPU thread 0x001D0C67
202458120 [001d0c67] tuGpu::loadR: initialized R buffer (2353273816 bytes (2.2GB)) at 0x00007fc46c000000 in CUDA global memory
202458120 [001d0c66] tuGpu::loadR: initialized R buffer (2353273816 bytes (2.2GB)) at 0x00007fc4fa000000 in CUDA global memory
202458121 [001d0c67] tuGpu::loadLUT: initialized Hn buffer (5368709120 bytes (5.0GB)) at 0x00007fd488000000 for device 1 (ordinal 1) in CUDA pinned memory
202458121 [001d0c66] tuGpu::loadLUT: initialized Hn buffer (5368709120 bytes (5.0GB)) at 0x00007fd488000000 for device 0 (ordinal 0) in CUDA pinned memory
202458121 [001d0c67] tuGpu::loadLUT: initialized Jn buffer (29387864440 bytes (27.4GB)) at 0x00007fcdb0000000 for device 1 (ordinal 1) in CUDA pinned memory
202458121 [001d0c66] tuGpu::loadLUT: initialized Jn buffer (29387864440 bytes (27.4GB)) at 0x00007fcdb0000000 for device 0 (ordinal 0) in CUDA pinned memory
202458121 [001d0c67] tuGpu::loadLUT: initialized Hg buffer (5368709120 bytes (5.0GB)) at 0x00007fcc70000000 for device 1 (ordinal 1) in CUDA pinned memory
202458121 [001d0c66] tuGpu::loadLUT: initialized Hg buffer (5368709120 bytes (5.0GB)) at 0x00007fcc70000000 for device 0 (ordinal 0) in CUDA pinned memory
202458121 [001d0c67] tuGpu::loadLUT: initialized Jg buffer (29407356656 bytes (27.4GB)) at 0x00007fc596000000 for device 1 (ordinal 1) in CUDA pinned memory
202458121 [001d0c66] tuGpu::loadLUT: initialized Jg buffer (29407356656 bytes (27.4GB)) at 0x00007fc596000000 for device 0 (ordinal 0) in CUDA pinned memory
202458121 [001d0c66] AriocBase::UpdateProgress:   0%: 0 pairs (0 mates) aligned
202647322 [001d0c65] watchdog stuck: 001d0c67(1903719):1@100
202750328 [001d0c65] watchdog stuck: 001d0c67(1903719):1@100
202845334 [001d0c65] watchdog stuck: 001d0c67(1903719):1@100
202845514 [001d0c66] AriocBase::UpdateProgress:  10%: 18592920 pairs (37185840 mates) aligned (ETA 43 minutes)

I also tried to avoid resource conflicts by running AriocP only on GPU:0, which is not being used by other applications. However, it still hangs:

200903546 [001d08ef]   00000.000 mate 1    : ../ba-encode.m/SRR770598_1$a21.sbf
200903546 [001d08ef]   00000.000 mate 2    : ../ba-encode.m/SRR770598_2$a21.sbf
200903546 [001d08ef]  A (alignment output) : (1 path, 1073741824 mappings per output file, output file base name: AriocP)
200903546 [001d08ef]   SAM                 : . (unmapped,concordant,discordant,rejected)
200903546 [001d08ef]   cigarFormat         : MIDS
200903546 [001d08ef]   mdFormat            : standard
200903546 [001d08ef]  X parameters
200903546 [001d08ef]   cgaReserved         : 2147483648 (0x0000000080000000)
200903546 [001d08ef]   watchdogInterval    : 60 (0x000000000000003c)
200903623 [001d08ef] AriocBase::loadR: load R from disk...
200906305 [001d08ef] AriocBase::loadR: loaded 24 R sequences: 2353273816 bytes (2.2GB) from 48 files in 2682ms (836.8MB/s)
200906305 [001d08ef] AriocBase::loadHJ: load H and J LUTs from disk...
200906306 [001d0900] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5368709120 bytes, ... ) on thread 1902848
200909363 [001d0900] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
200911425 [001d0900] tuLoadLUT<T>::main: loaded 5368709120 bytes (5.0GB) in 5119ms (1000.2MB/s) from ../../../aa-data.m/ref/1_encoded/ssi84_2_30_CT/H230.sbf
200911426 [001d0907] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 29387864440 bytes, ... ) on thread 1902855
200923364 [001d0907] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
200941907 [001d0907] tuLoadLUT<T>::main: loaded 29387864440 bytes (27.4GB) in 30482ms (919.4MB/s) from ../../../aa-data.m/ref/1_encoded/ssi84_2_30_CT/J230.sbf
200941908 [001d0932] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5368709120 bytes, ... ) on thread 1902898
200944828 [001d0932] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
200947130 [001d0932] tuLoadLUT<T>::main: loaded 5368709120 bytes (5.0GB) in 5222ms (980.5MB/s) from ../../../aa-data.m/ref/1_encoded/hsi25_0_30_CT/H030.sbf
200947130 [001d0939] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 29407356656 bytes, ... ) on thread 1902905
201000017 [001d0939] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
201017680 [001d0939] tuLoadLUT<T>::main: loaded 29407356655 bytes (27.4GB) in 30550ms (918.0MB/s) from ../../../aa-data.m/ref/1_encoded/hsi25_0_30_CT/J030.sbf
201017680 [001d08ef] AriocBase::loadHJ: loaded H and J LUTs from disk in 71375ms (929.1MB/s)
201017681 [001d08ef] AriocP::Main: BuffersPerBinding=10 ThreadsPerGPU=10
201017681 [001d0962] void tuWatchdog::main: watchdog interval 60s
201017710 [001d0963] CudaDeviceBinding::CudaDeviceBinding: CUDA device 0 (NVIDIA RTX A6000): CC 3.7/8.6; CPU thread 0x001D0963
201019256 [001d0963] tuGpu::loadR: initialized R buffer (2353273816 bytes (2.2GB)) at 0x00007f90ae000000 in CUDA global memory
201019256 [001d0963] tuGpu::loadLUT: initialized Hn buffer (5368709120 bytes (5.0GB)) at 0x00007fa034000000 for device 0 (ordinal 0) in CUDA pinned memory
201019256 [001d0963] tuGpu::loadLUT: initialized Jn buffer (29387864440 bytes (27.4GB)) at 0x00007f995c000000 for device 0 (ordinal 0) in CUDA pinned memory
201019256 [001d0963] tuGpu::loadLUT: initialized Hg buffer (5368709120 bytes (5.0GB)) at 0x00007f981c000000 for device 0 (ordinal 0) in CUDA pinned memory
201019256 [001d0963] tuGpu::loadLUT: initialized Jg buffer (29407356656 bytes (27.4GB)) at 0x00007f9142000000 for device 0 (ordinal 0) in CUDA pinned memory
201019256 [001d0963] AriocBase::UpdateProgress:   0%: 0 pairs (0 mates) aligned
201208697 [001d0962] watchdog stuck: 001d0963(1902947):0@100
201311706 [001d0962] watchdog stuck: 001d0963(1902947):0@100
201406714 [001d0962] watchdog stuck: 001d0963(1902947):0@100
201511724 [001d0962] watchdog stuck: 001d0963(1902947):0@100
RWilton commented 1 year ago

This is an interesting problem.

Since AriocP aligns the same data properly with a different hardware and OS configuration, this is unlikely to be a simple error in the code. Unfortunately, the fact that the program fails by "hanging" makes things difficult.

I dislike to bother you with additional troubleshooting/debugging suggestions at this point. Is there a way I could ssh into that machine, put the compiled binary under a debugger (cuda-gdb), and try to pinpoint the code that's failing?

Alternatively, is there a way you can execute Arioc on at least one of those GPUs without colliding with the other applications that are running there? Again, the ideal configuration would configure the GPUs for computation only ("TCC mode") and use some other device for video output. Or perhaps you can install X-windows, gnome, and firefox to use only one of the GPUs in order to leave the other free for computation only.

I don't want to give up on this little problem. Again, I think you will get very good performance from that GPU hardware if you can only convince the Arioc software to run on it!

aedera commented 1 year ago

Sorry, but the server cannot be accessed through ssh from outside of our LAN. I could not configure the GPUs on TCC mode, however I removed all the applications consuming GPU resources. With this setting, AriocP still hangs. These are the outputs

120546958 [0001ddc5] AriocP v1.51.3130.22316 (debug) +2022-11-12T15:42:47
120546962 [0001ddc5] Copyright (c) 2015-2022 Johns Hopkins University.  All rights reserved.
120546962 [0001ddc5]  compiled             : 2022-11-15T10:14:24 (GNU g++ v9.4.0)
120546962 [0001ddc5]  data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian)
120546962 [0001ddc5]  computer name        : camus
120546962 [0001ddc5]  operating system     : Ubuntu 20.04.4 LTS
120546962 [0001ddc5]  executable file      : /mnt/ssd2/aedera/ba-nvbs/ba-hg38.m/ba-arioc.m/testing_new_version_of_arioc.m/ca-mapping.m/AriocP
120546962 [0001ddc5]  configuration file   : /mnt/ssd2/aedera/ba-nvbs/ba-hg38.m/ba-arioc.m/testing_new_version_of_arioc.m/ca-mapping.m/file.cfg
120547377 [0001ddc5]  max CPU threads      : *
120547377 [0001ddc5]  number of GPUs       : 2 (mask: 0x00000003)
120547377 [0001ddc5]  batch size           : 16k (16384)
120547377 [0001ddc5]  R sequence path      : ../../../aa-data.m/ref/1_encoded
120547377 [0001ddc5]   DNA base conversion : CT
120547377 [0001ddc5]  nongapped aligner:
120547377 [0001ddc5]   seed                : ssi84_2_30_CT
120547377 [0001ddc5]    hashSeedWidth      : 84
120547377 [0001ddc5]    spacedSeedWeight   : 48
120547377 [0001ddc5]    maxMismatches      : 2
120547377 [0001ddc5]    hashKeyWidth       : 30
120547377 [0001ddc5]   big-bucket threshold: 200
120547377 [0001ddc5]   maxJ                : 200
120547377 [0001ddc5]   seedCoverage        : *
120547377 [0001ddc5]  gapped aligner:
120547377 [0001ddc5]   seed                : hsi25_0_30_CT
120547377 [0001ddc5]    hashSeedWidth      : 25
120547377 [0001ddc5]    hashKeyWidth       : 30
120547377 [0001ddc5]   W scores/penalties  : 2_6_5_3
120547377 [0001ddc5]    Wm (match)         : 2
120547377 [0001ddc5]    Wx (mismatch)      : 6
120547377 [0001ddc5]    Wg (gap)           : 5
120547377 [0001ddc5]    Ws (gap space)     : 3
120547377 [0001ddc5]    Wn (N)             : 6
120547377 [0001ddc5]    Wc (clip penalty)  : 0
120547377 [0001ddc5]    Vt (threshold)     : 2*N - 25
120547377 [0001ddc5]   big-bucket threshold: 200
120547377 [0001ddc5]   maxJ                : 200
120547377 [0001ddc5]   AtO                 : 1
120547377 [0001ddc5]   seedDepth           : 3
120547377 [0001ddc5]   minPosSep           : 5
120547377 [0001ddc5]   mapqVersion         : 0 (Arioc default)
120547377 [0001ddc5]   XM soft clip context: excluded
120547377 [0001ddc5]   XM BQS threshold    : 0
120547377 [0001ddc5]  paired-end reads
120547377 [0001ddc5]   orientation         : convergent
120547377 [0001ddc5]   collisions          : overlap,cover
120547377 [0001ddc5]   fragment length     : 0-1000
120547377 [0001ddc5]   reporting           : unmapped,concordant,discordant,rejected
120547377 [0001ddc5]   maxAperRead         : 1
120547377 [0001ddc5]   seed rev complement : true
120547377 [0001ddc5]  Q sequences (reads)  : (2 files, paired)
120547377 [0001ddc5]   nominal sub rate    : 0.25%
120547377 [0001ddc5]   00000.000 mate 1    : ../ba-encode.m/SRR770598_1$a21.sbf
120547377 [0001ddc5]   00000.000 mate 2    : ../ba-encode.m/SRR770598_2$a21.sbf
120547377 [0001ddc5]  A (alignment output) : (1 path, 1073741824 mappings per output file, output file base name: AriocP)
120547377 [0001ddc5]   SAM                 : . (unmapped,concordant,discordant,rejected)
120547377 [0001ddc5]   cigarFormat         : MIDS
120547377 [0001ddc5]   mdFormat            : standard
120547377 [0001ddc5]  X parameters
120547377 [0001ddc5]   watchdogInterval    : 60 (0x000000000000003c)
120547391 [0001ddc5] AriocBase::loadR: load R from disk...
120550139 [0001ddc5] AriocBase::loadR: loaded 24 R sequences: 2353273816 bytes (2.2GB) from 48 files in 2749ms (816.4MB/s)
120550139 [0001ddc5] AriocBase::loadHJ: load H and J LUTs from disk...
120550139 [0001ddd7] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5368709120 bytes, ... ) on thread 122327
120553138 [0001ddd7] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
120555173 [0001ddd7] tuLoadLUT<T>::main: loaded 5368709120 bytes (5.0GB) in 5034ms (1017.1MB/s) from ../../../aa-data.m/ref/1_encoded/ssi84_2_30_CT/H230.sbf
120555174 [0001dde3] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 29387864440 bytes, ... ) on thread 122339
120606124 [0001dde3] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
120625482 [0001dde3] tuLoadLUT<T>::main: loaded 29387864440 bytes (27.4GB) in 30308ms (924.7MB/s) from ../../../aa-data.m/ref/1_encoded/ssi84_2_30_CT/J230.sbf
120625482 [0001de3d] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5368709120 bytes, ... ) on thread 122429
120628417 [0001de3d] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
120632746 [0001de3d] tuLoadLUT<T>::main: loaded 5368709120 bytes (5.0GB) in 7264ms (704.8MB/s) from ../../../aa-data.m/ref/1_encoded/hsi25_0_30_CT/H030.sbf
120632747 [0001de4e] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 29407356656 bytes, ... ) on thread 122446
120645544 [0001de4e] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
120701414 [0001de4e] tuLoadLUT<T>::main: loaded 29407356655 bytes (27.4GB) in 28668ms (978.3MB/s) from ../../../aa-data.m/ref/1_encoded/hsi25_0_30_CT/J030.sbf
120701415 [0001ddc5] AriocBase::loadHJ: loaded H and J LUTs from disk in 71276ms (930.3MB/s)
120701419 [0001ddc5] AriocP::Main: BuffersPerBinding=10 ThreadsPerGPU=10
120701420 [0001de76] void tuWatchdog::main: watchdog interval 60s
120701462 [0001de77] CudaDeviceBinding::CudaDeviceBinding: CUDA device 0 (NVIDIA RTX A6000): CC 3.7/8.6; CPU thread 0x0001DE77
120701462 [0001de78] CudaDeviceBinding::CudaDeviceBinding: CUDA device 1 (NVIDIA RTX A6000): CC 3.7/8.6; CPU thread 0x0001DE78
120703225 [0001de77] tuGpu::loadR: initialized R buffer (2353273816 bytes (2.2GB)) at 0x00007f9072000000 in CUDA global memory
120703225 [0001de78] tuGpu::loadR: initialized R buffer (2353273816 bytes (2.2GB)) at 0x00007f8fe4000000 in CUDA global memory
120703225 [0001de77] tuGpu::loadLUT: initialized Hn buffer (5368709120 bytes (5.0GB)) at 0x00007fa000000000 for device 0 (ordinal 0) in CUDA pinned memory
120703225 [0001de78] tuGpu::loadLUT: initialized Hn buffer (5368709120 bytes (5.0GB)) at 0x00007fa000000000 for device 1 (ordinal 1) in CUDA pinned memory
120703225 [0001de77] tuGpu::loadLUT: initialized Jn buffer (29387864440 bytes (27.4GB)) at 0x00007f9928000000 for device 0 (ordinal 0) in CUDA pinned memory
120703225 [0001de78] tuGpu::loadLUT: initialized Jn buffer (29387864440 bytes (27.4GB)) at 0x00007f9928000000 for device 1 (ordinal 1) in CUDA pinned memory
120703225 [0001de77] tuGpu::loadLUT: initialized Hg buffer (5368709120 bytes (5.0GB)) at 0x00007f97e8000000 for device 0 (ordinal 0) in CUDA pinned memory
120703225 [0001de78] tuGpu::loadLUT: initialized Hg buffer (5368709120 bytes (5.0GB)) at 0x00007f97e8000000 for device 1 (ordinal 1) in CUDA pinned memory
120703225 [0001de77] tuGpu::loadLUT: initialized Jg buffer (29407356656 bytes (27.4GB)) at 0x00007f910e000000 for device 0 (ordinal 0) in CUDA pinned memory
120703225 [0001de78] tuGpu::loadLUT: initialized Jg buffer (29407356656 bytes (27.4GB)) at 0x00007f910e000000 for device 1 (ordinal 1) in CUDA pinned memory
120703225 [0001de77] AriocBase::UpdateProgress:   0%: 0 pairs (0 mates) aligned
120852431 [0001de76] watchdog stuck: 0001de78(122488):1@100
120955437 [0001de76] watchdog stuck: 0001de78(122488):1@100
121050442 [0001de76] watchdog stuck: 0001de78(122488):1@100
121056994 [0001de77] AriocBase::UpdateProgress:  10%: 18592920 pairs (37185840 mates) aligned (ETA 46 minutes)
121155449 [0001de76] watchdog stuck: 0001de78(122488):1@100
121259454 [0001de76] watchdog stuck: 0001de78(122488):1@100
121357458 [0001de76] watchdog stuck: 0001de78(122488):1@100
121448211 [0001de77] AriocBase::UpdateProgress:  20%: 37185840 pairs (74371680 mates) aligned (ETA 36 minutes)
121503464 [0001de76] watchdog stuck: 0001de78(122488):1@100
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    Off  | 00000000:17:00.0 Off |                  Off |
| 45%   74C    P2   150W / 300W |  48676MiB / 49140MiB |     58%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000    Off  | 00000000:73:00.0  On |                  Off |
| 30%   45C    P8    21W / 300W |  48666MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    122309      C   AriocP                          48543MiB |
|    1   N/A  N/A    122309      C   AriocP                          48533MiB |
+-----------------------------------------------------------------------------+

The hanging also occurs when AriocP uses a single GPU.

aedera commented 1 year ago

Other thing I have observed it is that AriocP works well when it uses a subset of reads. I attached a subset where AriocP hangs but it works well when the last read is removed (@SRR770598.938) SRR770598_2.fastq.gz SRR770598_1.fastq.gz

RWilton commented 1 year ago

Thank you for digging into this.

I will have a look at the subset of reads. Perhaps that last read crosses a resource utilization threshold of some sort, or perhaps AriocP is somehow mishandling the read data itself.

RWilton commented 1 year ago

I've now tried this on three different GPUs and in both Windows and Red Hat Linux, but I have not yet reproduced the error.

This is the kind of thing that gives software developers gray hair.

The direct way to debugging this would be to run AriocP in a debugger and see where it's hanging. Unfortunately, I do not have the hardware at hand to reproduce your GPU and operating system environment.

I can still think of two things you could try yourself that should not take more than a few minutes of your time and that might help us discover the problem:

1- Please recompile Arioc with CUDA_CC=86 (e.g., make AriocP debug CUDA_CC=86). This is the compute capability associated with RTX A6000, and compiling a binary with this instruction set should eliminate any potential problem with the compute capability 3.7 compilation (the default in the AriocP makefile) you are currently using.

Then re-execute AriocP with the tiny 686-pair subset of SRR770598 with which you were able to cause AriocP to fail, using only one GPU (e.g., gpuMask="0x0001").

2- Assuming that the failure persists: please re-execute AriocP with the 686-pair subset that fails, and use verboseMask="0xE0007F1F". Do this twice: once with and once without the "problematic" read that apparently elicits the error. This will generate a detailed debugging trace, so please do this only with your tiny subset of reads. Be sure to redirect AriocP's output to a file (e.g., \path\to\AriocP AriocP.cfg &> AriocP.log).

I will then compare the trace from the two runs and, with luck, be able to infer where the program is getting stuck.

RWilton commented 1 year ago

Since I cannot reproduce this error, I will close this issue for now.

I will reopen it if I have an opportunity to reproduce the error, or if you would like to investigate it further.