SystemsGenetics / KINC

Knowledge Independent Network Construction
MIT License

Segfault for large-scale GMM CPU runs #74

Open bentsherman opened 5 years ago

bentsherman commented 5 years ago

When processing the Rice dataset I get this error with both 256 and 512 MPI ranks:

Testing with P = 256...
[node1629:25546] *** Process received signal ***
[node1629:25546] Signal: Segmentation fault (11)
[node1629:25546] Signal code: Invalid permissions (2)
[node1629:25546] Failing at address: 0x2b4f2fdff9a4
[node1629:25546] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2b4f0eef25d0]
[node1629:25546] [ 1] kinc[0x43b412]
[node1629:25546] [ 2] kinc[0x43bed3]
[node1629:25546] [ 3] kinc[0x43ace5]
[node1629:25546] [ 4] kinc[0x439a97]
[node1629:25546] [ 5] /home/btsheal/software/ACE/3.0.2/lib/libacecore.so.3(_ZN3Ace8Analytic9SerialRun7addWorkEOSt10unique_ptrIN17EAbstractAnalytic5BlockESt14default_deleteIS4_EE+0x22)[0x2b4f0c49bcb2]
[node1629:25546] [ 6] /home/btsheal/software/ACE/3.0.2/lib/libacecore.so.3(_ZN3Ace8Analytic8MPISlave7processERK10QByteArray+0x49)[0x2b4f0c4a01c9]
[node1629:25546] [ 7] /home/btsheal/software/ACE/3.0.2/lib/libacecore.so.3(_ZN3Ace8Analytic8MPISlave12dataReceivedERK10QByteArrayi+0x2b)[0x2b4f0c4a038b]
[node1629:25546] [ 8] /software/Qt/5.9.2/lib/libQt5Core.so.5(_ZN11QMetaObject8activateEP7QObjectiiPPv+0x973)[0x2b4f0e9d8e23]
[node1629:25546] [ 9] /home/btsheal/software/ACE/3.0.2/lib/libacecore.so.3(_ZN3Ace4QMPI12dataReceivedERK10QByteArrayi+0x33)[0x2b4f0c4ab6a3]
[node1629:25546] [10] /home/btsheal/software/ACE/3.0.2/lib/libacecore.so.3(_ZN3Ace4QMPI5probeEP19ompi_communicator_ti+0x120)[0x2b4f0c4821e0]
[node1629:25546] [11] /home/btsheal/software/ACE/3.0.2/lib/libacecore.so.3(_ZN3Ace4QMPI10timerEventEP11QTimerEvent+0x2d)[0x2b4f0c4826cd]
[node1629:25546] [12] /software/Qt/5.9.2/lib/libQt5Core.so.5(_ZN7QObject5eventEP6QEvent+0x64)[0x2b4f0e9d9e94]
[node1629:25546] [13] /software/Qt/5.9.2/lib/libQt5Core.so.5(_ZN16QCoreApplication6notifyEP7QObjectP6QEvent+0x3c)[0x2b4f0e9af38c]
[node1629:25546] [14] /home/btsheal/software/ACE/3.0.2/lib/libaceconsole.so.3(_ZN12EApplication6notifyEP7QObjectP6QEvent+0x16)[0x2b4f0dfd0196]
[node1629:25546] [15] /software/Qt/5.9.2/lib/libQt5Core.so.5(_ZN16QCoreApplication15notifyInternal2EP7QObjectP6QEvent+0x75)[0x2b4f0e9af2c5]
[node1629:25546] [16] /software/Qt/5.9.2/lib/libQt5Core.so.5(_ZN14QTimerInfoList14activateTimersEv+0x4ce)[0x2b4f0e9ff9ee]
[node1629:25546] [17] /software/Qt/5.9.2/lib/libQt5Core.so.5(+0x2c2069)[0x2b4f0ea00069]
[node1629:25546] [18] /lib64/libglib-2.0.so.0(g_main_context_dispatch+0x159)[0x2b4f140774c9]
[node1629:25546] [19] /lib64/libglib-2.0.so.0(+0x4a818)[0x2b4f14077818]
[node1629:25546] [20] /lib64/libglib-2.0.so.0(g_main_context_iteration+0x2c)[0x2b4f140778cc]
[node1629:25546] [21] /software/Qt/5.9.2/lib/libQt5Core.so.5(_ZN20QEventDispatcherGlib13processEventsE6QFlagsIN10QEventLoop17ProcessEventsFlagEE+0x5c)[0x2b4f0ea0035c]
[node1629:25546] [22] /software/Qt/5.9.2/lib/libQt5Core.so.5(_ZN10QEventLoop4execE6QFlagsINS_17ProcessEventsFlagEE+0xfb)[0x2b4f0e9ad5bb]
[node1629:25546] [23] /software/Qt/5.9.2/lib/libQt5Core.so.5(_ZN16QCoreApplication4execEv+0x84)[0x2b4f0e9b5b84]
[node1629:25546] [24] /home/btsheal/software/ACE/3.0.2/lib/libaceconsole.so.3(_ZN12EApplication4execEv+0x714)[0x2b4f0dfd1c94]
[node1629:25546] [25] kinc[0x412e49]
[node1629:25546] [26] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b4f0f9b5495]
[node1629:25546] [27] kinc[0x4132c2]
[node1629:25546] *** End of error message ***
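
The `kinc` frames in the backtrace above carry no symbol names, so one possible next step is to pull out their raw addresses and resolve them against a debug build. The sketch below is only an illustration: the helper names `parse_frames` and `unresolved_addresses` are hypothetical and not part of KINC or ACE, and the regular expression assumes the Open MPI backtrace layout shown above.

```python
import re

# Assumed frame layout (taken from the backtrace above):
#   [host:pid] [ 5] /path/lib.so(_Zmangled+0x22)[0x2b4f0c49bcb2]
#   [host:pid] [ 1] kinc[0x43b412]
FRAME_RE = re.compile(
    r"\[\s*(?P<index>\d+)\]\s+"                                  # frame number
    r"(?P<module>[^\s(\[]+)"                                     # binary or shared library
    r"(?:\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-fA-F]+)\))?"   # optional (symbol+offset)
    r"\[(?P<address>0x[0-9a-fA-F]+)\]"                           # absolute address
)


def parse_frames(log_text):
    """Return one dict per backtrace frame found in log_text (hypothetical helper)."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            continue  # signal header/footer lines are skipped
        frame = match.groupdict()
        frame["index"] = int(frame["index"])
        frame["symbol"] = frame["symbol"] or None  # empty or missing symbol becomes None
        frames.append(frame)
    return frames


def unresolved_addresses(frames, module="kinc"):
    """Addresses of frames in `module` that have no symbol name (hypothetical helper)."""
    return [
        f["address"]
        for f in frames
        if f["module"].rsplit("/", 1)[-1] == module and f["symbol"] is None
    ]
```

The resulting addresses could then be passed to a tool such as `addr2line -f -C -e kinc <address>`, assuming the binary was built with debug information; without it, the tool will not be able to report source locations.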

I know that this error doesn't occur at 128, so I'm guessing there's a cutoff somewhere in the number of MPI ranks. I'm not yet sure whether the cause lies in ACE, KINC, or Palmetto. I will post new information as it becomes available.
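
If there really is a rank-count cutoff between 128 and 256, it could be located with a binary search over the number of ranks. The sketch below is a minimal illustration under two assumptions: failures are monotonic (every rank count above the cutoff fails), and `runs_ok` is a hypothetical callable supplied by the tester that launches a run with the given number of ranks and returns `True` when it completes without a segfault. No actual KINC or scheduler command is implied.

```python
def smallest_failing_ranks(runs_ok, known_good, known_bad):
    """Binary search for the smallest rank count at which runs_ok() is False.

    Assumes runs_ok(known_good) is True, runs_ok(known_bad) is False,
    and that failures are monotonic in the rank count.
    """
    if known_good >= known_bad:
        raise ValueError("known_good must be smaller than known_bad")
    lo, hi = known_good, known_bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if runs_ok(mid):
            lo = mid   # still fine at mid, so the cutoff is higher
        else:
            hi = mid   # fails at mid, so the cutoff is at or below mid
    return hi
```

Starting from the two data points reported here (128 works, 256 fails), this would need at most seven additional runs to pin down the exact threshold. If the failure turns out to be intermittent rather than monotonic, the result would only be a rough indication.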

spficklin commented 4 years ago

This issue is over 1 year old. Is it still a problem?

bentsherman commented 4 years ago

I haven't taken the time to test KINC with this many MPI ranks since then, but I'm still concerned that it could be a problem. Before closing this issue, I would like either myself or someone else to test KINC with up to 1024 CPU cores, because I want to be sure that KINC and ACE can function properly at that scale. The error I got makes me worry that there is still some barrier to reaching that level of scalability.