SystemsGenetics / KINC

Knowledge Independent Network Construction
MIT License

--minexpr filter not working #101

Closed SystemsGenetics-0310 closed 4 years ago

SystemsGenetics-0310 commented 5 years ago

Hi,

I ran kinc on Clemson's Palmetto Cluster. I am able to run the entire pipeline using GPUs (select=5:ncpus=20:mpiprocs=20:ngpus=2:gpu_model=k40:mem=120gb,walltime=72:00:00).

In the similarity step, I have successfully used the --minsamp and --maxcorr filters; however, when using the --minexpr filter (with the command below), I get an error.

Command: mpirun -np 200 kinc run similarity --input all.emx --ccm all_thinned.ccm --cmx all_thinned.cmx --clusmethod gmm --minexpr -2
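For readers unfamiliar with these options, here is a minimal Python sketch of what a minimum-expression filter combined with a minimum-sample filter conceptually does for one gene pair. It is an illustration only, not KINC's implementation: the function name `filter_pair`, the `>=` comparison, the NaN handling, and the default values are all assumptions made for this example.

```python
import math


def filter_pair(x, y, minexpr=-2.0, minsamp=3):
    """Return indices of samples kept for a gene pair, or None if too few remain.

    Hypothetical helper: a sample is kept only when both genes have a
    non-missing expression value at or above `minexpr`. The pair is
    skipped (None) when fewer than `minsamp` samples survive.
    """
    kept = [
        i
        for i, (a, b) in enumerate(zip(x, y))
        if not (math.isnan(a) or math.isnan(b)) and a >= minexpr and b >= minexpr
    ]
    return kept if len(kept) >= minsamp else None


if __name__ == "__main__":
    gene_a = [-3.0, 0.5, 1.0, 2.0]
    gene_b = [0.1, 0.2, -5.0, 1.5]
    # Samples 0 and 2 fall below the -2 threshold in one of the two genes.
    print(filter_pair(gene_a, gene_b, minexpr=-2.0, minsamp=2))
    # With a stricter sample minimum the pair is skipped entirely.
    print(filter_pair(gene_a, gene_b, minexpr=-2.0, minsamp=3))
```

A negative threshold such as -2 makes sense when the expression matrix has been log-transformed, which is why the value in the command above is not necessarily a mistake.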

Error:
mlx5: node1983.palmetto.clemson.edu: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00008e01 1001a5b2 0006a0d2
[node1983:2337 :0:2337] rc_verbs_iface.c:69   send completion with error: local length error
==== backtrace (tid:   2337) ====
 0 0x000000000004b9b8 ucs_fatal_error_message()  /software/src/aantao/ucx-1.6.0/contrib/../src/ucs/debug/assert.c:36
 1 0x000000000004e9ec ucs_log_default_handler()  /software/src/aantao/ucx-1.6.0/contrib/../src/ucs/debug/log.c:140
 2 0x000000000004eb14 ucs_log_dispatch()  /software/src/aantao/ucx-1.6.0/contrib/../src/ucs/debug/log.c:193
 3 0x0000000000023058 uct_rc_verbs_iface_poll_tx()  /software/src/aantao/ucx-1.6.0/contrib/../src/uct/ib/rc/verbs/rc_verbs_iface.c:108
 4 0x0000000000023058 uct_rc_verbs_iface_progress()  /software/src/aantao/ucx-1.6.0/contrib/../src/uct/ib/rc/verbs/rc_verbs_iface.c:136
 5 0x0000000000019102 ucs_callbackq_dispatch()  /software/src/aantao/ucx-1.6.0/contrib/../src/ucs/datastruct/callbackq.h:211
 6 0x0000000000019102 uct_worker_progress()  /software/src/aantao/ucx-1.6.0/contrib/../src/uct/api/uct.h:1806
 7 0x0000000000019102 ucp_worker_progress()  /software/src/aantao/ucx-1.6.0/contrib/../src/ucp/core/ucp_worker.c:1632
 8 0x0000000000003147 mca_pml_ucx_progress()  ???:0
 9 0x00000000000312fc opal_progress()  ???:0
10 0x00000000000033bd mca_pml_ucx_recv()  ???:0
11 0x0000000000083335 PMPI_Recv()  ???:0
12 0x0000000000032fee Ace::QMPI::probe()  ???:0
13 0x00000000000334dd Ace::QMPI::timerEvent()  ???:0
14 0x000000000029b654 QObject::event()  ???:0
15 0x0000000000270b4c QCoreApplication::notify()  ???:0
16 0x000000000001ac11 EApplication::notify()  ???:0
17 0x0000000000270a85 QCoreApplication::notifyInternal2()  ???:0
18 0x00000000002c08fe QTimerInfoList::activateTimers()  ???:0
19 0x00000000002c0f79 QEventDispatcherGlib::flush()  ???:0
20 0x000000000004c049 g_main_context_dispatch()  ???:0
21 0x000000000004c3a8 g_main_context_dispatch()  ???:0
22 0x000000000004c45c g_main_context_iteration()  ???:0
23 0x00000000002c126c QEventDispatcherGlib::processEvents()  ???:0
24 0x000000000026ed7b QEventLoop::exec()  ???:0
25 0x0000000000277344 QCoreApplication::exec()  ???:0
26 0x000000000001c6d0 EApplication::exec()  ???:0
27 0x0000000000414d20 ???()  /home/saski/applications/bin/kinc:0
28 0x0000000000022495 __libc_start_main()  ???:0
29 0x00000000004152ec ???()  /home/saski/applications/bin/kinc:0
=================================
[node1983:02337] *** Process received signal ***
[node1983:02337] Signal: Aborted (6)
[node1983:02337] Signal code:  (-6)
[node1983:02337] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x150b28b515d0]
[node1983:02337] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x150b27ce22c7]
[node1983:02337] [ 2] /lib64/libc.so.6(abort+0x148)[0x150b27ce39b8]
[node1983:02337] [ 3] /software/UCX/gcc-1.6.0/install/lib/libucs.so.0(ucs_fatal_error_message+0x6d)[0x150af161e9bd]
[node1983:02337] [ 4] /software/UCX/gcc-1.6.0/install/lib/libucs.so.0(+0x4e9ec)[0x150af16219ec]
[node1983:02337] [ 5] /software/UCX/gcc-1.6.0/install/lib/libucs.so.0(ucs_log_dispatch+0xc4)[0x150af1621b14]
[node1983:02337] [ 6] /software/UCX/gcc-1.6.0/install/lib/ucx/libuct_ib.so.0(+0x23058)[0x150af0f5d058]
[node1983:02337] [ 7] /software/UCX/gcc-1.6.0/install/lib/libucp.so.0(ucp_worker_progress+0x22)[0x150af1b65102]
[node1983:02337] [ 8] /software/openmpi/3.1.4-gcc71-ucx/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17)[0x150af1d99147]
[node1983:02337] [ 9] /software/openmpi/3.1.4-gcc71-ucx/lib/libopen-pal.so.40(opal_progress+0x2c)[0x150b26d9f2fc]
[node1983:02337] [10] /software/openmpi/3.1.4-gcc71-ucx/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_recv+0xdd)[0x150af1d993bd]
[node1983:02337] [11] /software/openmpi/3.1.4-gcc71-ucx/lib/libmpi.so.40(PMPI_Recv+0x115)[0x150b2a2d2335]
[node1983:02337] [12] /home/saski/applications/lib/libacecore.so.3(_ZN3Ace4QMPI5probeEP19ompi_communicator_ti+0xfe)[0x150b35733fee]
[node1983:02337] [13] /home/saski/applications/lib/libacecore.so.3(_ZN3Ace4QMPI10timerEventEP11QTimerEvent+0x2d)[0x150b357344dd]
[node1983:02337] [14] /software/Qt/5.9.1/lib/libQt5Core.so.5(_ZN7QObject5eventEP6QEvent+0x64)[0x150b29285654]
[node1983:02337] [15] /software/Qt/5.9.1/lib/libQt5Core.so.5(_ZN16QCoreApplication6notifyEP7QObjectP6QEvent+0x3c)[0x150b2925ab4c]
[node1983:02337] [16] /home/saski/applications/lib/libaceconsole.so.3(_ZN12EApplication6notifyEP7QObjectP6QEvent+0x11)[0x150b29a45c11]
[node1983:02337] [17] /software/Qt/5.9.1/lib/libQt5Core.so.5(_ZN16QCoreApplication15notifyInternal2EP7QObjectP6QEvent+0x75)[0x150b2925aa85]
[node1983:02337] [18] /software/Qt/5.9.1/lib/libQt5Core.so.5(_ZN14QTimerInfoList14activateTimersEv+0x4ce)[0x150b292aa8fe]
[node1983:02337] [19] /software/Qt/5.9.1/lib/libQt5Core.so.5(+0x2c0f79)[0x150b292aaf79]
[node1983:02337] [20] /lib64/libglib-2.0.so.0(g_main_context_dispatch+0x159)[0x150b2406b049]
[node1983:02337] [21] /lib64/libglib-2.0.so.0(+0x4c3a8)[0x150b2406b3a8]
[node1983:02337] [22] /lib64/libglib-2.0.so.0(g_main_context_iteration+0x2c)[0x150b2406b45c]
[node1983:02337] [23] /software/Qt/5.9.1/lib/libQt5Core.so.5(_ZN20QEventDispatcherGlib13processEventsE6QFlagsIN10QEventLoop17ProcessEventsFlagEE+0x5c)[0x150b292ab26c]
[node1983:02337] [24] /software/Qt/5.9.1/lib/libQt5Core.so.5(_ZN10QEventLoop4execE6QFlagsINS_17ProcessEventsFlagEE+0xfb)[0x150b29258d7b]
[node1983:02337] [25] /software/Qt/5.9.1/lib/libQt5Core.so.5(_ZN16QCoreApplication4execEv+0x84)[0x150b29261344]
[node1983:02337] [26] /home/saski/applications/lib/libaceconsole.so.3(_ZN12EApplication4execEv+0x520)[0x150b29a476d0]
[node1983:02337] [27] kinc[0x414d20]
[node1983:02337] [28] /lib64/libc.so.6(__libc_start_main+0xf5)[0x150b27cce495]
[node1983:02337] [29] kinc[0x4152ec]
[node1983:02337] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node node1983 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
4ctrl-alt-del commented 4 years ago

What version of KINC are you running?

SystemsGenetics-0310 commented 4 years ago

KINC-3.3.0

4ctrl-alt-del commented 4 years ago

Please use the latest master branch and try again. There are compile-level issues with 3.3.0 that make it impossible for me to build it on my machine, which I need to do in order to debug your issue.

spficklin commented 4 years ago

I think this is resolved with the latest development version of KINC so I'm closing this out. Thanks @cugiuser for reporting. Please comment if you're still having issues.

SystemsGenetics-0310 commented 4 years ago

Thanks Stephen! Correct, the new version (latest branch) worked. I will follow up with you all soon on my results. Take care. Chris