Open jieWANGforwork opened 3 years ago
Hi!
I hit this error when reproducing the VQA task. Could you please have a look and give me some suggestions based on your experience? Thanks a lot!
mpirun noticed that process rank 0 with PID 0 on node 1a62e574072d exited on signal 7 (Bus error).
0%| | 0/6000 [00:00<?, ?it/s]
[1,0]:08/15/2021 10:08:00 - INFO - main - Running training with 4 GPUs
[1,0]:08/15/2021 10:08:00 - INFO - main - Num examples = 471128
[1,0]:08/15/2021 10:08:00 - INFO - main - Batch size = 1024
[1,0]:08/15/2021 10:08:00 - INFO - main - Accumulate steps = 5
[1,0]:08/15/2021 10:08:00 - INFO - main - Num steps = 6000
[1,0]:[1a62e574072d:00334] Process received signal
[1,0]:[1a62e574072d:00334] Signal: Bus error (7)
[1,0]:[1a62e574072d:00334] Signal code: Non-existant physical address (2)
[1,0]:[1a62e574072d:00334] Failing at address: 0x7f246888f00a
[1,0]:[1a62e574072d:00334] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f25b870a390]
[1,0]:[1a62e574072d:00334] [ 1] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x128e0)[0x7f25aa5988e0]
[1,0]:[1a62e574072d:00334] [ 2] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x12b74)[0x7f25aa598b74]
[1,0]:[1a62e574072d:00334] [ 3] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x14ba5)[0x7f25aa59aba5]
[1,0]:[1a62e574072d:00334] [ 4] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(mdb_get+0xbc)[0x7f25aa59b40c]
[1,0]:[1a62e574072d:00334] [ 5] [1,0]:/opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x9d9d)[0x7f25aa58fd9d]
[1,0]:[1a62e574072d:00334] [ 6] python(_PyCFunction_FastCallDict+0x154)[0x55f2e08c1744]
[1,0]:[1a62e574072d:00334] [ 7] [1,0]:python(+0x19842c)[0x55f2e094842c]
[1,0]:[1a62e574072d:00334] [ 8] python(_PyEval_EvalFrameDefault+0x30a)[0x55f2e096d38a]
[1,0]:[1a62e574072d:00334] [ 9] [1,0]:python(_PyFunction_FastCallDict+0x11b)[0x55f2e0942bab]
[1,0]:[1a62e574072d:00334] [10] python(_PyObject_FastCallDict+0x26f)[0x55f2e08c1b0f]
[1,0]:[1a62e574072d:00334] [11] [1,0]:python(_PyObject_Call_Prepend+0x63)[0x55f2e08c66a3]
[1,0]:[1a62e574072d:00334] [12] python(PyObject_Call+0x3e)[0x55f2e08c154e]
[1,0]:[1a62e574072d:00334] [13] [1,0]:python(+0x16b50a)[0x55f2e091b50a]
[1,0]:[1a62e574072d:00334] [14] python(_PyEval_EvalFrameDefault+0x877)[0x55f2e096d8f7]
[1,0]:[1a62e574072d:00334] [15] [1,0]:python(_PyFunction_FastCallDict+0x11b)[0x55f2e0942bab]
[1,0]:[1a62e574072d:00334] [16] python(_PyObject_FastCallDict+0x26f)[0x55f2e08c1b0f]
[1,0]:[1a62e574072d:00334] [17] python(_PyObject_Call_Prepend+0x63)[0x55f2e08c66a3]
[1,0]:[1a62e574072d:00334] [18] [1,0]:python(PyObject_Call+0x3e)[0x55f2e08c154e]
[1,0]:[1a62e574072d:00334] [19] python(+0x16b50a)[0x55f2e091b50a]
[1,0]:[1a62e574072d:00334] [1,0]:[20] python(_PyEval_EvalFrameDefault+0x877)[0x55f2e096d8f7]
[1,0]:[1a62e574072d:00334] [21] [1,0]:python(+0x19253b)[0x55f2e094253b]
[1,0]:[1a62e574072d:00334] [22] python(+0x198505)[0x55f2e0948505]
[1,0]:[1a62e574072d:00334] [23] [1,0]:python(_PyEval_EvalFrameDefault+0x30a)[0x55f2e096d38a]
[1,0]:[1a62e574072d:00334] [24] python(+0x191a76)[0x55f2e0941a76]
[1,0]:[1a62e574072d:00334] [25] python(_PyFunction_FastCallDict+0x1bc)[0x55f2e0942c4c]
[1,0]:[1a62e574072d:00334] [26] [1,0]:python(_PyObject_FastCallDict+0x26f)[0x55f2e08c1b0f]
[1,0]:[1a62e574072d:00334] [27] python(_PyObject_Call_Prepend+0x63)[0x55f2e08c66a3]
[1,0]:[1a62e574072d:00334] [28] [1,0]:python(PyObject_Call+0x3e)[0x55f2e08c154e]
[1,0]:[1a62e574072d:00334] [29] python(+0x16b50a)[0x55f2e091b50a]
[1,0]:[1a62e574072d:00334] End of error message
[1,2]:[1a62e574072d:00336] Process received signal
[1,2]:[1a62e574072d:00336] Signal: Bus error (7)
[1,2]:[1a62e574072d:00336] Signal code: Non-existant physical address (2)
[1,2]:[1a62e574072d:00336] Failing at address: 0x7f56e824f00a
[1,3]:[1a62e574072d:00337] Process received signal
[1,3]:[1a62e574072d:00337] Signal: Bus error (7)
[1,3]:[1a62e574072d:00337] Signal code: Non-existant physical address (2)
[1,3]:[1a62e574072d:00337] Failing at address: 0x7fd82888f00a
[1,3]:[1a62e574072d:00337] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fd96b06d390]
[1,3]:[1a62e574072d:00337] [ 1] [1,3]:/opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x128e0)[0x7fd955ee38e0]
[1,3]:[1a62e574072d:00337] [ 2] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x12b74)[0x7fd955ee3b74]
[1,3]:[1a62e574072d:00337] [ 3] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x14ba5)[0x7fd955ee5ba5]
[1,3]:[1a62e574072d:00337] [ 4] [1,3]:/opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(mdb_get+0xbc)[0x7fd955ee640c]
[1,3]:[1a62e574072d:00337] [ 5] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x9d9d)[0x7fd955edad9d]
[1,3]:[1a62e574072d:00337] [ 6] [1,2]:[1a62e574072d:00336] [ 0] [1,3]:python(_PyCFunction_FastCallDict+0x154)[0x5600587e3744]
[1,3]:[1a62e574072d:00337] [ 7] python(+0x19842c)[0x56005886a42c]
[1,3]:[1a62e574072d:00337] [ 8] [1,2]:/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f582a139390]
[1,2]:[1a62e574072d:00336] [ 1] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x128e0)[0x7f58106a88e0]
[1,2]:[1a62e574072d:00336] [ 2] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x12b74)[0x7f58106a8b74]
[1,2]:[1a62e574072d:00336] [ 3] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x14ba5)[0x7f58106aaba5]
[1,2]:[1a62e574072d:00336] [ 4] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(mdb_get+0xbc)[0x7f58106ab40c]
[1,2]:[1a62e574072d:00336] [ 5] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x9d9d)[0x7f581069fd9d]
[1,2]:[1a62e574072d:00336] [ 6] [1,3]:python(_PyEval_EvalFrameDefault+0x30a)[0x56005888f38a]
[1,3]:[1a62e574072d:00337] [ 9] python(_PyFunction_FastCallDict+0x11b)[0x560058864bab]
[1,3]:[1a62e574072d:00337] [10] [1,2]:python(_PyCFunction_FastCallDict+0x154)[0x558e13464744]
[1,2]:[1a62e574072d:00336] [ 7] python(+0x19842c)[0x558e134eb42c]
[1,2]:[1a62e574072d:00336] [ 8] [1,3]:python(_PyObject_FastCallDict+0x26f)[0x5600587e3b0f]
[1,3]:[1a62e574072d:00337] [11] [1,2]:python(_PyEval_EvalFrameDefault+0x30a)[0x558e1351038a]
[1,2]:[1a62e574072d:00336] [ 9] python(_PyFunction_FastCallDict+0x11b)[0x558e134e5bab]
[1,2]:[1a62e574072d:00336] [10] [1,3]:python(_PyObject_Call_Prepend+0x63)[0x5600587e86a3]
[1,3]:[1a62e574072d:00337] [12] python(PyObject_Call+0x3e)[0x5600587e354e]
[1,3]:[1a62e574072d:00337] [13] [1,2]:python(_PyObject_FastCallDict+0x26f)[0x558e13464b0f]
[1,2]:[1a62e574072d:00336] [11] [1,3]:python(+0x16b50a)[0x56005883d50a]
[1,3]:[1a62e574072d:00337] [14] [1,2]:python(_PyObject_Call_Prepend+0x63)[0x558e134696a3]
[1,2]:[1a62e574072d:00336] [12] [1,3]:python(_PyEval_EvalFrameDefault+0x877)[0x56005888f8f7]
[1,3]:[1a62e574072d:00337] [15] [1,2]:python(PyObject_Call+0x3e)[0x558e1346454e]
[1,2]:[1a62e574072d:00336] [13] [1,3]:python(_PyFunction_FastCallDict+0x11b)[0x560058864bab]
[1,3]:[1a62e574072d:00337] [16] [1,2]:python(+0x16b50a)[0x558e134be50a]
[1,2]:[1a62e574072d:00336] [14] [1,3]:python(_PyObject_FastCallDict+0x26f)[0x5600587e3b0f]
[1,3]:[1a62e574072d:00337] [17] python(_PyObject_Call_Prepend+0x63)[0x5600587e86a3]
[1,3]:[1a62e574072d:00337] [18] [1,2]:python(_PyEval_EvalFrameDefault+0x877)[0x558e135108f7]
[1,2]:[1a62e574072d:00336] [15] python(_PyFunction_FastCallDict+0x11b)[0x558e134e5bab]
[1,3]:python(PyObject_Call+0x3e)[0x5600587e354e]
[1,3]:[1a62e574072d:00337] [19] [1,2]:[1a62e574072d:00336] [16] python(_PyObject_FastCallDict+0x26f)[0x558e13464b0f]
[1,2]:[1a62e574072d:00336] [17] [1,3]:python(+0x16b50a)[0x56005883d50a]
[1,3]:[1a62e574072d:00337] [20] [1,2]:python(_PyObject_Call_Prepend+0x63)[0x558e134696a3]
[1,2]:[1a62e574072d:00336] [18] [1,3]:python(_PyEval_EvalFrameDefault+0x877)[0x56005888f8f7]
[1,3]:[1a62e574072d:00337] [21] [1,2]:python(PyObject_Call+0x3e)[0x558e1346454e]
[1,2]:[1a62e574072d:00336] [19] [1,3]:python(+0x19253b)[0x56005886453b]
[1,3]:[1a62e574072d:00337] [22] python(+0x198505)[0x56005886a505]
[1,3]:[1a62e574072d:00337] [23] [1,2]:python(+0x16b50a)[0x558e134be50a]
[1,2]:[1a62e574072d:00336] [20] python(_PyEval_EvalFrameDefault+0x877)[0x558e135108f7]
[1,2]:[1a62e574072d:00336] [21] [1,3]:python(_PyEval_EvalFrameDefault+0x30a)[0x56005888f38a]
[1,3]:[1a62e574072d:00337] [24] python(+0x191a76)[0x560058863a76]
[1,3]:[1a62e574072d:00337] [25] [1,2]:python(+0x19253b)[0x558e134e553b]
[1,2]:[1a62e574072d:00336] [22] [1,3]:python(_PyFunction_FastCallDict+0x1bc)[0x560058864c4c]
[1,3]:[1a62e574072d:00337] [26] [1,2]:python(+0x198505)[0x558e134eb505]
[1,2]:[1a62e574072d:00336] [23] [1,3]:python(_PyObject_FastCallDict+0x26f)[0x5600587e3b0f]
[1,3]:[1a62e574072d:00337] [27] [1,2]:python(_PyEval_EvalFrameDefault+0x30a)[0x558e1351038a]
[1,2]:[1a62e574072d:00336] [24] python(+0x191a76)[0x558e134e4a76]
[1,3]:python(_PyObject_Call_Prepend+0x63)[0x5600587e86a3]
[1,3]:[1a62e574072d:00337] [28] [1,2]:[1a62e574072d:00336] [25] python(_PyFunction_FastCallDict+0x1bc)[0x558e134e5c4c]
[1,2]:[1a62e574072d:00336] [26] [1,3]:python(PyObject_Call+0x3e)[0x5600587e354e]
[1,3]:[1a62e574072d:00337] [29] python(+0x16b50a)[0x56005883d50a]
[1,2]:python(_PyObject_FastCallDict+0x26f)[0x558e13464b0f]
[1,2]:[1a62e574072d:00336] [27] python(_PyObject_Call_Prepend+0x63)[0x558e134696a3]
[1,2]:[1a62e574072d:00336] [28] [1,3]:[1a62e574072d:00337] End of error message
[1,2]:python(PyObject_Call+0x3e)[0x558e1346454e]
[1,2]:[1a62e574072d:00336] [29] [1,2]:python(+0x16b50a)[0x558e134be50a]
[1,2]:[1a62e574072d:00336] End of error message
[1,1]:[1a62e574072d:00335] Process received signal
[1,1]:[1a62e574072d:00335] Signal: Bus error (7)
[1,1]:[1a62e574072d:00335] Signal code: Non-existant physical address (2)
[1,1]:[1a62e574072d:00335] Failing at address: 0x7fd41730e00a
[1,1]:[1a62e574072d:00335] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fd5591ad390]
[1,1]:[1a62e574072d:00335] [ 1] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x128e0)[0x7fd44f81f8e0]
[1,1]:[1a62e574072d:00335] [1,1]:[ 2] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x12b74)[0x7fd44f81fb74]
[1,1]:[1a62e574072d:00335] [ 3] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x14ba5)[0x7fd44f821ba5]
[1,1]:[1a62e574072d:00335] [ 4] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(mdb_get+0xbc)[0x7fd44f82240c]
[1,1]:[1a62e574072d:00335] [ 5] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x9d9d)[0x7fd44f816d9d]
[1,1]:[1a62e574072d:00335] [ 6] [1,1]:python(_PyCFunction_FastCallDict+0x154)[0x55e2b8f02744]
[1,1]:[1a62e574072d:00335] [ 7] [1,1]:python(+0x19842c)[0x55e2b8f8942c]
[1,1]:[1a62e574072d:00335] [ 8] [1,1]:python(_PyEval_EvalFrameDefault+0x30a)[0x55e2b8fae38a]
[1,1]:[1a62e574072d:00335] [ 9] [1,1]:python(_PyFunction_FastCallDict+0x11b)[0x55e2b8f83bab]
[1,1]:[1a62e574072d:00335] [10] [1,1]:python(_PyObject_FastCallDict+0x26f)[0x55e2b8f02b0f]
[1,1]:[1a62e574072d:00335] [11] [1,1]:python(_PyObject_Call_Prepend+0x63)[0x55e2b8f076a3]
[1,1]:[1a62e574072d:00335] [12] [1,1]:python(PyObject_Call+0x3e)[0x55e2b8f0254e]
[1,1]:[1a62e574072d:00335] [13] [1,1]:python(+0x16b50a)[0x55e2b8f5c50a]
[1,1]:[1a62e574072d:00335] [14] [1,1]:python(_PyEval_EvalFrameDefault+0x877)[0x55e2b8fae8f7]
[1,1]:[1a62e574072d:00335] [15] [1,1]:python(_PyFunction_FastCallDict+0x11b)[0x55e2b8f83bab]
[1,1]:[1a62e574072d:00335] [16] [1,1]:python(_PyObject_FastCallDict+0x26f)[0x55e2b8f02b0f]
[1,1]:[1a62e574072d:00335] [17] [1,1]:python(_PyObject_Call_Prepend+0x63)[0x55e2b8f076a3]
[1,1]:[1a62e574072d:00335] [18] [1,1]:python(PyObject_Call+0x3e)[0x55e2b8f0254e]
[1,1]:[1a62e574072d:00335] [19] [1,1]:python(+0x16b50a)[0x55e2b8f5c50a]
[1,1]:[1a62e574072d:00335] [20] [1,1]:python(_PyEval_EvalFrameDefault+0x877)[0x55e2b8fae8f7]
[1,1]:[1a62e574072d:00335] [21] [1,1]:python(+0x19253b)[0x55e2b8f8353b]
[1,1]:[1a62e574072d:00335] [22] [1,1]:python(+0x198505)[0x55e2b8f89505]
[1,1]:[1a62e574072d:00335] [23] [1,1]:python(_PyEval_EvalFrameDefault+0x30a)[0x55e2b8fae38a]
[1,1]:[1a62e574072d:00335] [24] [1,1]:python(+0x191a76)[0x55e2b8f82a76]
[1,1]:[1a62e574072d:00335] [25] [1,1]:python(_PyFunction_FastCallDict+0x1bc)[0x55e2b8f83c4c]
[1,1]:[1a62e574072d:00335] [26] [1,1]:python(_PyObject_FastCallDict+0x26f)[0x55e2b8f02b0f]
[1,1]:[1a62e574072d:00335] [27] [1,1]:python(_PyObject_Call_Prepend+0x63)[0x55e2b8f076a3]
[1,1]:[1a62e574072d:00335] [28] [1,1]:python(PyObject_Call+0x3e)[0x55e2b8f0254e]
[1,1]:[1a62e574072d:00335] [29] [1,1]:python(+0x16b50a)[0x55e2b8f5c50a]
[1,1]:[1a62e574072d:00335] End of error message
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 0 with PID 0 on node 1a62e574072d exited on signal 7 (Bus error).