ChenRocks / UNITER

Research code for ECCV 2020 paper "UNITER: UNiversal Image-TExt Representation Learning"
https://arxiv.org/abs/1909.11740

Reproduce error: Non-existant physical address #82

Open jieWANGforwork opened 3 years ago

jieWANGforwork commented 3 years ago

Hi!

I hit this error when reproducing the VQA task. Could you please have a look and give me some suggestions based on your experience? Thanks a lot!

0%| | 0/6000 [00:00<?, ?it/s]
[1,0]:08/15/2021 10:08:00 - INFO - main - Running training with 4 GPUs
[1,0]:08/15/2021 10:08:00 - INFO - main - Num examples = 471128
[1,0]:08/15/2021 10:08:00 - INFO - main - Batch size = 1024
[1,0]:08/15/2021 10:08:00 - INFO - main - Accumulate steps = 5
[1,0]:08/15/2021 10:08:00 - INFO - main - Num steps = 6000
[1,0]:[1a62e574072d:00334] Process received signal
[1,0]:[1a62e574072d:00334] Signal: Bus error (7)
[1,0]:[1a62e574072d:00334] Signal code: Non-existant physical address (2)
[1,0]:[1a62e574072d:00334] Failing at address: 0x7f246888f00a
[1,0]:[1a62e574072d:00334] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f25b870a390]
[1,0]:[1a62e574072d:00334] [ 1] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x128e0)[0x7f25aa5988e0]
[1,0]:[1a62e574072d:00334] [ 2] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x12b74)[0x7f25aa598b74]
[1,0]:[1a62e574072d:00334] [ 3] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x14ba5)[0x7f25aa59aba5]
[1,0]:[1a62e574072d:00334] [ 4] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(mdb_get+0xbc)[0x7f25aa59b40c]
[1,0]:[1a62e574072d:00334] [ 5] [1,0]:/opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x9d9d)[0x7f25aa58fd9d]
[1,0]:[1a62e574072d:00334] [ 6] python(_PyCFunction_FastCallDict+0x154)[0x55f2e08c1744]
[1,0]:[1a62e574072d:00334] [ 7] [1,0]:python(+0x19842c)[0x55f2e094842c]
[1,0]:[1a62e574072d:00334] [ 8] python(_PyEval_EvalFrameDefault+0x30a)[0x55f2e096d38a]
[1,0]:[1a62e574072d:00334] [ 9] [1,0]:python(_PyFunction_FastCallDict+0x11b)[0x55f2e0942bab]
[1,0]:[1a62e574072d:00334] [10] python(_PyObject_FastCallDict+0x26f)[0x55f2e08c1b0f]
[1,0]:[1a62e574072d:00334] [11] [1,0]:python(_PyObject_Call_Prepend+0x63)[0x55f2e08c66a3]
[1,0]:[1a62e574072d:00334] [12] python(PyObject_Call+0x3e)[0x55f2e08c154e]
[1,0]:[1a62e574072d:00334] [13] [1,0]:python(+0x16b50a)[0x55f2e091b50a]
[1,0]:[1a62e574072d:00334] [14] python(_PyEval_EvalFrameDefault+0x877)[0x55f2e096d8f7]
[1,0]:[1a62e574072d:00334] [15] [1,0]:python(_PyFunction_FastCallDict+0x11b)[0x55f2e0942bab]
[1,0]:[1a62e574072d:00334] [16] python(_PyObject_FastCallDict+0x26f)[0x55f2e08c1b0f]
[1,0]:[1a62e574072d:00334] [17] python(_PyObject_Call_Prepend+0x63)[0x55f2e08c66a3]
[1,0]:[1a62e574072d:00334] [18] [1,0]:python(PyObject_Call+0x3e)[0x55f2e08c154e]
[1,0]:[1a62e574072d:00334] [19] python(+0x16b50a)[0x55f2e091b50a]
[1,0]:[1a62e574072d:00334] [1,0]:[20] python(_PyEval_EvalFrameDefault+0x877)[0x55f2e096d8f7]
[1,0]:[1a62e574072d:00334] [21] [1,0]:python(+0x19253b)[0x55f2e094253b]
[1,0]:[1a62e574072d:00334] [22] python(+0x198505)[0x55f2e0948505]
[1,0]:[1a62e574072d:00334] [23] [1,0]:python(_PyEval_EvalFrameDefault+0x30a)[0x55f2e096d38a]
[1,0]:[1a62e574072d:00334] [24] python(+0x191a76)[0x55f2e0941a76]
[1,0]:[1a62e574072d:00334] [25] python(_PyFunction_FastCallDict+0x1bc)[0x55f2e0942c4c]
[1,0]:[1a62e574072d:00334] [26] [1,0]:python(_PyObject_FastCallDict+0x26f)[0x55f2e08c1b0f]
[1,0]:[1a62e574072d:00334] [27] python(_PyObject_Call_Prepend+0x63)[0x55f2e08c66a3]
[1,0]:[1a62e574072d:00334] [28] [1,0]:python(PyObject_Call+0x3e)[0x55f2e08c154e]
[1,0]:[1a62e574072d:00334] [29] python(+0x16b50a)[0x55f2e091b50a]
[1,0]:[1a62e574072d:00334] End of error message
[1,2]:[1a62e574072d:00336] Process received signal
[1,2]:[1a62e574072d:00336] Signal: Bus error (7)
[1,2]:[1a62e574072d:00336] Signal code: Non-existant physical address (2)
[1,2]:[1a62e574072d:00336] Failing at address: 0x7f56e824f00a
[1,3]:[1a62e574072d:00337] Process received signal
[1,3]:[1a62e574072d:00337] Signal: Bus error (7)
[1,3]:[1a62e574072d:00337] Signal code: Non-existant physical address (2)
[1,3]:[1a62e574072d:00337] Failing at address: 0x7fd82888f00a
[1,3]:[1a62e574072d:00337] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fd96b06d390]
[1,3]:[1a62e574072d:00337] [ 1] [1,3]:/opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x128e0)[0x7fd955ee38e0]
[1,3]:[1a62e574072d:00337] [ 2] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x12b74)[0x7fd955ee3b74]
[1,3]:[1a62e574072d:00337] [ 3] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x14ba5)[0x7fd955ee5ba5]
[1,3]:[1a62e574072d:00337] [ 4] [1,3]:/opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(mdb_get+0xbc)[0x7fd955ee640c]
[1,3]:[1a62e574072d:00337] [ 5] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x9d9d)[0x7fd955edad9d]
[1,3]:[1a62e574072d:00337] [ 6]
[1,2]:[1a62e574072d:00336] [ 0]
[1,3]:python(_PyCFunction_FastCallDict+0x154)[0x5600587e3744]
[1,3]:[1a62e574072d:00337] [ 7] python(+0x19842c)[0x56005886a42c]
[1,3]:[1a62e574072d:00337] [ 8]
[1,2]:/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f582a139390]
[1,2]:[1a62e574072d:00336] [ 1] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x128e0)[0x7f58106a88e0]
[1,2]:[1a62e574072d:00336] [ 2] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x12b74)[0x7f58106a8b74]
[1,2]:[1a62e574072d:00336] [ 3] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x14ba5)[0x7f58106aaba5]
[1,2]:[1a62e574072d:00336] [ 4] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(mdb_get+0xbc)[0x7f58106ab40c]
[1,2]:[1a62e574072d:00336] [ 5] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x9d9d)[0x7f581069fd9d]
[1,2]:[1a62e574072d:00336] [ 6]
[1,3]:python(_PyEval_EvalFrameDefault+0x30a)[0x56005888f38a]
[1,3]:[1a62e574072d:00337] [ 9] python(_PyFunction_FastCallDict+0x11b)[0x560058864bab]
[1,3]:[1a62e574072d:00337] [10]
[1,2]:python(_PyCFunction_FastCallDict+0x154)[0x558e13464744]
[1,2]:[1a62e574072d:00336] [ 7] python(+0x19842c)[0x558e134eb42c]
[1,2]:[1a62e574072d:00336] [ 8]
[1,3]:python(_PyObject_FastCallDict+0x26f)[0x5600587e3b0f]
[1,3]:[1a62e574072d:00337] [11]
[1,2]:python(_PyEval_EvalFrameDefault+0x30a)[0x558e1351038a]
[1,2]:[1a62e574072d:00336] [ 9] python(_PyFunction_FastCallDict+0x11b)[0x558e134e5bab]
[1,2]:[1a62e574072d:00336] [10]
[1,3]:python(_PyObject_Call_Prepend+0x63)[0x5600587e86a3]
[1,3]:[1a62e574072d:00337] [12] python(PyObject_Call+0x3e)[0x5600587e354e]
[1,3]:[1a62e574072d:00337] [13]
[1,2]:python(_PyObject_FastCallDict+0x26f)[0x558e13464b0f]
[1,2]:[1a62e574072d:00336] [11]
[1,3]:python(+0x16b50a)[0x56005883d50a]
[1,3]:[1a62e574072d:00337] [14]
[1,2]:python(_PyObject_Call_Prepend+0x63)[0x558e134696a3]
[1,2]:[1a62e574072d:00336] [12]
[1,3]:python(_PyEval_EvalFrameDefault+0x877)[0x56005888f8f7]
[1,3]:[1a62e574072d:00337] [15]
[1,2]:python(PyObject_Call+0x3e)[0x558e1346454e]
[1,2]:[1a62e574072d:00336] [13]
[1,3]:python(_PyFunction_FastCallDict+0x11b)[0x560058864bab]
[1,3]:[1a62e574072d:00337] [16]
[1,2]:python(+0x16b50a)[0x558e134be50a]
[1,2]:[1a62e574072d:00336] [14]
[1,3]:python(_PyObject_FastCallDict+0x26f)[0x5600587e3b0f]
[1,3]:[1a62e574072d:00337] [17] python(_PyObject_Call_Prepend+0x63)[0x5600587e86a3]
[1,3]:[1a62e574072d:00337] [18]
[1,2]:python(_PyEval_EvalFrameDefault+0x877)[0x558e135108f7]
[1,2]:[1a62e574072d:00336] [15] python(_PyFunction_FastCallDict+0x11b)[0x558e134e5bab]
[1,3]:python(PyObject_Call+0x3e)[0x5600587e354e]
[1,3]:[1a62e574072d:00337] [19]
[1,2]:[1a62e574072d:00336] [16] python(_PyObject_FastCallDict+0x26f)[0x558e13464b0f]
[1,2]:[1a62e574072d:00336] [17]
[1,3]:python(+0x16b50a)[0x56005883d50a]
[1,3]:[1a62e574072d:00337] [20]
[1,2]:python(_PyObject_Call_Prepend+0x63)[0x558e134696a3]
[1,2]:[1a62e574072d:00336] [18]
[1,3]:python(_PyEval_EvalFrameDefault+0x877)[0x56005888f8f7]
[1,3]:[1a62e574072d:00337] [21]
[1,2]:python(PyObject_Call+0x3e)[0x558e1346454e]
[1,2]:[1a62e574072d:00336] [19]
[1,3]:python(+0x19253b)[0x56005886453b]
[1,3]:[1a62e574072d:00337] [22] python(+0x198505)[0x56005886a505]
[1,3]:[1a62e574072d:00337] [23]
[1,2]:python(+0x16b50a)[0x558e134be50a]
[1,2]:[1a62e574072d:00336] [20] python(_PyEval_EvalFrameDefault+0x877)[0x558e135108f7]
[1,2]:[1a62e574072d:00336] [21]
[1,3]:python(_PyEval_EvalFrameDefault+0x30a)[0x56005888f38a]
[1,3]:[1a62e574072d:00337] [24] python(+0x191a76)[0x560058863a76]
[1,3]:[1a62e574072d:00337] [25]
[1,2]:python(+0x19253b)[0x558e134e553b]
[1,2]:[1a62e574072d:00336] [22]
[1,3]:python(_PyFunction_FastCallDict+0x1bc)[0x560058864c4c]
[1,3]:[1a62e574072d:00337] [26]
[1,2]:python(+0x198505)[0x558e134eb505]
[1,2]:[1a62e574072d:00336] [23]
[1,3]:python(_PyObject_FastCallDict+0x26f)[0x5600587e3b0f]
[1,3]:[1a62e574072d:00337] [27]
[1,2]:python(_PyEval_EvalFrameDefault+0x30a)[0x558e1351038a]
[1,2]:[1a62e574072d:00336] [24] python(+0x191a76)[0x558e134e4a76]
[1,3]:python(_PyObject_Call_Prepend+0x63)[0x5600587e86a3]
[1,3]:[1a62e574072d:00337] [28]
[1,2]:[1a62e574072d:00336] [25] python(_PyFunction_FastCallDict+0x1bc)[0x558e134e5c4c]
[1,2]:[1a62e574072d:00336] [26]
[1,3]:python(PyObject_Call+0x3e)[0x5600587e354e]
[1,3]:[1a62e574072d:00337] [29] python(+0x16b50a)[0x56005883d50a]
[1,2]:python(_PyObject_FastCallDict+0x26f)[0x558e13464b0f]
[1,2]:[1a62e574072d:00336] [27] python(_PyObject_Call_Prepend+0x63)[0x558e134696a3]
[1,2]:[1a62e574072d:00336] [28]
[1,3]:[1a62e574072d:00337] End of error message
[1,2]:python(PyObject_Call+0x3e)[0x558e1346454e]
[1,2]:[1a62e574072d:00336] [29] [1,2]:python(+0x16b50a)[0x558e134be50a]
[1,2]:[1a62e574072d:00336] End of error message
[1,1]:[1a62e574072d:00335] Process received signal
[1,1]:[1a62e574072d:00335] Signal: Bus error (7)
[1,1]:[1a62e574072d:00335] Signal code: Non-existant physical address (2)
[1,1]:[1a62e574072d:00335] Failing at address: 0x7fd41730e00a
[1,1]:[1a62e574072d:00335] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fd5591ad390]
[1,1]:[1a62e574072d:00335] [ 1] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x128e0)[0x7fd44f81f8e0]
[1,1]:[1a62e574072d:00335] [1,1]:[ 2] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x12b74)[0x7fd44f81fb74]
[1,1]:[1a62e574072d:00335] [ 3] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x14ba5)[0x7fd44f821ba5]
[1,1]:[1a62e574072d:00335] [ 4] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(mdb_get+0xbc)[0x7fd44f82240c]
[1,1]:[1a62e574072d:00335] [ 5] /opt/conda/lib/python3.6/site-packages/lmdb/cpython.cpython-36m-x86_64-linux-gnu.so(+0x9d9d)[0x7fd44f816d9d]
[1,1]:[1a62e574072d:00335] [ 6] [1,1]:python(_PyCFunction_FastCallDict+0x154)[0x55e2b8f02744]
[1,1]:[1a62e574072d:00335] [ 7] [1,1]:python(+0x19842c)[0x55e2b8f8942c]
[1,1]:[1a62e574072d:00335] [ 8] [1,1]:python(_PyEval_EvalFrameDefault+0x30a)[0x55e2b8fae38a]
[1,1]:[1a62e574072d:00335] [ 9] [1,1]:python(_PyFunction_FastCallDict+0x11b)[0x55e2b8f83bab]
[1,1]:[1a62e574072d:00335] [10] [1,1]:python(_PyObject_FastCallDict+0x26f)[0x55e2b8f02b0f]
[1,1]:[1a62e574072d:00335] [11] [1,1]:python(_PyObject_Call_Prepend+0x63)[0x55e2b8f076a3]
[1,1]:[1a62e574072d:00335] [12] [1,1]:python(PyObject_Call+0x3e)[0x55e2b8f0254e]
[1,1]:[1a62e574072d:00335] [13] [1,1]:python(+0x16b50a)[0x55e2b8f5c50a]
[1,1]:[1a62e574072d:00335] [14] [1,1]:python(_PyEval_EvalFrameDefault+0x877)[0x55e2b8fae8f7]
[1,1]:[1a62e574072d:00335] [15] [1,1]:python(_PyFunction_FastCallDict+0x11b)[0x55e2b8f83bab]
[1,1]:[1a62e574072d:00335] [16] [1,1]:python(_PyObject_FastCallDict+0x26f)[0x55e2b8f02b0f]
[1,1]:[1a62e574072d:00335] [17] [1,1]:python(_PyObject_Call_Prepend+0x63)[0x55e2b8f076a3]
[1,1]:[1a62e574072d:00335] [18] [1,1]:python(PyObject_Call+0x3e)[0x55e2b8f0254e]
[1,1]:[1a62e574072d:00335] [19] [1,1]:python(+0x16b50a)[0x55e2b8f5c50a]
[1,1]:[1a62e574072d:00335] [20] [1,1]:python(_PyEval_EvalFrameDefault+0x877)[0x55e2b8fae8f7]
[1,1]:[1a62e574072d:00335] [21] [1,1]:python(+0x19253b)[0x55e2b8f8353b]
[1,1]:[1a62e574072d:00335] [22] [1,1]:python(+0x198505)[0x55e2b8f89505]
[1,1]:[1a62e574072d:00335] [23] [1,1]:python(_PyEval_EvalFrameDefault+0x30a)[0x55e2b8fae38a]
[1,1]:[1a62e574072d:00335] [24] [1,1]:python(+0x191a76)[0x55e2b8f82a76]
[1,1]:[1a62e574072d:00335] [25] [1,1]:python(_PyFunction_FastCallDict+0x1bc)[0x55e2b8f83c4c]
[1,1]:[1a62e574072d:00335] [26] [1,1]:python(_PyObject_FastCallDict+0x26f)[0x55e2b8f02b0f]
[1,1]:[1a62e574072d:00335] [27] [1,1]:python(_PyObject_Call_Prepend+0x63)[0x55e2b8f076a3]
[1,1]:[1a62e574072d:00335] [28] [1,1]:python(PyObject_Call+0x3e)[0x55e2b8f0254e]
[1,1]:[1a62e574072d:00335] [29] [1,1]:python(+0x16b50a)[0x55e2b8f5c50a]
[1,1]:[1a62e574072d:00335] End of error message

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpirun noticed that process rank 0 with PID 0 on node 1a62e574072d exited on signal 7 (Bus error).
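
One possible cause worth ruling out, offered as an assumption and not as a confirmed diagnosis of this log: all four ranks fault inside `mdb_get`, and a bus error on a memory-mapped read can occur when the backing file is shorter than the region being read, for example after an incomplete download or copy of `data.mdb`. The sketch below compares the file size on disk with the page count that py-lmdb reports through `env.info()` and `env.stat()`. The function names and the directory path are hypothetical and do not come from the UNITER code.

```python
import os


def min_expected_bytes(last_pgno: int, page_size: int) -> int:
    """Smallest data file size that can hold pages 0..last_pgno."""
    return (last_pgno + 1) * page_size


def looks_truncated(actual_bytes: int, last_pgno: int, page_size: int) -> bool:
    """True if the file is too short for the pages the LMDB header claims."""
    return actual_bytes < min_expected_bytes(last_pgno, page_size)


def check_lmdb_dir(db_dir: str) -> bool:
    """Return True if data.mdb inside db_dir looks truncated.

    Assumes db_dir is an LMDB directory (containing data.mdb), which is
    an assumption about how the features are stored.
    """
    import lmdb  # third-party; imported lazily so the helpers above need only the stdlib

    env = lmdb.open(db_dir, readonly=True, create=False, lock=False)
    try:
        last_pgno = env.info()["last_pgno"]  # ID of the last used page
        page_size = env.stat()["psize"]      # page size in bytes
    finally:
        env.close()
    actual = os.path.getsize(os.path.join(db_dir, "data.mdb"))
    return looks_truncated(actual, last_pgno, page_size)
```

Calling `check_lmdb_dir("/path/to/img_db")` (placeholder path) on each database directory used by the VQA config would return `True` for any file that is shorter than its own header implies. A `False` result does not rule out other causes, such as a storage or container memory limit.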