Open aleSuglia opened 3 years ago
I have never seen this error before. Perhaps follow this link (https://github.com/ofiwg/libfabric/issues/6332) for more information.
Yeah, very weird one. Not sure where this is coming from, to be honest. I would leave it open here in case somebody else comes across this.
I've tried setting the environment variable FI_EFA_FORK_SAFE=1, but it still produces the same error.
Hi, I just noticed this issue. If you are still having this problem, you need to not only set the environment variable FI_EFA_FORK_SAFE=1, but also use a newer version of the EFA installer (the current version is 1.13.0).
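For anyone who wants to rule out the variable simply not reaching the training process, here is a minimal sketch in Python. It assumes the variable has to be present in the environment of the process that loads libfabric; the helper names `env_with_fork_safe` and `run_with_fork_safe` are made up for illustration and are not part of this repo or of libfabric.

```python
import os
import subprocess
import sys


def env_with_fork_safe(base_env=None):
    """Return a copy of the environment with FI_EFA_FORK_SAFE=1 added.

    Hypothetical helper: copies the given mapping (or os.environ) so the
    caller's environment is not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["FI_EFA_FORK_SAFE"] = "1"
    return env


def run_with_fork_safe(cmd):
    """Run `cmd` (a list of arguments) as a child process with the variable set."""
    return subprocess.run(
        cmd, env=env_with_fork_safe(), capture_output=True, text=True
    )


if __name__ == "__main__":
    # Sanity check: ask a child Python process what it sees.
    check = [
        sys.executable,
        "-c",
        "import os; print(os.environ.get('FI_EFA_FORK_SAFE'))",
    ]
    print(run_with_fork_safe(check).stdout.strip())
```

The same check can be pointed at the real training command. Whether the variable is forwarded to every worker in a multi-process launch depends on the launcher, so it may be worth printing it from inside the training script as well.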
Hello @linjieli222,
I'm trying to train a model for VideoQA but I obtain the following error:
This happens immediately after the script loads the data. I can see the logging info [1,0]<stderr>:07/20/2021 09:31:42 - INFO - __main__ - 122039 samples loaded. Can you please advise? Otherwise, would you have a trained model for VideoQA that I can test?
UPDATE: I've also tried with a single GPU (by removing horovodrun) and the same error happens.