Training hang #4 (VALUE-Leaderboard / StarterCode)

aleSuglia commented 3 years ago

Hello @linjieli222,

I'm trying to train a model for VideoQA but I obtain the following error:

[1,0]<stderr>:Stalled ranks:
[1,0]<stderr>:1: [allgather.noname.1]
[1,0]<stderr>:[2021-07-20 09:23:22.936726: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[1,0]<stderr>:07/20/2021 09:31:42 - INFO - __main__ -   122039 samples loaded
[1,0]<stderr>:A process has executed an operation involving a call
[1,0]<stderr>:to the fork() system call to create a child process.
[1,0]<stderr>:
[1,0]<stderr>:As a result, the libfabric EFA provider is operating in
[1,0]<stderr>:a condition that could result in memory corruption or
[1,0]<stderr>:other system errors.
[1,0]<stderr>:
[1,0]<stderr>:For the libfabric EFA provider to work safely when fork()
[1,0]<stderr>:is called, the application must handle memory registrations
[1,0]<stderr>:(FI_MR_LOCAL) and you will need to set the following environment
[1,0]<stderr>:variables:
[1,0]<stderr>:          RDMAV_FORK_SAFE=1
[1,0]<stderr>:MPI applications do not support this mode.
[1,0]<stderr>:
[1,0]<stderr>:However, this setting can result in signficant performance
[1,0]<stderr>:impact to your application due to increased cost of memory
[1,0]<stderr>:registration.
[1,0]<stderr>:
[1,0]<stderr>:You may want to check with your application vendor to see
[1,0]<stderr>:if an application-level alternative (of not using fork)
[1,0]<stderr>:exists.
[1,0]<stderr>:
[1,0]<stderr>:Please refer to https://github.com/ofiwg/libfabric/issues/6332
[1,0]<stderr>:for more information.
[1,0]<stderr>:
[1,0]<stderr>:Your job will now abort.

This happens immediately after the script loads the data. I can see the logging info [1,0]<stderr>:07/20/2021 09:31:42 - INFO - __main__ - 122039 samples loaded. Can you please advise?

Otherwise, would you have a trained model for VideoQA that I can test?

UPDATE: I've also tried with a single GPU (by removing horovodrun) and the same error occurs.

linjieli222 commented 3 years ago

I have never seen this error before. Perhaps follow this link (https://github.com/ofiwg/libfabric/issues/6332) for more information.

aleSuglia commented 3 years ago

Yeah, a very weird one. To be honest, I'm not sure where this is coming from. I'll leave it here in case somebody else comes across it.

aleSuglia commented 3 years ago

I've tried setting the environment variable FI_EFA_FORK_SAFE=1 but it still produces the same error.

wzamazon commented 3 years ago

Hi, I just noticed this issue. If you are still having the problem, you need to not only set the environment variable FI_EFA_FORK_SAFE=1 but also use a newer version of the EFA installer (the current version is 1.13.0).
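
For anyone landing here later, a minimal sketch of setting the two variables named in this thread (FI_EFA_FORK_SAFE from the comments, RDMAV_FORK_SAFE from the log) at the very top of the training script, before Horovod or any MPI-backed library is imported. The helper name `enable_efa_fork_safety` is hypothetical, and whether setting the variables in-process is early enough for a given setup is an assumption; exporting them in the shell before launching the job is the more conservative route. As noted above, the variable alone may not help without a newer EFA installer.

```python
import os


def enable_efa_fork_safety(env=None):
    """Set the fork-safety variables to "1" where they are not already "1".

    `env` defaults to os.environ; a plain dict can be passed for testing.
    Returns the list of variable names that were changed.
    """
    env = os.environ if env is None else env
    changed = []
    # Names taken from the thread and the libfabric message in the log.
    for name in ("FI_EFA_FORK_SAFE", "RDMAV_FORK_SAFE"):
        if env.get(name) != "1":
            env[name] = "1"
            changed.append(name)
    return changed


# Call this before importing horovod (assumption: the EFA provider reads
# the variables when it is initialised, not at import of this module).
enable_efa_fork_safety()
```

Child processes started afterwards inherit the variables, since they are written to `os.environ` of the current process.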
