MKaczkow opened this issue 7 months ago
You could just remove the lines initializing the `DistributedDataParallel` in `app/vjepa/train.py`, i.e. lines 295-297, as a quick fix.
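In case it helps: the same quick fix can also be expressed as a guard rather than a deletion, so multi-GPU launches keep working. A minimal sketch, assuming the training script wraps its modules in `DistributedDataParallel` around those lines; `maybe_wrap_ddp` is a hypothetical helper, not something that exists in the repo:

```python
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def maybe_wrap_ddp(module, device):
    # Only wrap in DDP when a multi-process group is actually running;
    # otherwise return the plain module, so single-GPU runs never touch
    # the (uninitialized) default process group.
    if dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1:
        return DistributedDataParallel(module, device_ids=[device])
    return module

# e.g. instead of  encoder = DistributedDataParallel(encoder, ...)
# use              encoder = maybe_wrap_ddp(encoder, device)
```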
It didn't help, I am afraid; I am still getting:
`ValueError: Default process group has not been initialized, please make sure to call init_process_group.`
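If stripping out the DDP wrapper alone doesn't clear the error, other code paths (e.g. `SyncBatchNorm` or all-reduce helpers) may still be calling into `torch.distributed`. One generic workaround is to create a trivial single-process group up front; a minimal sketch in plain PyTorch, not repo-specific, with an arbitrary free port:

```python
import os
import torch.distributed as dist

# A 1-process "gloo" group makes the default process group exist, so later
# torch.distributed calls no longer raise
# "ValueError: Default process group has not been initialized ...".
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "12321")
if dist.is_available() and not dist.is_initialized():
    dist.init_process_group(backend="gloo", rank=0, world_size=1)
```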
Did you also try to remove it from the eval scripts, i.e. line 201 in `evals/image_classification_frozen/eval.py`?
I faced this same issue using a single GPU on one machine; I got it working by changing the port and explicitly defining the rank and world size. For evaluation you can edit line 131 in `evals/video_classification_frozen/eval.py` to be:
`world_size, rank = init_distributed(port=12321, rank_and_world_size=(0, 1))`
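For context, a helper like `init_distributed` usually reads the rank and world size from the environment unless they are passed explicitly, and then calls `torch.distributed.init_process_group` on the given port, so passing `(0, 1)` forces a single-process setup. The following is a hypothetical paraphrase of such a helper (names and logic assumed, not the repo's actual implementation), shown only to illustrate why the override sidesteps the error:

```python
import os
import torch
import torch.distributed as dist

def init_distributed_sketch(port=12321, rank_and_world_size=(None, None)):
    # Hypothetical stand-in for the repo's helper: the process group is
    # initialized up front instead of being assumed to exist.
    rank, world_size = rank_and_world_size
    if rank is None or world_size is None:
        # Normally these would come from the launcher / environment.
        rank = int(os.environ.get("RANK", 0))
        world_size = int(os.environ.get("WORLD_SIZE", 1))
    os.environ["MASTER_ADDR"] = os.environ.get("MASTER_ADDR", "localhost")
    os.environ["MASTER_PORT"] = str(port)
    if not dist.is_initialized():
        backend = "nccl" if torch.cuda.is_available() else "gloo"
        dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
    return world_size, rank

# world_size, rank = init_distributed_sketch(port=12321, rank_and_world_size=(0, 1))
```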
First of all, thanks for providing this code 🙏
tl;dr
I am getting a `ValueError` when trying to run eval on the iNat21 dataset with `python -m evals.main --fname configs/evals/vitl16_inat.yaml --devices cuda:0`, and I am running out of ideas how to fix it.

Config values for iNaturalist-2021 (`configs\evals\vith16_inat.yaml`)
look like this:

I have tried:

- checking whether `torch.distributed` is available but not initialized, but I haven't been able to pinpoint where this happens (see the sketch after this list)
- checking whether `DistributedDataParallel` is the root cause, but I haven't found it in the repo
- checking whether `SyncBatchNorm` behaves in an unexpected way when running on a single GPU, but this has already been fixed in this PR
- changing `evals.main` to avoid using the `init_distributed` function
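Regarding the first point, the snippet below (plain PyTorch, nothing repo-specific) shows the difference between `torch.distributed` being available and being initialized; collectives and DDP fail with exactly this `ValueError` when the former holds but the latter does not:

```python
import torch.distributed as dist

# "available" only means PyTorch was built with distributed support;
# an *initialized* default process group is a separate, runtime requirement.
print("available:  ", dist.is_available())    # typically True
print("initialized:", dist.is_initialized())  # False until init_process_group runs

if dist.is_available() and not dist.is_initialized():
    # Any collective (all_reduce, broadcast, ...) or DDP / SyncBatchNorm usage
    # at this point fails with:
    #   ValueError: Default process group has not been initialized,
    #   please make sure to call init_process_group.
    pass
```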
Full stacktrace