Haiyang-W / GiT

[ECCV2024 Oral🔥] Official Implementation of "GiT: Towards Generalist Vision Transformer through Universal Language Interface"
https://arxiv.org/abs/2403.09394
Apache License 2.0

torch.distributed.elastic.multiprocessing.errors.ChildFailedError in tools/train.py #4

Closed (zsh513 closed this issue 6 months ago)

zsh513 commented 6 months ago

I tried to train the single detection task with "bash tools/dist_train.sh configs/GiT/single_detection_base.py 1 --work-dir output/test"

[image: screenshot of the error]

but I get the error shown above (the ChildFailedError in the title).

How can I solve this problem?

Thank you!

Haiyang-W commented 6 months ago

I suspect it might be a system or communication issue on your end, not a problem with our code here. You could try testing it on a standard server.

zsh513 commented 6 months ago

Thank you. I debugged it and found that the ChildFailedError was caused by "SyntaxError: Failed to format the config file", raised at line 123 of train.py, "runner = Runner.from_cfg(cfg)". I followed https://github.com/open-mmlab/mmdetection/issues/10974 to solve it.
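
For anyone hitting the same message: the thread does not say what the fix in the linked issue was, so the following is only a minimal sketch for collecting the installed package versions before comparing them with that issue. The package list (yapf, mmengine, mmdet, torch) is an assumption about which dependencies are relevant, not something confirmed here.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumed list of packages worth checking; adjust to your environment.
    for name in ("yapf", "mmengine", "mmdet", "torch"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

Running this in the training environment prints one line per package, which can then be checked against the versions discussed in the linked issue.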

Haiyang-W commented 6 months ago

Nice! Hope you have a good time with our code. :)

Haiyang-W commented 6 months ago

For faster training, I recommend using a resolution of 672 here. You can also use bfloat16, which will be 2 times faster than fp32.

zsh513 commented 6 months ago

Thank you very much for your project and suggestions!

Haiyang-W commented 6 months ago

No problem, wish you all the best. We will also strive to integrate support for bfloat16 training as soon as possible.