Closed · pelletierlab closed this 3 years ago
I got it started with 1 GPU on Windows.
utils.py
dist.init_process_group(
backend="gloo", # <-- change to gloo (NCCL is not available on Windows)
init_method=args.dist_url,
world_size=args.world_size,
rank=args.rank,
)
args.gpu = 0 # force the GPU index; by default the wrong index was detected
del D:/somefile.asd <-- delete the stale rendezvous file before each (re)launch
python -m torch.distributed.launch --nproc_per_node=1 main_dino.py --dist_url "file://D:/somefile.asd" --arch deit_small --data_path "faces" --output_dir "dino_save"
If an out-of-memory error occurs, decrease --batch_size_per_gpu.
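The manual `del D:/somefile.asd` step above can be automated. A minimal sketch (the helper name `clear_rendezvous_file` is hypothetical, not part of DINO) that turns the `file://` dist_url back into a filesystem path and removes the stale rendezvous file:

```python
from pathlib import Path
from urllib.parse import urlparse

def clear_rendezvous_file(dist_url: str) -> bool:
    """Delete a stale file:// rendezvous file, returning True if one
    was removed. Hypothetical helper; it just automates the manual
    `del D:/somefile.asd` step before relaunching."""
    parsed = urlparse(dist_url)
    if parsed.scheme != "file":
        return False  # env:// or tcp:// rendezvous leaves no file behind
    # On Windows, "file://D:/somefile.asd" parses with netloc "D:" and
    # path "/somefile.asd", so rejoin them to recover the full path.
    path = Path(parsed.netloc + parsed.path)
    if path.exists():
        path.unlink()
        return True
    return False
```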
This is going to take a long time.
From the README, the vanilla run takes ~2 days with 8 GPUs:
Run DINO with DeiT-small network on a single node with 8 GPUs for 100 epochs with the following command. Training time is 1.75 day and the resulting checkpoint should reach ~69.3% on k-NN eval and ~73.8% on linear eval.
and the boosted model takes ~3 days with 16 GPUs (3x more epochs on 2x more GPUs):
You can improve the performance of the vanilla run by: [...] -> training for more epochs:
--epochs 300
[...] The resulting pretrained model should reach ~73.4% on k-NN eval and ~76.1% on linear eval. Training time is 2.6 days with 16 GPUs.
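Scaling those README numbers down to one GPU gives a rough sense of the cost. A back-of-envelope sketch, assuming training time scales linearly with GPU count (real runs only approximate this, since per-GPU batch size and communication overhead also matter):

```python
# Convert "wall-clock days on N GPUs" into single-GPU days under a
# linear-scaling assumption (a simplification, not a measured number).
def single_gpu_days(days: float, gpus: int) -> float:
    return days * gpus

vanilla_100ep = single_gpu_days(1.75, 8)   # 100 epochs -> 14.0 GPU-days
boosted_300ep = single_gpu_days(2.6, 16)   # 300 epochs -> ~41.6 GPU-days
```

So the vanilla 100-epoch run would take roughly two weeks on a single GPU of the same class, which is why the replies below suggest deit_tiny for faster experiments.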
interested
Hi @pelletierlab
To train on 1 GPU I run
python -m torch.distributed.launch --nproc_per_node=1 main_dino.py --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir
For faster runs, you could use the --arch deit_tiny architecture instead of --arch deit_small.
See https://github.com/facebookresearch/dino/commit/534f37f000a10afff97c0d96ec4df81875193699
Now you should be able to run on 1 GPU directly with the following command:
python main_dino.py --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir
However, I still recommend using torch.distributed.launch.
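The gist of that commit, as I understand it, is a fallback when the launcher's environment variables are absent. A sketch (not the commit's exact code) of the detection logic:

```python
import os

def detect_launch_mode():
    """Sketch of a single-process fallback: torch.distributed.launch
    exports RANK / WORLD_SIZE / LOCAL_RANK, so when they are missing
    we assume a plain `python main_dino.py` launch and configure a
    one-process "job" of world size 1."""
    if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
        gpu = int(os.environ.get("LOCAL_RANK", "0"))
    else:
        rank, world_size, gpu = 0, 1, 0
        # env:// rendezvous still needs an address, even for one process.
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")
    return rank, world_size, gpu
```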
I am using Windows with PyTorch 1.5.0 and only have 1 GPU. I tried the suggestions above but got the error below:
I ran: python -m torch.distributed.launch --nproc_per_node=1 main_dino.py --data_path C:/Users/Owner/shopee/product_detection/train/train --output_dir checkpoints
Traceback (most recent call last):
File "main_dino.py", line 461, in <module>
train_dino(args)
File "main_dino.py", line 131, in train_dino
utils.init_distributed_mode(args)
File "D:\Ramdhan\SSL\dino-main\utils.py", line 456, in init_distributed_mode
dist.init_process_group(
AttributeError: module 'torch.distributed' has no attribute 'init_process_group'
Traceback (most recent call last):
File "C:\Users\Owner\Anaconda3\envs\nlp\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\Owner\Anaconda3\envs\nlp\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\Owner\Anaconda3\envs\nlp\lib\site-packages\torch\distributed\launch.py", line 263, in <module>
main()
File "C:\Users\Owner\Anaconda3\envs\nlp\lib\site-packages\torch\distributed\launch.py", line 259, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['C:\\Users\\Owner\\Anaconda3\\envs\\nlp\\python.exe', '-u', 'main_dino.py', '--local_rank=0', '--data_path', 'C:/Users/Owner/shopee/product_detection/train/train', '--output_dir', 'checkpoints']' returned non-zero exit status 1.
Please advise: is there any way to run it on Windows with 1 GPU?
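The AttributeError is most likely a PyTorch version issue rather than a DINO bug: to my knowledge, Windows builds before PyTorch 1.7 ship torch.distributed without init_process_group (Windows support, gloo backend only, arrived as a prototype in 1.7 and matured in 1.8). A hypothetical guard sketching that cutoff:

```python
def windows_dist_supported(torch_version: str) -> bool:
    """Rough guard (hypothetical helper): torch.distributed on Windows
    first shipped as a prototype in PyTorch 1.7. On 1.5.0 the module
    imports but has no init_process_group, which matches the
    AttributeError in the traceback above."""
    major, minor = (int(x) for x in torch_version.split(".")[:2])
    return (major, minor) >= (1, 7)
```

So the practical fix here would be upgrading PyTorch before retrying the gloo/file:// workaround from the top of the thread.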
Hi @pelletierlab To train on 1 GPU I run
python -m torch.distributed.launch --nproc_per_node=1 main_dino.py --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir
To have faster runs, you could use
--arch deit_tiny
architecture instead of --arch deit_small
I tried this command but got the error RuntimeError: No rendezvous handler for env:// . Could you tell me how to solve this problem? Thank you!
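For context on that error: env:// tells PyTorch to read the rendezvous details from environment variables that torch.distributed.launch normally exports, and some builds (notably older Windows ones) have no env:// handler at all. A sketch of the two usual ways around it:

```python
import os

# Option 1: supply the env:// rendezvous variables yourself for a
# single-process run (these are the variables the launcher would set).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

# Option 2: avoid env:// entirely and pass a file-based rendezvous,
# as in the Windows workaround near the top of this thread:
#   --dist_url "file://D:/somefile.asd"
```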
Hi,
I'm just wondering, is there a way to train this on a single GPU without distributed launch?
Best, Jason