facebookresearch / dino

PyTorch code for Vision Transformers training with the Self-Supervised learning method DINO
Apache License 2.0

Single gpu training #3

Closed: pelletierlab closed this issue 3 years ago

pelletierlab commented 3 years ago

Hi,

I'm just wondering, is there a way to train this on a single GPU without distributed launch?

Best, Jason

iperov commented 3 years ago

I started it with 1 GPU on Windows.

In utils.py:

    dist.init_process_group(
        backend="gloo",   # <-- change "nccl" to "gloo"; NCCL is not available on Windows
        init_method=args.dist_url,
        world_size=args.world_size,
        rank=args.rank,
    )

    args.gpu = 0  # force the GPU index, because by default the wrong index was detected
Before each run, delete the stale rendezvous file (Windows shell):

del D:\somefile.asd

Then launch:

python -m torch.distributed.launch --nproc_per_node=1 main_dino.py --dist_url "file://D:/somefile.asd" --arch deit_small --data_path "faces" --output_dir "dino_save"

If out-of-memory errors occur, decrease --batch_size_per_gpu.
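
As a sanity check before a full run, the short script below can verify that the gloo backend and a file:// rendezvous actually initialize on this machine. It is a hypothetical helper, not part of the repo; it assumes a PyTorch build with Windows distributed support (1.7 or newer) and reuses the D:/somefile.asd rendezvous path from the command above.

import os
import torch
import torch.distributed as dist

store_path = "D:/somefile.asd"                # same rendezvous path as in --dist_url above
if os.path.exists(store_path):
    os.remove(store_path)                     # a stale file from a previous run can make init hang
dist.init_process_group(
    backend="gloo",                           # NCCL is not available on Windows
    init_method="file://" + store_path,
    world_size=1,
    rank=0,
)
print("gloo init OK, CUDA available:", torch.cuda.is_available())
dist.destroy_process_group()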

woctezuma commented 3 years ago

This is going to take a long time.

From the README, the vanilla run requires ~2 days with 8 GPUs:

Run DINO with DeiT-small network on a single node with 8 GPUs for 100 epochs with the following command. Training time is 1.75 day and the resulting checkpoint should reach ~69.3% on k-NN eval and ~73.8% on linear eval.

and the boosted model requires ~3 days with 16 GPUs, for 3x more epochs with 2x more GPUs:

You can improve the performance of the vanilla run by: [...] -> training for more epochs: --epochs 300 [...] The resulting pretrained model should reach ~73.4% on k-NN eval and ~76.1% on linear eval. Training time is 2.6 days with 16 GPUs.

Salah856 commented 3 years ago

interested

mathildecaron31 commented 3 years ago

Hi @pelletierlab, to train on 1 GPU I run: python -m torch.distributed.launch --nproc_per_node=1 main_dino.py --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir

For faster runs, you could use the --arch deit_tiny architecture instead of --arch deit_small.

mathildecaron31 commented 3 years ago

See https://github.com/facebookresearch/dino/commit/534f37f000a10afff97c0d96ec4df81875193699

Now you should be able to run on 1 GPU directly with the following command: python main_dino.py --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir

However, I still recommend using torch.distributed.launch.
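
For reference, the single-GPU fallback works roughly like this (a paraphrased sketch of the idea, not the exact code from that commit; the function name init_single_gpu_fallback and the MASTER_PORT value are made up for illustration). When the script is launched as plain python main_dino.py, there are no RANK/WORLD_SIZE variables in the environment, so it can default to a one-process group on GPU 0:

import os
import torch
import torch.distributed as dist

def init_single_gpu_fallback(args):
    if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
        # launched via torch.distributed.launch: take rank/world size from the environment
        args.rank = int(os.environ["RANK"])
        args.world_size = int(os.environ["WORLD_SIZE"])
        args.gpu = int(os.environ["LOCAL_RANK"])
    elif torch.cuda.is_available():
        # launched as plain `python main_dino.py`: one process, GPU 0, local rendezvous
        args.rank, args.gpu, args.world_size = 0, 0, 1
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group(
        backend="nccl",                  # use "gloo" instead on Windows
        init_method=args.dist_url,       # defaults to "env://" in main_dino.py
        world_size=args.world_size,
        rank=args.rank,
    )
    torch.cuda.set_device(args.gpu)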

ramdhan1989 commented 3 years ago

I am using Windows with PyTorch 1.5.0 and only have 1 GPU. I tried the suggestions above but got the error below when running: python -m torch.distributed.launch --nproc_per_node=1 main_dino.py --data_path C:/Users/Owner/shopee/product_detection/train/train --output_dir checkpoints

Traceback (most recent call last):
  File "main_dino.py", line 461, in <module>
    train_dino(args)
  File "main_dino.py", line 131, in train_dino
    utils.init_distributed_mode(args)
  File "D:\Ramdhan\SSL\dino-main\utils.py", line 456, in init_distributed_mode
    dist.init_process_group(
AttributeError: module 'torch.distributed' has no attribute 'init_process_group'
Traceback (most recent call last):
  File "C:\Users\Owner\Anaconda3\envs\nlp\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\Owner\Anaconda3\envs\nlp\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Owner\Anaconda3\envs\nlp\lib\site-packages\torch\distributed\launch.py", line 263, in <module>
    main()
  File "C:\Users\Owner\Anaconda3\envs\nlp\lib\site-packages\torch\distributed\launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['C:\\Users\\Owner\\Anaconda3\\envs\\nlp\\python.exe', '-u', 'main_dino.py', '--local_rank=0', '--data_path', 'C:/Users/Owner/shopee/product_detection/train/train', '--output_dir', 'checkpoints']' returned non-zero exit status 1.

Please advise, is there any way to run it on Windows with 1 GPU?
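
One likely cause (my reading, not an official answer): PyTorch builds before 1.7 do not ship the distributed package on Windows at all, which is exactly why init_process_group is missing from torch.distributed. A quick check:

import torch
import torch.distributed as dist

print(torch.__version__)
print("distributed available:", dist.is_available())
# If this prints False, upgrading to PyTorch >= 1.7 and using the gloo backend
# as described above is the usual workaround on Windows.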

LLL-YUE commented 1 year ago

Hi @pelletierlab, to train on 1 GPU I run: python -m torch.distributed.launch --nproc_per_node=1 main_dino.py --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir

For faster runs, you could use the --arch deit_tiny architecture instead of --arch deit_small.

I tried this command but got the error RuntimeError: No rendezvous handler for env://. Could you tell me how to solve this problem? Thank you!
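
In case it helps (a guess, not confirmed by the maintainers): "No rendezvous handler for env://" usually means this PyTorch build only registers the file:// rendezvous, which is common on older Windows builds. Since main_dino.py already exposes a --dist_url argument, pointing it at a file:// store is the usual workaround; the path below is just a placeholder for any writable location:

python main_dino.py --dist_url "file:///path/to/rendezvous_file" --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir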