NVIDIA / sentiment-discovery

Unsupervised Language Modeling at scale for robust sentiment classification

Running on specific gpu(s) #46

Open SG87 opened 5 years ago

SG87 commented 5 years ago

I want to train sentiment discovery on a new dataset on a shared DGX-1 machine, where I can only use a limited set of GPUs. How can I restrict the process to run only on GPUs 7 and 8 (indices 6 and 7)?

raulpuric commented 5 years ago

That's actually an interesting use case we haven't addressed.

Normally, when we only want to use part of a shared machine, we just use GPUs 0,1 or 0,1,2, etc.

This is because the NCCL communication library doesn't work properly if you remap the GPUs with CUDA_VISIBLE_DEVICES=6,7.

I'll try to push a small update by tomorrow that should allow you to use any contiguous span of GPUs (e.g. 6,7 or 4,5,6).

Sorry if that still doesn't suit your needs.

The last option I would recommend is running nvidia-docker on your DGX-1 and mapping only the GPUs you need into the container. That should work out of the box with our multiproc.py script.
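(As a hedged, untested example of that container approach: with the older nvidia-docker 1.x wrapper you can restrict the container to specific devices with something like NV_GPU=6,7 nvidia-docker run ...; inside the container those two devices should then enumerate as GPUs 0 and 1, which is the layout multiproc.py already expects.)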

raulpuric commented 5 years ago

You can now train on GPUs 6-7 by running multiproc training with --world_size=2 and --base-gpu=6.

You can also now run training on a particular GPU by setting --base-gpu to that device number (0-indexed).
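As a rough illustration of the idea (this is not the actual multiproc.py/main.py code; the function name and structure are made up), a base-GPU offset maps worker ranks onto a contiguous span of devices like so:

```python
import torch

def select_device(rank, base_gpu=0):
    """Pin worker `rank` to GPU `base_gpu + rank`.

    With --world_size 2 and --base-gpu 6, ranks 0 and 1 would land on
    GPUs 6 and 7. Illustrative sketch only.
    """
    device = base_gpu + rank
    torch.cuda.set_device(device)
    return device
```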

SG87 commented 5 years ago

I am now running: python3 -m multiproc main.py --data data.csv --world_size=1 --base-gpu=7

Is this the correct command? When I inspect nvidia-smi, I see that I am using all GPUs.

Additionally, I receive the error below:

Traceback (most recent call last):
  File "main.py", line 415, in <module>
    model.load_state_dict(torch.load(args.save, 'cpu'))
  File "/home/cmarrecau/.local/lib/python3.5/site-packages/torch/serialization.py", line 358, in load
    return _load(f, map_location, pickle_module)
  File "/home/cmarrecau/.local/lib/python3.5/site-packages/torch/serialization.py", line 544, in _load
    deserialized_storage_keys = pickle_module.load(f)
_pickle.UnpicklingError

terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /pytorch/third_party/gloo/gloo/cuda_private.h:40] error == cudaSuccess. 29 vs 0. Error at: /pytorch/third_party/gloo/gloo/cuda_private.h:40: driver shutting down

(The same traceback and gloo error are printed once per worker process.)

raulpuric commented 5 years ago

Oh sorry you're right.

My argparsing could've been a bit better.

Try python3 -m multiproc main.py --data data.csv --world_size 1 --base-gpu 7 instead.

Also, if you only intend to use one GPU, multiproc is unnecessary.
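(In that case, assuming --base-gpu behaves the same without the launcher, something like python3 main.py --data data.csv --base-gpu 7, optionally with --world_size 1, should be enough, as the follow-up later in this thread confirms.)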

SG87 commented 5 years ago

Thanks! Silly that I couldn't figure that out myself.

If I run the command below, the process runs on GPUs 0 and 7: python3 -m multiproc main.py --data data.csv --world_size 2 --base-gpu 6

[Screenshot: nvidia-smi output, 2018-10-11 11:48]

In the screenshot, PID 30004 is another task running on the machine. The main point is to make sure my runs do not interfere with process 30004.

raulpuric commented 5 years ago

You're right. Just fixed an edge case, so pull and try again.

SG87 commented 5 years ago

> You're right. Just fixed an edge case, so pull and try again.

Should I pull master? It says it is already up to date, and I don't see a new commit either.

raulpuric commented 5 years ago

Hmm, you're right, it doesn't seem to have pushed properly.

raulpuric commented 5 years ago

OK, it should be up properly now.

SG87 commented 5 years ago

I tested the code again.

World size 1 works perfectly: python3 main.py --data data/data.csv --world_size 1 --base-gpu 7

World size 2 gives an error: python3 -m multiproc main.py --data data/data.csv --world_size 2 --base-gpu 6

Error:

Traceback (most recent call last):
  File "main.py", line 402, in <module>
    model.load_state_dict(torch.load(args.save, 'cpu'))
  File "/home/cmarrecau/.local/lib/python3.5/site-packages/torch/serialization.py", line 358, in load
    return _load(f, map_location, pickle_module)
  File "/home/cmarrecau/.local/lib/python3.5/site-packages/torch/serialization.py", line 549, in _load
    deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: storage has wrong size: expected 4358612012146265478 got 256

terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /pytorch/third_party/gloo/gloo/cuda_private.h:40] error == cudaSuccess. 29 vs 0. Error at: /pytorch/third_party/gloo/gloo/cuda_private.h:40: driver shutting down

(The same traceback and gloo error are printed once per worker process.)

SG87 commented 5 years ago

Would it also be possible to add support for world_size and base-gpu to transfer.py (and classifier.py)?

raulpuric commented 5 years ago

Yeah, for some reason torch's serialization library doesn't work properly when reloading the best model checkpoint in the distributed setting. If it's not too much trouble, I would just try running it again but skip the training/validation loops (with a continue or something).
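A minimal sketch of that workaround, using placeholder names rather than the actual structure of main.py: short-circuit the per-epoch work so the run falls straight through to the checkpoint-reload step at the end.

```python
def run(num_epochs, skip_training=True):
    # Placeholder epoch loop; in main.py this would wrap the real
    # training and validation code.
    for epoch in range(num_epochs):
        if skip_training:
            continue  # skip train/val entirely for this run
        print(f"training epoch {epoch}")    # stand-in for the real train step
        print(f"validating epoch {epoch}")  # stand-in for the real val step
    # Stand-in for the final checkpoint reload, i.e. the
    # model.load_state_dict(torch.load(args.save, ...)) call that fails above.
    print("reloading best checkpoint")


run(3)
```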

raulpuric commented 5 years ago

See #48 for a discussion of multi-GPU support in transfer.py.

Any chance you have that file uploaded somewhere?