Closed jack-willturner closed 3 years ago
@jack-willturner DistributedDataParallel was intentional. I found it in an NVIDIA cheat sheet. When you use it on a single machine with multiple GPUs, it assigns one CPU core to each GPU being used, as opposed to one CPU core for all GPUs, which produces a measurable improvement in computational efficiency.
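For reference, the single-machine pattern DistributedDataParallel expects is one worker process per GPU, usually launched with `torch.multiprocessing.spawn`. A minimal sketch (the function name `run_worker` and the port are illustrative, not from this PR; the block is a no-op on a machine without GPUs):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def run_worker(rank, world_size):
    # Each spawned process owns exactly one GPU; nccl is the usual GPU backend.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = torch.nn.Linear(10, 10).to(rank)
    model = DDP(model, device_ids=[rank])
    # ... training loop goes here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    if world_size > 0:
        # One process per visible GPU; spawn passes the rank as the first argument.
        mp.spawn(run_worker, args=(world_size,), nprocs=world_size)
```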
Right, OK.
I've never used DistributedDataParallel before, so I have a few questions: should we just use `nccl` for the backend and hide the choice from the user? Do we have to divide the batch_size by the number of GPUs? Are there any other bits like that to take care of?

Let's keep data parallel as you proposed for now. The distributed data parallel solution takes more time to set up, time that I don't have right now. I'll take care of it, just not today. Let's leave it at a place where it functions.
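On the batch_size question: with DistributedDataParallel each process typically loads its own shard of the data, so the global batch is divided across processes and a `DistributedSampler` keeps the shards disjoint. A hedged sketch (the helper `make_loader` is illustrative; here `world_size` and `rank` are passed explicitly so no process group is needed):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def make_loader(dataset, global_batch_size, world_size, rank):
    # Each process sees global_batch_size // world_size samples per step,
    # and DistributedSampler assigns each rank a disjoint slice of the data.
    per_gpu_batch = global_batch_size // world_size
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    return DataLoader(dataset, batch_size=per_gpu_batch, sampler=sampler)

dataset = TensorDataset(torch.randn(64, 3))
loader = make_loader(dataset, global_batch_size=32, world_size=4, rank=0)
```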
Would be good to go through that cheat sheet at a later date and quantify what kinds of improvements each optimisation brings.
✅ Merging this PR will increase code quality in the affected files by 0.23%.
| Quality metrics | Before | After | Change |
| --- | --- | --- | --- |
| Complexity | 16.81 🙂 | 18.52 😞 | 1.71 👎 |
| Method Length | 103.41 🙂 | 102.00 🙂 | -1.41 👍 |
| Working memory | 17.10 ⛔ | 16.97 ⛔ | -0.13 👍 |
| Quality | 45.46% 😞 | 45.69% 😞 | 0.23% 👍 |

| Other metrics | Before | After | Change |
| --- | --- | --- | --- |
| Lines | 628 | 621 | -7 |

| Changed files | Quality Before | Quality After | Quality Change |
| --- | --- | --- | --- |
| train.py | 29.04% 😞 | 28.31% 😞 | -0.73% 👎 |
| utils/storage.py | 68.14% 🙂 | 71.45% 🙂 | 3.31% 👍 |
Here are some functions in these files that still need a tune-up:
| File | Function | Complexity | Length | Working Memory | Quality | Recommendation |
| --- | --- | --- | --- | --- | --- | --- |
| utils/storage.py | build_experiment_folder | 10 🙂 | 138 😞 | 12 😞 | 50.90% 🙂 | Try splitting into smaller methods. Extract out complex expressions |
| train.py | get_base_argument_parser | 0 ⭐ | 232 ⛔ | 9 🙂 | 58.32% 🙂 | Try splitting into smaller methods |
| utils/storage.py | download_file | 4 ⭐ | 104 🙂 | 11 😞 | 62.58% 🙂 | Extract out complex expressions |
| train.py | train | 1 ⭐ | 95 🙂 | 11 😞 | 66.80% 🙂 | Extract out complex expressions |
| utils/storage.py | restore_model | 5 ⭐ | 75 🙂 | 10 😞 | 67.64% 🙂 | Extract out complex expressions |
The emojis denote the absolute quality of the code.
The 👍 and 👎 indicate whether the quality has improved or gotten worse with this pull request.
Please see our documentation here for details on how these metrics are calculated.
We are actively working on this report - lots more documentation and extra metrics to come!
Let us know what you think of it by mentioning @sourcery-ai in a comment.
We had `DistributedDataParallel` set instead of `DataParallel`. I assume we don't want support for distributed training? I'm not even sure how you would go about it on our machines. Switching to `DataParallel` means that you can now use `num_gpus_to_use` > 1.

I also changed state-dict loading to be more idiomatic (i.e. DataParallel modules are unwrapped before saving, so there is no need to rewrite the names of the dictionary keys when reloading).
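The unwrap-before-save pattern described above looks roughly like this (a sketch; the helper names `save_model`/`load_model` are illustrative and not the actual functions in `utils/storage.py`):

```python
import torch
import torch.nn as nn

_WRAPPERS = (nn.DataParallel, nn.parallel.DistributedDataParallel)

def save_model(model, path):
    # Unwrap (Distributed)DataParallel before saving, so the checkpoint
    # keys carry no "module." prefix.
    to_save = model.module if isinstance(model, _WRAPPERS) else model
    torch.save(to_save.state_dict(), path)

def load_model(model, path):
    # The plain module loads the checkpoint directly; no key renaming needed.
    target = model.module if isinstance(model, _WRAPPERS) else model
    target.load_state_dict(torch.load(path))
    return model
```

This way the same checkpoint restores cleanly whether or not the model is wrapped at load time.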