BayesWatch / pytorch-experiments-template

A PyTorch-based classification experiments template
GNU General Public License v3.0

dataparallel bug fix #57

Closed · jack-willturner closed this 3 years ago

jack-willturner commented 3 years ago

We had DistributedDataParallel set instead of DataParallel. I assume we don't want support for distributed training? I'm not even sure how you would go about it on our machines.

This meant that you couldn't use `num_gpus_to_use` > 1.

I also changed state-dict loading to be more idiomatic: DataParallel-wrapped modules are unwrapped before saving, so there is no need to rewrite the dictionary keys when reloading.
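To make the change concrete, here is a minimal sketch of the two patterns described above. The `num_gpus_to_use` flag is the repo's argument; the helper names are hypothetical:

```python
import torch

def wrap_model(model, num_gpus_to_use):
    # Replicate the model across GPUs only when more than one is requested.
    if num_gpus_to_use > 1:
        model = torch.nn.DataParallel(model)
    return model

def save_checkpoint(model, path):
    # Unwrap DataParallel before saving so the state-dict keys carry no
    # "module." prefix; the checkpoint then loads into a plain model
    # without any key renaming.
    to_save = model.module if isinstance(model, torch.nn.DataParallel) else model
    torch.save(to_save.state_dict(), path)
```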

AntreasAntoniou commented 3 years ago

@jack-willturner DistributedDataParallel was intentional; I found it in an NVIDIA cheat sheet. When you use it on a single machine with multiple GPUs, it assigns one CPU core to each GPU being used, as opposed to one CPU core for all GPUs, which produces measurable computational efficiency improvements.
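For anyone comparing the two on a single machine: DataParallel is a one-line wrapper driven by a single process, whereas DistributedDataParallel launches one process per GPU. A minimal single-machine sketch (illustrative, not this repo's code):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # One process per GPU; each process owns exactly one device.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = torch.nn.Linear(10, 2).to(rank)
    model = DDP(model, device_ids=[rank])
    # ... training loop: gradients are all-reduced across processes ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```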

jack-willturner commented 3 years ago

Right, OK.

I've never used DistributedDataParallel before, so I have a few questions:

AntreasAntoniou commented 3 years ago

Let's keep DataParallel as you proposed for now. The DistributedDataParallel solution takes more time to set up than I have right now. I'll take care of it, just not today. Let's leave the code in a state where it functions.

jack-willturner commented 3 years ago

It would be good to go through that cheat sheet at a later date and quantify what kind of improvement each optimisation brings.

AntreasAntoniou commented 3 years ago

Agreed. For reference: https://nvlabs.github.io/eccv2020-mixed-precision-tutorial/files/szymon_migacz-pytorch-performance-tuning-guide.pdf
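As a starting point for that benchmarking, here is a sketch of a few of the guide's low-effort suggestions, with an illustrative data loader (the dataset and batch sizes are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Let cuDNN benchmark convolution algorithms when input shapes are static.
torch.backends.cudnn.benchmark = True

# Placeholder dataset standing in for CIFAR-style inputs.
dataset = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))

# Worker processes plus pinned host memory let input preparation and
# host-to-device copies overlap with GPU compute.
loader = DataLoader(dataset, batch_size=256, num_workers=4, pin_memory=True)

for images, targets in loader:
    # non_blocking=True makes the copy asynchronous with respect to the GPU.
    images = images.cuda(non_blocking=True)
    targets = targets.cuda(non_blocking=True)
    # ... forward/backward pass ...
```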

sourcery-ai[bot] commented 3 years ago

Sourcery Code Quality Report

✅  Merging this PR will increase code quality in the affected files by 0.23%.

| Quality metrics | Before | After | Change |
| --- | --- | --- | --- |
| Complexity | 16.81 🙂 | 18.52 😞 | 1.71 👎 |
| Method Length | 103.41 🙂 | 102.00 🙂 | -1.41 👍 |
| Working memory | 17.10 ⛔ | 16.97 ⛔ | -0.13 👍 |
| Quality | 45.46% 😞 | 45.69% 😞 | 0.23% 👍 |

| Other metrics | Before | After | Change |
| --- | --- | --- | --- |
| Lines | 628 | 621 | -7 |

| Changed files | Quality Before | Quality After | Quality Change |
| --- | --- | --- | --- |
| train.py | 29.04% 😞 | 28.31% 😞 | -0.73% 👎 |
| utils/storage.py | 68.14% 🙂 | 71.45% 🙂 | 3.31% 👍 |

Here are some functions in these files that still need a tune-up:

| File | Function | Complexity | Length | Working Memory | Quality | Recommendation |
| --- | --- | --- | --- | --- | --- | --- |
| utils/storage.py | build_experiment_folder | 10 🙂 | 138 😞 | 12 😞 | 50.90% 🙂 | Try splitting into smaller methods. Extract out complex expressions |
| train.py | get_base_argument_parser | 0 ⭐ | 232 ⛔ | 9 🙂 | 58.32% 🙂 | Try splitting into smaller methods |
| utils/storage.py | download_file | 4 ⭐ | 104 🙂 | 11 😞 | 62.58% 🙂 | Extract out complex expressions |
| train.py | train | 1 ⭐ | 95 🙂 | 11 😞 | 66.80% 🙂 | Extract out complex expressions |
| utils/storage.py | restore_model | 5 ⭐ | 75 🙂 | 10 😞 | 67.64% 🙂 | Extract out complex expressions |

Legend and Explanation

The emojis denote the absolute quality of the code, from ⭐ (best) through 🙂 and 😞 to ⛔ (worst). The 👍 and 👎 indicate whether the quality has improved or gotten worse with this pull request.


Please see our documentation here for details on how these metrics are calculated.

We are actively working on this report - lots more documentation and extra metrics to come!

Let us know what you think of it by mentioning @sourcery-ai in a comment.