biomag-lab / hypocotyl-UNet


how to use GPU options in the train.py command? #2

Open amoudomola opened 4 years ago

amoudomola commented 4 years ago

Hi Tivadar,

I'm afraid you may not be available during this year-end period. I tried training with my humble notebook setup, which took a whole lot of time. Even on a computer with an 8-core Xeon processor, it took about 8 days to train with 4 training images, so I realized I need a GPU setup.

I know it's too much for me, but I ended up getting a GTX 2080. (I was really hesitant to buy the 20xx series because there are so many complaints about it, not to mention its notorious driver installation. I still have a completely black screen without even a cursor blink, but SSH is working fine, so I'm ready(?) for my analysis.)

Even though you mentioned GPU-related options, including --batch_size and --device, in the tutorial, it's still difficult for a biologist like me with little scripting background to write proper commands with no examples. Would you guide me a little more on how to use the GPU in your train script?

Another thing I want to add is that I got an error message when I ran the train.py script. It was an error about tf_train not being defined at line 27.

I just moved the following lines in your train.py script up to the very beginning of the data loading code block. I'm not sure whether that's right; it could just be because of my Python setup or something. Anyway, it solved my problem, and there might be others having the same issue with train.py. You could look into it and correct it properly if there is anything that could be done better.

tf_train = make_transform(crop=(512, 512), long_mask=True, p_random_affine=0.0)
tf_validate = make_transform(crop=(512, 512), long_mask=True, rotate_range=False,
                            p_flip=0.0, normalize=False, color_jitter_params=None)
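
For reference, after the move the top of the data-loading part of my train.py looks roughly like the sketch below. The dataset lines are only placeholders I made up to show the ordering, not the real calls in train.py:

# the transforms must exist before the data-loading code uses them
tf_train = make_transform(crop=(512, 512), long_mask=True, p_random_affine=0.0)
tf_validate = make_transform(crop=(512, 512), long_mask=True, rotate_range=False,
                             p_flip=0.0, normalize=False, color_jitter_params=None)

# placeholder dataset construction, just to illustrate where the transforms are used
train_dataset = load_dataset(args.train_dataset, transform=tf_train)    # hypothetical helper
val_dataset = load_dataset(args.val_dataset, transform=tf_validate)     # hypothetical helper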

Thanks in advance, and thank you for your comment on the earlier issue as well, Goh

cosmic-cortex commented 4 years ago

Hi!

No worries, I am available. Feel free to message me anytime or just reply to this thread, I get an email notification about every post so I won't miss it. To be honest I am a little bored during holidays so I work anyway :)

  1. What kind of operating system do you have on the computer with the GTX 2080? Perhaps I can help you install the driver if you tell me which one it is. If you would like to check whether the GPU is accessible to PyTorch (the deep learning framework used in the code), you can execute this in a Python console:
import torch

torch.cuda.is_available()

If the second call returns True, you are good to go.
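
If you also want to confirm which card PyTorch sees, a quick check like the one below should print its name (assuming at least one CUDA device is visible):

import torch

# prints the name of the first CUDA device PyTorch can see,
# e.g. "GeForce RTX 2080 Ti"
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))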

  2. Thanks for the error report! I have added the missing transforms. They were probably removed during a refactor where I cleaned up the code to make this repository. There is a lesson to be learned here for me: I should always treat every piece of code I write as if it were intended for production :)

  3. For a concrete example of the command used to run training, you can try something like this:

python3 train.py --train_dataset /path/to/train/data \
--val_dataset /path/to/validation/data \
--batch_size 2 \
--device cuda

This ran for me without errors. I would suggest running nvidia-smi during training and monitoring the memory usage, increasing batch_size if there is memory left. The output looks something like this, depending on your hardware:

Thu Dec 26 19:35:43 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 166...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   47C    P8    11W /  N/A |    478MiB /  5914MiB |     13%      Default |
+-------------------------------+----------------------+----------------------+

You can see the memory usage in the last row of this table. In general, the larger the batches you can use, the better. For the results in the paper, a batch size of 4 was used.
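
If you prefer checking from inside Python rather than nvidia-smi, something along these lines reports what PyTorch itself has allocated (only for the current process; the exact function names can differ between PyTorch versions):

import torch

# memory occupied by tensors in this process, and the (usually larger)
# amount reserved by PyTorch's caching allocator
print(torch.cuda.memory_allocated() / 1024 ** 2, "MiB allocated")
print(torch.cuda.memory_reserved() / 1024 ** 2, "MiB reserved")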

amoudomola commented 4 years ago

Hi Tivadar,

Thank you for your prompt response.

1. First, I installed a driver with a .run file (440.44) from the NVIDIA website. CUDA (10.2.89) was installed using the apt-get package manager, following the commands from the NVIDIA website as well. Later, I reinstalled the driver (440.44) using apt-get while I was troubleshooting the black screen issue, hoping to fix the display. In both cases, the driver installation seemed to be OK. I didn't install cuDNN.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0 Off |                  N/A |
| 35%   32C    P8    20W / 260W |      7MiB / 11018MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1128      G   /usr/lib/xorg/Xorg                             5MiB |
+-----------------------------------------------------------------------------+        

CUDA seems to work as well.

>>> import torch
>>> torch.cuda.is_available()
True
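
For completeness, here is a quick way to see which versions PyTorch itself reports; as far as I understand, the pip/conda PyTorch packages ship their own cuDNN, so a separate cuDNN install shouldn't be needed:

import torch

print(torch.__version__)                # PyTorch version
print(torch.version.cuda)               # CUDA version PyTorch was built against
print(torch.backends.cudnn.version())   # cuDNN bundled with the PyTorch binaries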

Regarding the black display problem I have, I think it's a platform-specific issue, and I guess you would get tons of questions if you started answering these kinds of issues. I might need to ask the NVIDIA tech support team about it.

But ;->, I have an Ubuntu 18.04 environment with the default (gdm?) GUI manager and an 'R'TX 2080 Ti Black with 11G of memory. Originally, this server only had integrated Intel(?) graphics and was connected to a monitor with an old VGA cable. If I put the RTX board in, I get nothing on the screen, and it was still the same with an HDMI connection to the new 2080 graphics card. I have left it as it was (VGA cable), so it's like a dual graphics card(?) setup now. I have read that people at least get a cursor, but I get nothing. However, it seems to be just a screen output problem. If I attempt to log in blindly, typing my id and password after pressing many(?) ENTERs or using a shortcut like Ctrl+Alt+F3, I can then access the server through SSH remotely.

I might have mixed up installation processes or created configuration files in the wrong places, because I'm not that familiar with Linux systems (and naturally clumsy as well), and I obtained solutions from here and there. I did the following through SSH, although I don't remember it all exactly (so it may not be very helpful).

I haven't tried all the combinations I could yet, and I might need to turn gdm back on by deleting/updating the configuration files I manipulated. I'm also thinking of assigning the monitor and display settings manually in the Xorg configuration file.

If there is no simple solution you have in mind, then please don't be bothered. In addition, our lab server is busy with another job at the moment, so I have to wait a bit before fixing the display issue.

1-1. disabled nouveau by creating a blacklist-nouveau.conf file:

/etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
alias nouveau off

1-2. stopped the display manager

sudo service gdm stop
sudo systemctl stop gdm (maybe it's redundant??)

1-3. changed the GRUB file as well, from GRUB_CMDLINE_LINUX_DEFAULT="quiet splash" to GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nouveau.modeset=0", then ran

sudo update-grub

1-4. confirmed the kernel modules

lsmod | grep nouveau  # (and similar checks for gdm; no nouveau, gdm, or other graphics daemons were loaded)

update-initramfs -u

1-5. reboot

1-6. BIOS setup: disabled fast boot, unloaded the platform keys, and selected "Other OS" for the OS type (I actually don't know what these do, except for disabling fast boot). I don't remember the CSM (compatibility support module) setting either.

1-7. switched to runlevel 3 with sudo init 3

1-8. installed the 2080 Ti driver

1-9. rebooted and installed CUDA

1-10. rebooted

2. Thank you for reviewing and fixing this.

3. It looks like there is not much to change in the command. I thought I needed to be very specific with --batch_size, like 11G or something; and for --device, just --device cuda seems to be enough instead of a number like cuda:$ID.
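
Just to check my understanding (this is not from the tutorial): the device string seems to be handed straight to PyTorch, so on a single-GPU machine cuda and cuda:0 should point to the same card.

import torch

# "cuda" and "cuda:0" refer to the same device on a single-GPU machine;
# the index only matters when several GPUs are installed
print(torch.device("cuda"), torch.device("cuda:0"))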

Thanks again, Goh

darvida commented 4 years ago

I use --device=cuda:0, and after a couple of seconds I get back to the terminal window without the training starting and without any error. torch.cuda.is_available() returns True. I can't figure out what might be wrong.

cosmic-cortex commented 4 years ago

--device=cuda:0 is not formatted properly; the = sign is not needed. Can you try --device cuda:0 or simply --device cuda?

darvida commented 4 years ago

I reinstalled CUDA and got the training to start, but now I get the following error instead:

C:/w/b/windows/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:208: block: [883,0,0], thread: [12,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
C:/w/b/windows/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:208: block: [885,0,0], thread: [12,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
C:/w/b/windows/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:208: block: [881,0,0], thread: [12,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
Traceback (most recent call last):
  File "train.py", line 70, in
    verbose=False, save_freq=args.model_save_freq)
  File "C:\hypocotyl-UNet-master\src\unet\utils.py", line 298, in train_model
    epoch_running_loss += training_loss.item()
RuntimeError: CUDA error: device-side assert triggered

cosmic-cortex commented 4 years ago

Unfortunately, I am unable to tell what is wrong here. Can you put a breakpoint at line 298 of utils.py and inspect the training_loss?

darvida commented 4 years ago

I got it to work. I printed out the image name at every step and removed the image that caused the error, and now it seems to work fine :)

cosmic-cortex commented 4 years ago

Awesome! When something like this happens, there is a high chance that one (or a few) of the images do not conform to the specification, so this is certainly a useful debugging technique.
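
For anyone hitting this later, a standalone check in the same spirit could look like the sketch below. It is only a sketch: the masks folder path and the set of allowed label values are assumptions, so adjust them to your own dataset layout.

import os

import numpy as np
from skimage import io

masks_dir = "/path/to/train/data/masks"   # assumption: masks live in their own folder
allowed_values = {0, 1, 2}                # assumption: background plus two foreground classes

# scan every mask and report label values outside the expected set; out-of-range
# labels are a typical cause of the "index out of bounds" device-side assert
for fname in sorted(os.listdir(masks_dir)):
    mask = io.imread(os.path.join(masks_dir, fname))
    unexpected = set(np.unique(mask).tolist()) - allowed_values
    if unexpected:
        print(f"{fname}: unexpected label values {sorted(unexpected)}")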