Consider the following code:
I get results of:
Clearly I am doing something wrong, as it should not take 4 seconds to initialize CUDA. What am I doing wrong?
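The code and its output were not captured above; a minimal sketch of what the timed example likely looks like with the tch crate (the 5x3 shape and the "1st / next 3" split are reconstructed from the timings quoted later in the thread, so treat them as assumptions):

```rust
use std::time::Instant;
use tch::{Device, Kind, Tensor};

fn main() {
    // Report CUDA availability and the chosen device, matching the printed output.
    println!("cuda: {}", tch::Cuda::is_available());
    let device = Device::Cuda(0);
    println!("device: {:?}", device);

    // The first tensor created on a CUDA device pays the CUDA initialization cost.
    let start = Instant::now();
    let _first = Tensor::zeros(&[5, 3], (Kind::Float, device));
    println!("time to create 1st: {:?}", start.elapsed());

    // Later allocations reuse the already-initialized CUDA context and are fast.
    let start = Instant::now();
    for _ in 0..3 {
        let _t = Tensor::zeros(&[5, 3], (Kind::Float, device));
    }
    println!("time to create next 3: {:?}", start.elapsed());
}
```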
I don't see anything obviously wrong in your code (I imagine you have at least 2 GPUs, as they are numbered from 0). You may want to try the CPU device to check that it's faster; in my case CUDA takes ~1 s to initialize whereas the CPU is near-instantaneous. I would not be surprised if this is just the standard initialization cost. You may also want to run the same thing in a Python script to compare - for me the timings are pretty similar between Python and Rust (or OCaml when using ocaml-torch).
I have 3 GPUs:
- 1080 Ti: powering 4 monitors
- 1080 Ti: pure CUDA
- 980 Ti: pure CUDA
I'm now running
cargo run --example part00 --release
Trying tch::Device::Cuda(0), Cuda(1), and Cuda(2) in turn, we get:
cuda: true
device: Cuda(0)
time to create 1st: 4.297824364s
time to create next 3: 1.316713ms
== rerun
cuda: true
device: Cuda(1)
time to create 1st: 4.346540828s
time to create next 3: 1.063364ms
== rerun
cuda: true
device: Cuda(2)
time to create 1st: 4.010571072s
time to create next 3: 1.389762ms
It appears that for all 3 devices, there's a 4 s delay.
Testing python3:
cat test.py ; time python3 test.py;
import torch;
print(torch.zeros(5, 3).cuda());
tensor([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]], device='cuda:0')
real 0m5.087s
user 0m4.178s
sys 0m2.515s
The same delay across GPUs is certainly expected. What if you replace tch::Device::Cuda(1) with tch::Device::Cpu? I would expect it to take less than a couple of milliseconds, and all of this is probably in line with Python, so the cost comes more from CUDA (or from the way PyTorch uses CUDA).
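One cheap way to do that swap is to select the device in a single place, e.g. from a command-line argument. A rough sketch (the helper and its argument handling are assumptions, not code from this thread):

```rust
use tch::Device;

// Hypothetical helper: pick the device from an optional command-line argument so
// the same example can be timed on the CPU and on each GPU without editing code.
fn device_from_arg(arg: Option<&str>) -> Device {
    match arg {
        None => Device::cuda_if_available(), // falls back to Device::Cpu if no GPU is visible
        Some("cpu") => Device::Cpu,
        Some(idx) => Device::Cuda(idx.parse().expect("expected a GPU index or \"cpu\"")),
    }
}

fn main() {
    let device = device_from_arg(std::env::args().nth(1).as_deref());
    println!("device: {:?}", device);
}
```

Passing cpu keeps CUDA entirely out of the picture, so only GPU runs pay the initialization cost.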
cuda: true
device: Cpu
time to create 1st: 69.973µs
time to create next 3: 105.357µs
So we have:
- GPU (Cuda 0-2): > 4 s
- CPU: < 1 ms
- PyTorch CUDA: > 4 s
rpm -qa | grep cuda-toolkit
cuda-toolkit-10-1-10.1.168-1.x86_64
python3
Python 3.7.2 (default, Mar 21 2019, 10:09:12)
[GCC 8.3.1 20190223 (Red Hat 8.3.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch; print(torch.__version__);
1.1.0
GPU Info
nvidia-smi
Mon Jul 8 01:47:11 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 980 Ti Off | 00000000:01:00.0 Off | N/A |
| 26% 38C P8 12W / 250W | 1MiB / 6083MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:02:00.0 On | N/A |
| 17% 56C P0 61W / 250W | 766MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:83:00.0 Off | N/A |
| 0% 43C P8 10W / 250W | 2MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
This is clearly no longer a tch-rs issue as the PyTorch script above is also slow. Can you point me to where to look next to track down why PyTorch is taking > 4s to initialize?
Googling around, it seems like there is some type of Python / Torch JIT problem.
Either way, it's no longer a tch-rs issue, so I'm closing the issue. Thanks for helping me binary-search / narrow the issue.
No problem, and happy to get any feedback/issues/PRs for tch-rs if you start using it.
This is a bit off topic, but would you say that Python/OCaml have fundamentally better support for PyTorch than Rust, for the following reasons?
- In Python/OCaml, thanks to the REPL, we pay the CUDA init cost once and then just keep sending code to the REPL, so it never bothers us much.
- In Rust, due to the lack of a good REPL, the closest thing to interactive development is "unit test driven development" -- but every rerun of a unit test is a new process, which means paying the CUDA init cost on every "iteration".
I guess it varies a lot depending on the use case but overall:
I don't do much 'test driven' development, but I think you could run your tests in CPU mode (at least I do) until you're happy with them - and if the models are too large, you can do the same thing as in Python and iterate over smaller datasets/batch sizes/network depths.
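As an illustration, a minimal sketch of such a CPU-only test with the tch crate (the matmul and the shapes are placeholders, not something from this thread):

```rust
#[cfg(test)]
mod tests {
    use tch::{Device, Kind, Tensor};

    #[test]
    fn matmul_shape_on_cpu() {
        // Running on Device::Cpu keeps every `cargo test` invocation free of the
        // multi-second CUDA initialization; switch to Device::Cuda(0) only once
        // the logic is settled and you actually want GPU timings.
        let device = Device::Cpu;
        let a = Tensor::randn(&[4, 3], (Kind::Float, device));
        let b = Tensor::randn(&[3, 2], (Kind::Float, device));
        let c = a.matmul(&b);
        assert_eq!(c.size(), &[4, 2]);
    }
}
```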
Thanks for the insightful response. I'm familiar with Rust/IntelliJ and a novice with OCaml/Python.
I was trying to rewrite https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#sphx-glr-beginner-blitz-tensor-tutorial-py line by line in Rust -- and it was infuriating because things that would have been nearly instantaneous in OCaml/Python had this annoying 4-second delay in Rust.
However, the nice thing is that since most DL development involves writing lots of code at once and then running it for long periods of time, the 4-second delay probably won't matter in "large scale" development. It's infuriating only because I'm trying to learn the API by running "tiny one-line experiments."
This tutorial indeed looks like the typical use case where the REPL in Python or OCaml would help a lot. Hopefully, once you start writing models that are a bit more involved, this will be less of an issue.