LaurentMazare / tch-rs

Rust bindings for the C++ api of PyTorch.
Apache License 2.0

creating first tensor takes 4 seconds #61

Closed: zeroexcuses closed this issue 5 years ago

zeroexcuses commented 5 years ago

Consider the following code:

extern crate tch;
use tch::{Cuda, Tensor};

pub fn main() {
    println!("cuda: {:?}", Cuda::is_available());

    let opts = (tch::Kind::Float, tch::Device::Cuda(1));

    let start = std::time::Instant::now();
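    // the first tensor created on the CUDA device; timed separately from the rest below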
    let x_empty = Tensor::empty(&[5, 3], opts);
    let mid = std::time::Instant::now();

    let x_rand = Tensor::rand(&[5, 3], opts);
    let x_zeros = Tensor::zeros(&[5, 3], opts);
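    // note: of_slice builds a 1-D CPU tensor from the slice values (5 and 3), not a 5x3 tensor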
    let t = Tensor::of_slice(&[5, 3]);

    let end = std::time::Instant::now();

    println!("time to create 1st tensor: {:?}", mid - start);
    println!("time to create next 3 tensor: {:?}", end - mid);

    println!("start: {:?}", start);
    println!("mid: {:?}", mid);
    println!("end: {:?}", end);
}

I get results of:

cuda: true
time to create 1st tensor: 4.124049426s
time to create next 3 tensor: 907.468µs
start: Instant { tv_sec: 28481, tv_nsec: 825629454 }
mid: Instant { tv_sec: 28485, tv_nsec: 949678880 }
end: Instant { tv_sec: 28485, tv_nsec: 950586348 }

Clearly I am doing something wrong, as it should not take 4 seconds to initialize CUDA. What am I doing wrong?

LaurentMazare commented 5 years ago

I don't see anything obviously wrong in your code (I imagine you have at least 2 GPUs, as they are numbered from 0). Maybe try the cpu device to check that it's faster: in my case CUDA takes ~1s to initialize whereas the CPU is ~instantaneous. I would not be surprised if this is just the standard initialization cost. You may want to run the same thing in a Python script to compare; for me the timings are pretty similar between Python and Rust (or OCaml when using ocaml-torch).
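
For reference, a minimal sketch of that comparison (the helper below is illustrative, not code from this thread): time the first and second tensor allocation on the CPU and on each visible CUDA device, so any one-time initialization cost only shows up in the first measurement.

extern crate tch;
use std::time::Instant;
use tch::{Device, Kind, Tensor};

// Hypothetical helper: measure the first and second allocation on one device.
fn time_first_tensor(device: Device) {
    let start = Instant::now();
    let _first = Tensor::zeros(&[5, 3], (Kind::Float, device));
    let mid = Instant::now();
    let _second = Tensor::zeros(&[5, 3], (Kind::Float, device));
    let end = Instant::now();
    println!("{:?}: first {:?}, second {:?}", device, mid - start, end - mid);
}

pub fn main() {
    println!("cuda: {:?}", tch::Cuda::is_available());
    time_first_tensor(Device::Cpu);
    // device_count() reports the number of visible CUDA devices.
    for i in 0..tch::Cuda::device_count() as usize {
        time_first_tensor(Device::Cuda(i));
    }
}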

zeroexcuses commented 5 years ago

I have 3 GPUs:

  - 1080 Ti: powering 4 monitors
  - 1080 Ti: pure CUDA
  - 980 Ti: pure CUDA

I'm now running

cargo run --example part00 --release

Trying tch::Device::Cuda(0), Cuda(1), and Cuda(2) in turn, I get:

cuda: true
device: Cuda(0)
time to create 1st: 4.297824364s
time to create next 3: 1.316713ms

== rerun

cuda: true
device: Cuda(1)
time to create 1st: 4.346540828s
time to create next 3: 1.063364ms

== rerun

cuda: true
device: Cuda(2)
time to create 1st: 4.010571072s
time to create next 3: 1.389762ms

It appears that for all 3 devices, there's a 4 s delay.

zeroexcuses commented 5 years ago

Testing python3:

cat test.py ; time python3 test.py;
import torch;
print(torch.zeros(5, 3).cuda());

tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]], device='cuda:0')

real    0m5.087s
user    0m4.178s
sys     0m2.515s

LaurentMazare commented 5 years ago

The same delay across GPUs is certainly expected. What if you replace tch::Device::Cuda(1) with tch::Device::Cpu? I would expect that to take less than a couple of milliseconds. All of this is probably in line with Python, so the cost comes more from CUDA (or the way PyTorch uses CUDA).

zeroexcuses commented 5 years ago

cuda: true
device: Cpu
time to create 1st: 69.973µs
time to create next 3: 105.357µs

So we have:

  - GPU 0-2: > 4 s
  - CPU: < 1 ms
  - PyTorch (Python) with CUDA: > 4 s

zeroexcuses commented 5 years ago

rpm -qa | grep cuda-toolkit
cuda-toolkit-10-1-10.1.168-1.x86_64

python3
Python 3.7.2 (default, Mar 21 2019, 10:09:12) 
[GCC 8.3.1 20190223 (Red Hat 8.3.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch; print(torch.__version__);
1.1.0

GPU Info

nvidia-smi 
Mon Jul  8 01:47:11 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980 Ti  Off  | 00000000:01:00.0 Off |                  N/A |
| 26%   38C    P8    12W / 250W |      1MiB /  6083MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:02:00.0  On |                  N/A |
| 17%   56C    P0    61W / 250W |    766MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:83:00.0 Off |                  N/A |
|  0%   43C    P8    10W / 250W |      2MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

This is clearly no longer a tch-rs issue, as the PyTorch script above is also slow. Can you point me to where to look next to track down why PyTorch takes > 4 s to initialize CUDA?

zeroexcuses commented 5 years ago

Googling around, it seems like there is some type of Python / Torch JIT problem.

Either way, it's no longer a tch-rs issue, so I'm closing it. Thanks for helping me binary-search and narrow down the problem.

LaurentMazare commented 5 years ago

No problem, and happy to get any feedback/issue/PR for tch-rs if you start using it.

zeroexcuses commented 5 years ago

This is a bit off topic, but would you say that Python and OCaml have fundamentally better support for PyTorch than Rust, for the following reasons?

  1. In Python/OCaml, thanks to the REPL, we pay the CUDA init cost once, then just keep sending code to the REPL, so it never bothers us much.

  2. In Rust, due to the lack of a good REPL, the closest thing to interactive development is "unit test driven development". The issue is that every rerun of a unit test is a new program, so we pay the CUDA init cost on every iteration.

LaurentMazare commented 5 years ago

I guess it varies a lot depending on the use case but overall:

I don't do much 'test driven' development, but I think you could run your tests in cpu mode (at least that's what I do) until you're happy with them. If the models are too large, you can do the same thing as in Python: iterate on smaller datasets, batch sizes, or network depths.
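
A minimal sketch of that workflow (the TCH_TEST_DEVICE variable and the helper are hypothetical, not part of tch-rs): tests default to the CPU and only pay the CUDA initialization cost when the variable is explicitly set.

use tch::{Device, Kind, Tensor};

// Hypothetical helper: pick the test device from an environment variable so
// everyday `cargo test` runs stay on the CPU and skip the CUDA startup delay.
fn test_device() -> Device {
    match std::env::var("TCH_TEST_DEVICE").as_deref() {
        Ok("cuda") => Device::Cuda(0),
        _ => Device::Cpu,
    }
}

#[test]
fn zeros_has_expected_shape() {
    let t = Tensor::zeros(&[5, 3], (Kind::Float, test_device()));
    assert_eq!(t.size(), [5_i64, 3]);
}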

zeroexcuses commented 5 years ago

Thanks for the insightful response. I'm familiar with Rust/IntelliJ and a novice with OCaml/Python.

I was trying to rewrite https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#sphx-glr-beginner-blitz-tensor-tutorial-py line by line in Rust, and it was infuriating because things that would have been nearly instantaneous in OCaml/Python had this annoying 4 second delay in Rust.

However, the nice thing is that since most DL development involves writing a lot of code at once and then running it for long periods of time, the 4 second delay probably won't matter in "large scale" development. It's infuriating only because I'm learning the API by running tiny one-line experiments.

LaurentMazare commented 5 years ago

This tutorial indeed looks like the typical use case where the REPL in Python or OCaml helps a lot. Hopefully, once you start writing models that are a bit more involved, this will be less of an issue.