atong01 / conditional-flow-matching

TorchCFM: a Conditional Flow Matching library
https://arxiv.org/abs/2302.00482
MIT License

[Improvement Suggestion] Device Handling - Reduce CPU-GPU Overhead #102

Closed: ImahnShekhzadeh closed this issue 10 months ago

ImahnShekhzadeh commented 10 months ago

Hi!

I noticed that in several places in the code, e.g. in the conditional_mnist.ipynb tutorial, there is a small amount of avoidable CPU-GPU transfer overhead. For example, the line

torch.randn(100, 1, 28, 28).to(device)

first creates the tensor on the CPU and then copies it to the GPU (when device is cuda:0).

It is possible to create the tensor directly on the specified device via

torch.randn(100, 1, 28, 28, device=device)

which avoids the intermediate CPU allocation and the host-to-device copy.
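
For reference, a minimal sketch of the two patterns side by side (the device selection line is my assumption; the notebook may set up device differently):

```python
import torch

# Assumed device setup; the notebook may define `device` differently.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Current pattern: allocates the tensor on the CPU, then copies it to the GPU.
x = torch.randn(100, 1, 28, 28).to(device)

# Suggested pattern: allocates the tensor directly on the target device, no copy.
x = torch.randn(100, 1, 28, 28, device=device)
```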

kilianFatras commented 10 months ago

Hello,

You are right, it would indeed be more efficient to create tensors this way. I do not mind the current implementation, as the purpose of the tutorials is to show how flow matching methods can be trained. I will check the CIFAR-10 training, but that will have to wait a bit.

ImahnShekhzadeh commented 10 months ago

train_cifar10.py is indirectly affected by this, since utils_cifar.py has some CPU-GPU overhead, e.g. at line 37 in the generate_samples() function. Some other Python scripts are affected as well.

I am currently working on a fix and can open a draft PR once I am finished.
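
As a rough illustration of the kind of change involved, here is a hypothetical sketch of a sampling function; the real generate_samples() in utils_cifar.py wraps the trained model in an ODE solver, so the model call and the Euler loop below are placeholders, and only the noise allocation line reflects the suggested fix:

```python
import torch

@torch.no_grad()
def generate_samples_sketch(model, device, batch_size=64):
    # Suggested fix: create the initial noise directly on the target device
    # instead of torch.randn(batch_size, 3, 32, 32).to(device).
    x = torch.randn(batch_size, 3, 32, 32, device=device)

    # Placeholder Euler integration of the learned vector field from t=0 to t=1;
    # the actual script uses a proper ODE solver instead.
    n_steps = 100
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((batch_size,), i * dt, device=device)
        x = x + dt * model(t, x)
    return x
```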

atong01 commented 10 months ago

That's great, yep, a simple improvement. I was never too worried about this, as I was fairly sure it didn't matter much given that most of the time is spent in the UNet. I would be very curious whether you notice a speedup! Regardless, it is slightly better, and the PR is welcome.

ImahnShekhzadeh commented 10 months ago

I performed some tests by running train_cifar10.py on a 4090 and did not notice a speedup (only minor fluctuations). I will open a PR shortly.
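
In case anyone wants to compare the two allocation patterns in isolation, here is a rough micro-benchmark sketch (the helper name and iteration count are arbitrary choices of mine; results will of course vary across machines):

```python
import time
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def time_alloc(fn, n_iters=1000):
    """Time n_iters tensor allocations, synchronizing so GPU work is counted."""
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        fn()
    if device.type == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - start

cpu_then_copy = lambda: torch.randn(100, 1, 28, 28).to(device)
direct_on_device = lambda: torch.randn(100, 1, 28, 28, device=device)

print("CPU alloc + copy :", time_alloc(cpu_then_copy))
print("direct on device :", time_alloc(direct_on_device))
```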