Investigate more GPU usage

DigitalSlideArchive / superpixel-classification

A test cli for classifying superpixels with arbitrary labels


Investigate more GPU usage #17

Closed manthey closed 1 week ago

manthey commented 1 month ago

It appears that we aren't using the GPU for some of the model training steps. For instance, in BatchBALD mode, the GPU appears to be used only when computing the sampled joint entropy. Can we also use it during model training, and during the beginning of the prediction process? I haven't checked GPU usage for the other metrics.

Further, is there a way to increase the batch size based on the GPU memory? It looks like we only use a small fraction of available memory.

manthey commented 1 month ago

@Leengit If you want to investigate this, please do so.

Leengit commented 1 month ago

Whether torch uses the GPU often comes down to how torch was installed. To document it in this issue ...

The torch installation process is not easy. Sometimes one can first install light-the-torch and then use it to install torch. I have had better luck using pip directly, with something like --extra-index-url https://download.pytorch.org/whl/cu117. Here, 117 specifies that the system we are installing on uses CUDA 11.7; change it to match the CUDA version in use (for example, 121 for CUDA 12.1).
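To confirm which kind of torch build ended up installed, a quick check along the following lines can help. This is a minimal sketch: the function name torch_gpu_report is made up for illustration, and the snippet returns early when torch is not installed so that it can be run anywhere.

```python
import importlib.util


def torch_gpu_report():
    """Summarize whether the installed torch build can use a GPU.

    torch_gpu_report is a hypothetical helper name, not part of this repo.
    """
    if importlib.util.find_spec("torch") is None:
        # torch is not installed in this environment at all.
        return {"installed": False}
    import torch

    return {
        "installed": True,
        "torch_version": torch.__version__,
        # torch.version.cuda is None for CPU-only builds.
        "built_for_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


print(torch_gpu_report())
```

If built_for_cuda comes back as None, the wheel is a CPU-only build, and adding device moves in the code will not help until torch is reinstalled from a CUDA index.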

If that is not the issue here, then the code may need some .to("cuda") (or .to("cuda:0")) calls in key places for both the model and the data to be processed. If the GPU will be available only sometimes, then we have to check for its presence first with something like device = "cuda" if torch.cuda.is_available() and torch.cuda.device_count() > 0 else "cpu".
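A minimal sketch of that check-then-move pattern, assuming a standard PyTorch training loop. In PyTorch, modules and tensors are moved with .to(device). The names pick_device and train_one_epoch, and the model, loader, optimizer, and loss_fn arguments, are hypothetical and are not taken from this repository's code.

```python
import importlib.util


def pick_device():
    """Return "cuda" when torch can see a GPU, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # torch not installed; fall back to the CPU label
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


def train_one_epoch(model, loader, optimizer, loss_fn, device):
    """One pass over the data with model and batches on the same device.

    All arguments are placeholders for whatever the real training code uses.
    """
    model.to(device)  # move the parameters once, before the loop
    model.train()
    for inputs, targets in loader:
        # Each batch has to be moved too; a model on the GPU cannot
        # consume tensors that are still on the CPU.
        inputs = inputs.to(device)
        targets = targets.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()


device = pick_device()
print(device)
```

Note that Module.to moves a model in place, while Tensor.to returns a new tensor, which is why the batch tensors are reassigned inside the loop.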

manthey commented 1 month ago

In the same docker container, torch does use the GPU in the BatchBALD step where it prints that it is computing the sampled joint entropy. I assume that we need to add a .to('cuda') (with the same conditionals used elsewhere), but my naïve guess at where to put that for training was wrong.

manthey commented 1 month ago

Also, I see that for the other metrics, where TensorFlow is used, the GPU does get used for both training and prediction.

Leengit commented 1 month ago

Okay, so it looks like we'll need one or more well-placed invocations of .to(device). If this isn't under someone else's wing by that time, I can take a look starting August 19.

Leengit commented 1 week ago

I've added Issue https://github.com/DigitalSlideArchive/superpixel-classification/issues/22 to track computation of an optimal batch size.
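As a starting point for that batch size question, one simple heuristic is to divide the free GPU memory by an estimated per-sample footprint. The sketch below is only an illustration: the function name, the safety fraction, and the cap are assumptions, and the per-sample byte count would have to be measured for the actual model.

```python
def batch_size_for_free_memory(free_bytes, bytes_per_sample, safety=0.5, cap=1024):
    """Pick a batch size that fits in a fraction of the free memory.

    Hypothetical helper: safety leaves headroom for activations and
    allocator overhead, and cap bounds the result from above.
    """
    usable = int(free_bytes * safety)
    size = usable // bytes_per_sample
    return max(1, min(cap, size))


# Example: 8 GiB free, an assumed 32 MiB per sample, half kept as headroom.
print(batch_size_for_free_memory(8 * 1024**3, 32 * 1024**2))  # → 128
```

On a machine with a CUDA build of torch, free_bytes could come from the first element of torch.cuda.mem_get_info(), which reports free and total device memory in bytes.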