UOB-AI / UOB-AI.github.io

A repository to host our documentations website.
https://UOB-AI.github.io

use GPU Computing #60

Open arbiasoula opened 1 week ago

arbiasoula commented 1 week ago

Hi, I have a big dataset (CIFAR-10) and I would like to use GPU computing to save time, but I received this error:

[screenshot of the error]

What is the solution to save time?

asubah commented 1 week ago

Is it just a warning, or does the code crash?

BTW, you can create a venv with the package versions you need under /data/datasets; follow these steps: https://github.com/UOB-AI/UOB-AI.github.io/issues/50#issuecomment-1951700690

arbiasoula commented 1 week ago

Yes, the code crashes when trying to use the gpu partition. I need to use the gpu partition because I have a big dataset of 70,000 samples, and I have to compute a matrix of size 70,000 × 70,000.


asubah commented 1 week ago

Hi, I created a Conda env with a newer version of PyTorch. The env name is SimCLR-cuda-sm80. You can test your code in it.

arbiasoula commented 1 week ago

Thank you for your help. I tried to use it, but I got this error:

[screenshot of the error]

Could you please install the attached requirements file? requirements (1).txt

asubah commented 1 week ago

The attached requirements file contains cudatoolkit 10.1, which is very old and will not work with the A100 GPUs in the gpu partition. I recommend that you try your code and add any missing packages with pip install in the current environment.

arbiasoula commented 1 week ago

Yes, I tried to install tqdm and got this message:

[screenshot of the message]

asubah commented 1 week ago

Remove the ! mark from the command. The ! mark is used only inside Jupyter notebooks to execute shell commands.
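For context, the ! prefix tells the notebook to hand the rest of the line to the shell. In a plain Python script or terminal session, the closest equivalent is running pip through subprocess (this sketch uses --version instead of install, to stay side-effect free):

```python
import subprocess
import sys

# Equivalent of the notebook-only "!pip install tqdm", but from a plain
# Python script: invoke pip as a module of the current interpreter.
result = subprocess.run(
    [sys.executable, "-m", "pip", "--version"],
    capture_output=True,
    text=True,
)
print(result.stdout.strip())
```

In an interactive shell on the cluster, the same thing is simply `pip install tqdm` with no ! prefix.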

arbiasoula commented 1 week ago

"""

image image

""" unfortunately, i couldnt running the code torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.05 GiB. GPU

asubah commented 1 week ago

Why don't you switch to the gpu partition? This was the goal to begin with :)

asubah commented 1 week ago

BTW, unless you are going to do multi-node training, please don't allocate more than one node.

arbiasoula commented 1 week ago

OK, I will try the gpu partition with one node. I hope that resolves the problem.

arbiasoula commented 1 week ago

Running one epoch takes a lot of time, almost 2 hours, and I have 100 epochs. How can I reduce the running time? I have used the gpu partition with one node.

asubah commented 1 week ago

According to the logs, your code was utilizing less than 1% of the GPU compute cores. This is not ideal, especially since you loaded almost 30 GB of data into the GPU RAM. Your code might have a bottleneck that is causing this issue. Can you share your code with me? You can save it somewhere under /data/datasets, and I will have a look into it.
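For reference, this kind of per-function timing can be collected with Python's built-in cProfile; a minimal sketch (the function names here are stand-ins, not your actual code):

```python
import cProfile
import io
import pstats

def slow_clustering(n):
    # stand-in for a CPU-bound step such as K-means
    total = 0
    for i in range(n):
        total += i * i
    return total

def train():
    # stand-in training loop that calls the slow step each epoch
    for _ in range(3):
        slow_clustering(100_000)

profiler = cProfile.Profile()
profiler.enable()
train()
profiler.disable()

# print the 5 most expensive calls by cumulative time
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

The cumtime column shows where the wall-clock time actually goes, which is how the clustering bottleneck below was identified.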

arbiasoula commented 1 week ago

I have shared the code on /home/nfs/datasets/arbiaData/SimCLR-CIFAR10.zip

asubah commented 1 week ago

There are multiple issues in your code, but the biggest one is that you are doing K-means clustering on the CPU, not the GPU. To benefit from GPU acceleration, it is recommended that you do all the operations on the GPU. In your case, you are training on the GPU, but then you are moving the data back to the CPU to do the clustering, and this happens every epoch. This is part of the output of cProfile:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)                                                                                                                        
   3878/1    0.119    0.000  222.787  222.787 {built-in method builtins.exec}                                                                                                                  
        1    0.005    0.005  222.787  222.787 run.py:1(<module>)                                                                                                                               
        1    0.002    0.002  217.331  217.331 run.py:54(main)
        1    0.042    0.042  213.637  213.637 simclr.py:60(train)
        1    0.000    0.000  199.859  199.859 simclr.py:117(kmeans_clustering)
        1    0.002    0.002  199.859  199.859 base.py:1457(wrapper)
        1    0.048    0.048  199.856  199.856 _kmeans.py:1453(fit)
       10  190.986   19.099  191.183   19.118 _kmeans.py:625(_kmeans_single_lloyd)

[screenshot of profiler output]

You can see from the profiler data that your training loop spends most of its time (133 s out of 145 s) doing the clustering. I suggest that you find a GPU implementation of K-means. You can check cuML or something similar.
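If installing cuML proves difficult, K-means (Lloyd's algorithm) is simple enough to write with plain array operations; a minimal NumPy sketch follows (function and variable names are illustrative). The same operations exist in PyTorch, so a torch version of this would keep the data on the GPU the whole time:

```python
import numpy as np

def kmeans(data, k, n_iter=20, seed=0):
    # Minimal Lloyd's K-means. With torch instead of numpy, the same
    # operations run on the GPU and the data never leaves the device.
    rng = np.random.default_rng(seed)
    # initialize centroids from k distinct data points
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # squared Euclidean distance of every point to every centroid: (n, k)
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # recompute each centroid; keep the old one if its cluster is empty
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# toy usage: two well-separated blobs of 10 points each
pts = np.vstack([np.zeros((10, 2)), np.full((10, 2), 5.0)])
centroids, labels = kmeans(pts, k=2)
```

Note that the broadcasting in the distance computation materializes an (n, k, d) array, so for 70,000 points you would compute distances in batches or use `torch.cdist`.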

arbiasoula commented 1 week ago

Thank you for your quick response. I will use this K-means method, and I need cuML installed with a version that matches the SimCLR-cuda-sm80 environment:

    from cuml.cluster import KMeans as cuKMeans

    def kmeans_clustering(self, data, k):
        data_np = data.cpu().numpy().astype(np.float32)
        kmeans = cuKMeans(n_clusters=k, n_init=10)
        kmeans.fit(data_np)
        centroids = torch.tensor(kmeans.cluster_centers_).to(data.device)
        labels = torch.tensor(kmeans.labels_).to(data.device)
        return centroids, labels
asubah commented 1 week ago

I installed the packages. The code snippet you posted should be faster, but it won't be optimal, since you are moving the result tensor from the GPU to the CPU and then back to the GPU to do the clustering. This round trip causes unnecessary overhead; you are better off without it, but you need to figure out a way to convert the data structures on the GPU.

asubah commented 6 days ago

In addition, your cluster_acc method can also be computed on the GPU, which will hopefully reduce the run time of your code further.
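For reference, clustering accuracy is usually computed by building a k × k confusion matrix and then picking the best one-to-one mapping between predicted clusters and true classes. A small NumPy sketch (brute-force over permutations, which is fine for small k; every step here maps directly to torch ops on the GPU — this is an illustration, not the cluster_acc from the shared code):

```python
import itertools
import numpy as np

def cluster_acc(y_true, y_pred, k):
    # Best-match clustering accuracy for small k (brute-force permutation).
    conf = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        conf[t, p] += 1
    best = 0
    # try every mapping of predicted cluster id -> true class id
    for perm in itertools.permutations(range(k)):
        best = max(best, sum(conf[perm[j], j] for j in range(k)))
    return best / len(y_true)

# toy usage: predicted cluster ids are a relabeling of the true classes
acc = cluster_acc([0, 0, 1, 1], [1, 1, 0, 0], k=2)
print(acc)
```

For larger k, `scipy.optimize.linear_sum_assignment` replaces the permutation loop with the Hungarian algorithm.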

arbiasoula commented 6 days ago

It is OK, thank you for your help.