arbiasoula opened this issue 1 week ago
Is it just a warning, or does the code crash?
BTW, you can create a venv with the package versions you need under /data/datasets; follow these steps:
https://github.com/UOB-AI/UOB-AI.github.io/issues/50#issuecomment-1951700690
Yes, the code crashes when trying to use the GPU partition. I need to use the GPU partition because I have a big dataset of size 70,000 and I have to calculate a matrix of size 70,000 × 70,000.
Hi, I created a Conda env with a newer version of PyTorch. The env name is SimCLR-cuda-sm80. You can test your code in it.
Thank you for your help. I tried to use it but I got this error: """
""" Could you please install the attached requirements file, requirements (1).txt?
The attached requirements file contains cudatoolkit 10.1, which is very old and will not work with the A100 GPUs in the gpu partition. I recommend that you try your code and add any missing packages using pip install in the current environment.
Yes, I have tried to install tqdm and I got this message: """
"""
Remove the ! mark from the command. The ! mark is only used inside Jupyter notebooks to execute shell commands.
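To illustrate (tqdm here is just the package from your message):

```python
# Inside a Jupyter notebook cell, the leading "!" tells IPython to run a shell command:
#   !pip install tqdm
# In a plain terminal session on the cluster, run the same command without the "!":
#   pip install tqdm
```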
"""
""" unfortunately, i couldnt running the code torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.05 GiB. GPU
Why don't you switch to the gpu partition? This was the goal to begin with :)
BTW, unless you are going to use multi-node training, please don't allocate more than one node.
OK, I will try the GPU partition with one node. I hope the problem will be resolved.
Running one epoch takes a lot of time, almost 2 hours, and I have 100 epochs. How can I reduce the running time of the code? I have used the GPU partition with one node.
According to the logs, your code was utilizing less than 1% of the GPU compute cores. This is not ideal, especially since you loaded almost 30 GB of data into the GPU RAM. Your code might have some bottleneck that is causing this issue. Can you share your code with me? You can save it somewhere under /data/datasets, and I will have a look into it.
I have shared the code at /home/nfs/datasets/arbiaData/SimCLR-CIFAR10.zip
There are multiple issues in your code, but the biggest one is that you are doing Kmeans clustering on the CPU, not the GPU.
To benefit from GPU acceleration, it is recommended that you do all the operations on the GPU. In your case, you are training on the GPU, but then you are moving the data back to the CPU to do clustering, and this happens every epoch.
This is part of the output of cProfile:
```
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
3878/1    0.119    0.000  222.787  222.787  {built-in method builtins.exec}
     1    0.005    0.005  222.787  222.787  run.py:1(<module>)
     1    0.002    0.002  217.331  217.331  run.py:54(main)
     1    0.042    0.042  213.637  213.637  simclr.py:60(train)
     1    0.000    0.000  199.859  199.859  simclr.py:117(kmeans_clustering)
     1    0.002    0.002  199.859  199.859  base.py:1457(wrapper)
     1    0.048    0.048  199.856  199.856  _kmeans.py:1453(fit)
    10  190.986   19.099  191.183   19.118  _kmeans.py:625(_kmeans_single_lloyd)
```
You can see from the profiler data that your training loop spends most of its time (about 200 s out of the 214 s spent in train) doing the clustering. I suggest that you find a GPU implementation of Kmeans. You can check cuML or something similar.
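For reference, profiler data like the above can be collected with Python's built-in cProfile module. A minimal sketch, assuming run.py exposes the main() entry point shown in the profile:

```python
import cProfile
import pstats

from run import main  # the training entry point (run.py:54 in the profile above)

# Profile one full run and print the 10 most expensive calls by cumulative time,
# which produces columns like the ones shown above.
profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```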
Thank you for your quick response. I will use this kmeans method, and I need to install cuML with a version that matches the environment SimCLR-cuda-sm80:

```python
from cuml.cluster import KMeans as cuKMeans

def kmeans_clustering(self, data, k):
    data_np = data.cpu().numpy().astype(np.float32)
    kmeans = cuKMeans(n_clusters=k, n_init=10)
    kmeans.fit(data_np)
    centroids = torch.tensor(kmeans.cluster_centers_).to(data.device)
    labels = torch.tensor(kmeans.labels_).to(data.device)
    return centroids, labels
```
I installed the packages. The code snippet you posted should be faster, but it won't be optimal, since you are moving the result tensor from the GPU to the CPU and then moving it back to the GPU to do the clustering. This round trip causes unnecessary overhead; you are better off without it, but you need to figure out a way to convert the data structures on the GPU.
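One way to avoid the round trip, as a sketch: hand cuML a CuPy view of the tensor that shares the same GPU memory, instead of a NumPy copy. This assumes CuPy is installed alongside cuML in SimCLR-cuda-sm80 and that data is a CUDA tensor; the conversions rely on standard PyTorch/CuPy interoperability, so treat it as a starting point rather than the exact code to drop in:

```python
import cupy as cp
import torch
from cuml.cluster import KMeans as cuKMeans

def kmeans_clustering(self, data, k):
    # Zero-copy view of the CUDA tensor as a CuPy array (no transfer to host RAM).
    data_cp = cp.asarray(data.detach())
    kmeans = cuKMeans(n_clusters=k, n_init=10)
    kmeans.fit(data_cp)
    # Bring the results back as torch tensors on the same GPU device.
    centroids = torch.as_tensor(cp.asarray(kmeans.cluster_centers_), device=data.device)
    labels = torch.as_tensor(cp.asarray(kmeans.labels_), device=data.device)
    return centroids, labels
```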
In addition, your cluster_acc method can also be computed on the GPU, which will hopefully reduce the run time of your code further.
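I have not looked at your cluster_acc in detail, but the usual formulation builds a confusion matrix between cluster assignments and ground-truth labels and then solves a small assignment problem. A sketch of doing the heavy counting on the GPU, assuming y_true and y_pred are integer CUDA tensors of the same length:

```python
import torch
from scipy.optimize import linear_sum_assignment

def cluster_acc(y_true, y_pred):
    # w[i, j] = number of samples put in cluster i whose true label is j.
    # The counting over all 70,000 samples runs on the GPU; only the tiny
    # k x k matrix is moved to the CPU for the matching step.
    k = int(max(y_true.max(), y_pred.max()).item()) + 1
    w = torch.bincount(y_pred * k + y_true, minlength=k * k).reshape(k, k)
    w_cpu = w.cpu().numpy()
    # Hungarian matching between clusters and true labels (cheap on such a small matrix).
    row, col = linear_sum_assignment(w_cpu, maximize=True)
    return w_cpu[row, col].sum() / y_true.numel()
```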
It is OK, thank you for your help.
Hi, I have a big dataset, CIFAR-10, and I would like to use GPU computing to save time, but I received this error: """
""" What is the solution to reduce the running time?