UOB-AI / UOB-AI.github.io

A repository to host our documentations website.
https://UOB-AI.github.io

Graph model - Cora dataset #43

Closed Amalsalem closed 9 months ago

Amalsalem commented 9 months ago

I am running a model on a graph dataset, but I am getting this error message about installing some packages needed for graph plotting:

2023-09-20 23:01:14.455038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 34166 MB memory:  -> device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:21:00.0, compute capability: 8.0
2023-09-20 23:01:14.457328: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 37700 MB memory:  -> device: 1, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:81:00.0, compute capability: 8.0
dims [1433, 500]
Traceback (most recent call last):
  File "/home/nfs/20015279/DynAE_Amal_G/train_G.py", line 272, in <module>
    gae_for(args)
  File "/home/nfs/20015279/DynAE_Amal_G/train_G.py", line 260, in gae_for
    model = DynAE(batch_size=batch_size, dataset=dataset, dims=[x_cpu.shape[-1],500, 500, 2000, 10], loss_weight=loss_weight_lambda, gamma=gamma, n_clusters=n_clusters, visualisation_dir=visualisation_dir, ws=ws, hs=hs, rot=rot, scale=scale)
  File "/home/nfs/20015279/DynAE_Amal_G/DynAE_fmnist.py", line 340, in __init__
    self.encoder = encoder_constructor(self.dims, self.visualisation_dir)
  File "/home/nfs/20015279/DynAE_Amal_G/DynAE_fmnist.py", line 182, in encoder_constructor
    plot_model(encoder, show_shapes=True, show_layer_names=True, to_file=visualisation_dir + '/graph/FcEncoder.png')
  File "/home/nfs/20015279/.local/lib/python3.9/site-packages/keras/src/utils/vis_utils.py", line 464, in plot_model
    raise ImportError(message)
ImportError: You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) for plot_model to work.
Amalsalem commented 9 months ago

I still have this problem.

I tried !pip install pydot, but I still get the error.

asubah commented 9 months ago

The other package graphviz is required too. You can compile it using Spack.

Our Spack repo is located at /data/software/spack/, so you can do:

source /data/software/spack/share/spack/setup-env.sh

And then compile it. You can do that by following the Spack docs (https://spack.readthedocs.io). I will write step-by-step instructions once I get the time, probably by the end of the coming week.
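Before recompiling anything, it can help to confirm which of the two prerequisites is actually missing. The snippet below is a minimal sketch, not part of the cluster documentation: `plot_model_prerequisites` is a hypothetical helper name, and it only checks that the `pydot` module is importable and that a `dot` executable is on `PATH` for the interpreter that runs it.

```python
import importlib.util
import shutil


def plot_model_prerequisites():
    """Report whether pydot and the Graphviz `dot` binary are visible from this interpreter."""
    return {
        # The Python binding, installed with `pip install pydot`.
        "pydot": importlib.util.find_spec("pydot") is not None,
        # The Graphviz executable; `pip install pydot` does not provide it.
        # shutil.which returns the full path, or None if `dot` is not on PATH.
        "dot": shutil.which("dot"),
    }


if __name__ == "__main__":
    for name, found in plot_model_prerequisites().items():
        print(f"{name}: {found if found else 'NOT FOUND'}")
```

Run it from the same notebook kernel that fails; a shell on the login node may see a different `PATH` than the kernel does.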

Amalsalem commented 9 months ago

I did this:

git clone -c feature.manyFiles=true https://github.com/spack/spack.git
cd spack/bin
./spack install libelf

and it successfully installed libelf-0.8.13-g6qtg57wxhrintzqmppjrhdjw64chn6w.

But I think it was installed in my own environment, and now I have a space issue:

Filesystem Size Used Avail Use% Mounted on
10.240.240.3:/ifs/data/adhari/zone1/nfs 10G 6.3G 3.8G 63% /home/nfs

Now the kernel has died again.

asubah commented 9 months ago

You don't have to git clone spack and start from scratch. As I said in my previous comment, we have a repo in /data/software/spack/ with many packages already installed. You can use it and compile whatever you need on your user space or even better on your project space. For now, you can just delete the spack directory that you cloned, and the space issue should be resolved.

asubah commented 9 months ago

You can try now.

Amalsalem commented 9 months ago

I tried it, but now I have a problem in the imports that wasn't appearing before:

Traceback (most recent call last):
  File "/home/nfs/20015279/DynAE_Amal_G/train_G.py", line 15, in <module>
    import scipy.sparse as sp
ModuleNotFoundError: No module named 'scipy'
asubah commented 9 months ago

Did you try pip install scipy?

Amalsalem commented 9 months ago

Yes, but afterwards I get this message:

WARNING: Ignoring invalid distribution -rotobuf (/home/nfs/20015279/.local/lib/python3.9/site-packages)
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com/
Requirement already satisfied: scipy in /data/software/miniconda3/lib/python3.9/site-packages (1.10.1)
Requirement already satisfied: numpy<1.27.0,>=1.19.5 in /home/nfs/20015279/.local/lib/python3.9/site-packages (from scipy) (1.24.3)
WARNING: Ignoring invalid distribution -otobuf (/home/nfs/20015279/.local/lib/python3.9/site-packages)

asubah commented 9 months ago

Then please be more specific, what are you trying to run? Where are your files located? Which part of the code is giving you this error? What environment / kernel are you using?
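One way to answer the environment question is to ask the running interpreter directly. The sketch below assumes nothing about the cluster setup and uses only the standard library; `describe_module` is a hypothetical helper name. It shows which Python executable the kernel runs and where (if anywhere) it would import a package from, which makes a mismatch between pip's interpreter and the kernel's interpreter visible. The thread does not state that such a mismatch was the cause here.

```python
import importlib.util
import sys


def describe_module(name):
    """Return the file a top-level module would be imported from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # The interpreter behind the current kernel or script.
    print("interpreter:", sys.executable)
    # Where scipy resolves for this interpreter (None means the import would fail).
    print("scipy:", describe_module("scipy"))
    # The search path, in the order Python consults it.
    for entry in sys.path:
        print("  path entry:", entry)
```

Running the same lines in the notebook and in the terminal where `pip install` was executed gives two outputs that can be compared side by side.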

Amalsalem commented 9 months ago

All my files are in my space; only the results are in the datasets folder. I know which part of the code gives the error, and I sent it earlier. The kernel uses the server environment; I don't have a separate environment.

The message shown earlier was because of pydot and graphviz, but per your investigation you mentioned that it is because of Spack.

Today I noticed that I have a problem with scipy, which wasn't there before!

So is anything needed from my side? I was waiting for the steps to compile with Spack.

Regards,

asubah commented 9 months ago

> All my files are in my space; only the results are in the datasets folder. I know which part of the code gives the error, and I sent it earlier.

You know, but I don't, so please specify the file that you are trying to run and how you are trying to run it.

> The kernel uses the server environment; I don't have a separate environment.

This means you are using the base environment, or the Python 3 kernel if you are in a notebook.

> The message shown earlier was because of pydot and graphviz, but per your investigation you mentioned that it is because of Spack.

I said you should use Spack to compile graphviz, but I already did that for you to save time, and that is why I told you to try your code again.

> Today I noticed that I have a problem with scipy, which wasn't there before!
>
> So is anything needed from my side? I was waiting for the steps to compile with Spack.
>
> Regards,

This is hopefully because your code moved past the graphviz error. I can't tell for sure until I know where to look. I tried to import scipy from your home directory and it worked for me with no issues.

Amalsalem commented 9 months ago

Unfortunately it did not. The error is now in the import section, which runs before the code that gave me the graphviz error.

Amalsalem commented 9 months ago

The error is in this file (line 15), in the imports.

This is the link to my notebook:

https://hayrat.uob.edu.bh/node/gpu01/46324/lab/tree/DynAE_Amal_G/notebook.ipynb

Amalsalem commented 9 months ago

And this is the code:

import torch.nn as nn
import torch.nn.parallel
import random
import argparse
from network.resnet import resnet18, resnet34
from network.pointnet import PointNetCls
from torch.utils.data import DataLoader
import os
import numpy as np
from data.cifar10_train_val_test import CIFAR10, CIFAR100
from data.modelnet40 import ModelNet40
import torch.optim as optim
from torch.optim import lr_scheduler
import torchvision.transforms as transforms
from termcolor import cprint
from knn_utils import calc_knn_graph, calc_topo_weights_with_components_idx
from noise import noisify_with_P, noisify_cifar10_asymmetric, \
    noisify_cifar100_asymmetric, noisify_pairflip, noisify_modelnet40_asymmetric
import copy
from scipy.stats import mode
asubah commented 9 months ago

I found the code in this file ~/DynAE_Amal_G/train_G.py, and you are running it from the notebook.

I made some changes to the cluster environment, you can try again now.

Amalsalem commented 9 months ago

Thanks a lot, the problem is solved.

Now I have another issue, related to memory.
I am trying to convert from PyTorch to TensorFlow, and from what I have read this operation uses a lot of memory.

Do you have any suggestions? I tried to reduce the size and the dimensions, but the error persists. This is what I get:

2023-10-02 11:46:16.374317: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-02 11:46:16.414221: F tensorflow/tsl/platform/statusor.cc:33] Attempting to fetch value instead of handling error INTERNAL: failed initializing StreamExecutor for CUDA device ordinal 1: INTERNAL: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: 42338615296
Amalsalem commented 9 months ago

To be more specific: I save my data in the datasets folder, and this works fine. Then I load it and try to convert it to TensorFlow tensors, and this is where the problem occurs.

import numpy as np
import tensorflow as tf

# Load the .npz file
npz_file = np.load('/home/nfs/datasets/Results_20015279/epoch_49_data.npz')

# Access the arrays inside the .npz file
x_array = npz_file['x']  # Assuming 'x' is the name of the array in the .npz file
y_array = npz_file['y']  # Assuming 'y' is the name of the array in the .npz file
print(x_array)
print(y_array)

# Convert the arrays to TensorFlow tensors
x_tensor = tf.convert_to_tensor(x_array, dtype=tf.float32)
y_tensor = tf.convert_to_tensor(y_array, dtype=tf.float32)
Amalsalem commented 9 months ago

I tried to check just the conversion to TensorFlow, to see whether the problem is in my code or in the environment. So I ran a simple conversion without referring to my dataset, and it gives me the same error (the kernel died and will restart).

import tensorflow as tf
import numpy as np

# Create a NumPy array
numpy_array = np.array([[1, 2, 3], [4, 5, 6]])

# Convert the NumPy array to a TensorFlow tensor
tensor = tf.convert_to_tensor(numpy_array)
asubah commented 9 months ago

First, please use code blocks to make your code / errors / terminal output easily readable :) You can read about Markdown code blocks here: https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax#quoting-code

Now about the issue, our GPU partition is having very high load currently and the GPU RAM is almost always full. So even if you try to allocate a small tensor on the GPU the code will fail. So as I mentioned before you can check if the GPU RAM is full by following the instructions in this comment: https://github.com/UOB-AI/UOB-AI.github.io/issues/37#issuecomment-1623425554
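The linked comment is not reproduced here, so the following is only a sketch of one possible check, assuming `nvidia-smi` is available on the node. It queries used and total memory per GPU and picks the device with the most free memory. The helper names `parse_gpu_memory` and `freest_gpu` are hypothetical. Setting `CUDA_VISIBLE_DEVICES` before importing TensorFlow limits which devices it tries to initialize, but it will not help if every GPU is full.

```python
import os
import subprocess

# Standard nvidia-smi query options; the output is one CSV row per GPU, values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse rows like '0, 39000, 40960' into (index, used_mib, total_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        rows.append((index, used, total))
    return rows


def freest_gpu(rows):
    """Return the index of the GPU with the most free memory, or None if rows is empty."""
    if not rows:
        return None
    return max(rows, key=lambda row: row[2] - row[1])[0]


if __name__ == "__main__":
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    rows = parse_gpu_memory(output)
    for index, used, total in rows:
        print(f"GPU {index}: {used} / {total} MiB used")
    best = freest_gpu(rows)
    if best is not None:
        # Must be set before `import tensorflow` to have any effect.
        os.environ["CUDA_VISIBLE_DEVICES"] = str(best)
```

In a notebook, the environment variable has to be set in a cell that runs before the first TensorFlow import, and the kernel must be restarted if TensorFlow was already loaded.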

One solution is to work on the Compute partition, where we have the smaller NVIDIA Tesla T4 GPUs. In this partition, you will almost always get a full GPU for yourself.

I ran the test code on the T4 and it works:

(base) [asubah@cn02 notebooks]$ cat test.py 
import tensorflow as tf
import numpy as np

# Create a NumPy array
numpy_array = np.array([[1, 2, 3], [4, 5, 6]])

# Convert the NumPy array to a TensorFlow tensor
tensor = tf.convert_to_tensor(numpy_array)
(base) [asubah@cn02 notebooks]$ python test.py 
2023-10-03 10:55:50.205731: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-03 10:55:50.333551: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-10-03 10:55:53.872147: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-03 10:55:54.585599: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 546 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:3b:00.0, compute capability: 7.5

On the A100 it fails:

(base) [asubah@gpu01 notebooks]$ cat test.py 
import tensorflow as tf
import numpy as np

# Create a NumPy array
numpy_array = np.array([[1, 2, 3], [4, 5, 6]])

# Convert the NumPy array to a TensorFlow tensor
tensor = tf.convert_to_tensor(numpy_array)
(base) [asubah@gpu01 notebooks]$ python test.py 
2023-10-03 10:56:02.635965: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-03 10:56:06.607514: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-03 10:56:07.007821: F tensorflow/tsl/platform/statusor.cc:33] Attempting to fetch value instead of handling error INTERNAL: failed initializing StreamExecutor for CUDA device ordinal 1: INTERNAL: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: 42338615296
asubah commented 9 months ago

And this one works also:

(base) [asubah@cn02 notebooks]$ cat test.py 
import numpy as np
import tensorflow as tf

# Load the .npz file
npz_file = np.load('/home/nfs/datasets/Results_20015279/epoch_49_data.npz')

# Access the arrays inside the .npz file
x_array = npz_file['x']  # Assuming 'x' is the name of the array in the .npz file
y_array = npz_file['y']  # Assuming 'y' is the name of the array in the .npz file
print(x_array)
print(y_array)

# Convert the arrays to TensorFlow tensors
x_tensor = tf.convert_to_tensor(x_array, dtype=tf.float32)
y_tensor = tf.convert_to_tensor(y_array, dtype=tf.float32)
(base) [asubah@cn02 notebooks]$ python test.py 
2023-10-03 11:03:21.152457: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-03 11:03:21.278858: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[[ 0.8118281   0.22468564  0.46013135 ...  1.5575546   0.5395082
  -2.055624  ]
 [-0.03136274 -1.41821    -1.001655   ... -0.78268284 -1.5216196
   1.4538925 ]
 [ 0.5761268  -0.5091106  -0.4394772  ... -1.7542138   0.9698049
   0.17309693]
 ...
 [-0.59655136 -1.3365344   0.70856726 ... -0.9167519  -0.27799925
   1.074443  ]
 [ 0.90005803  0.59554154  0.56071055 ...  0.6097251  -3.7520244
  -0.90183574]
 [ 1.2626469  -1.2455169   1.6501309  ... -0.12571615 -0.6929587
  -0.28050414]]
[[ 0.13224526]
 [ 0.93610305]
 [ 0.86743486]
 ...
 [ 0.47754258]
 [ 1.7060387 ]
 [-1.61119   ]]
2023-10-03 11:03:24.992788: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-03 11:03:25.709955: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 546 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:3b:00.0, compute capability: 7.5
Amalsalem commented 9 months ago

This solved the kernel-died problem.

Now I am back to the original error related to Graphviz: it does not recognize the "png" format for the model visualization.

This is the error: "dot" with args ['-Tpng', '/tmp/tmpzf362372'] returned code: 1

stdout, stderr: b'' b'Format: "png" not recognized. Use one of: canon cmap cmapx cmapx_np dot dot_json eps fig gv imap imap_np ismap json json0 mp pic plain plain-ext pov ps ps2 svg svgz tk xdot xdot1.2 xdot1.4 xdot_json\n'

This run was on the Compute partition.

Amalsalem commented 9 months ago

It happens exactly at this line, when I call plot_model:

plot_model(dynAE, show_shapes=True, show_layer_names=True, to_file=visualisation_dir + '/graph/FcDynAE.png')
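The error message itself lists the formats this Graphviz build can produce, so a fallback can be chosen from it. The snippet below is a sketch with hypothetical helper names (`supported_formats`, `pick_format`); it parses the message quoted above and picks the first preferred format that is available. As far as I know, Keras `plot_model` takes the output format from the extension of `to_file`, so changing `.png` to `.svg` may work as a stopgap. That is an assumption, not the fix that was applied in this thread.

```python
import re

# Error text copied from the message above (decoded from bytes).
MESSAGE = (
    'Format: "png" not recognized. Use one of: canon cmap cmapx cmapx_np dot '
    "dot_json eps fig gv imap imap_np ismap json json0 mp pic plain plain-ext "
    "pov ps ps2 svg svgz tk xdot xdot1.2 xdot1.4 xdot_json\n"
)


def supported_formats(dot_stderr):
    """Extract the format names from Graphviz's 'Use one of: ...' message."""
    match = re.search(r"Use one of:\s*(.*)", dot_stderr)
    return match.group(1).split() if match else []


def pick_format(formats, preferred=("png", "svg", "ps")):
    """Return the first preferred format that this Graphviz build supports, or None."""
    for fmt in preferred:
        if fmt in formats:
            return fmt
    return None


if __name__ == "__main__":
    formats = supported_formats(MESSAGE)
    print("fallback format:", pick_format(formats))
```

With the message from this thread the fallback is `svg`, since `png` is absent from the list and `svg` is present.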

asubah commented 9 months ago

You can try again now.

Amalsalem commented 9 months ago

It works now, thanks.
