barthelemymp / TULIP-TCR

Segmentation fault #1

Open shashidhar22 opened 10 months ago

shashidhar22 commented 10 months ago

I am trying to run the run_full_learning.sh script to train the full TULIP-TCR model. I am running into a segmentation fault following these logging messages:

2023-11-14 15:58:54.621704: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-11-14 15:58:54.666307: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-14 15:58:54.666353: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-14 15:58:54.666379: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-14 15:58:54.674214: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
/home/sravisha/.local/lib/python3.9/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.2
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
15:59:03 INFO: This is wandb-osh v1.2.0 using communication directory /home/sravisha/.wandb_osh_command_dir
loading hyperparameter
Using device: cuda
Using pad_token, but it is not set yet.
Using sep_token, but it is not set yet.
Using cls_token, but it is not set yet.
Using eos_token, but it is not set yet.
Using mask_token, but it is not set yet.
Loading models ..
self.pad_token_id 1
self.pad_token_id 1
self.pad_token_id 1
The model has 10,994,236 trainable parameters
Segmentation fault (core dumped)

Is this just a memory issue? Or is there some other explanation?

For context, I am trying to run the code on a Linux node with 36 cores and 650 GB of RAM. Below is the nvidia-smi output:

Tue Nov 14 16:04:44 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:3B:00.0 Off |                  N/A |
| 30%   30C    P8              14W / 250W |      3MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Please let me know if you need any additional information.
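
I also notice the SciPy warning above about NumPy 1.26.2 being newer than the <1.25.0 that this SciPy build expects; I am not sure whether that mismatch could be related to the crash. In the meantime, here is a minimal sketch of how I could capture a Python-level traceback for the segfault (assuming full_learning.py is the entry point and I can edit it; running it with python -X faulthandler should be equivalent):

import faulthandler

# Print the Python stack of all threads when the interpreter receives a
# fatal signal (e.g. SIGSEGV), so the crash site shows up in the log.
faulthandler.enable()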

barthelemymp commented 10 months ago

Hello,

Not sure if that is the problem, but you should adapt the SLURM commands in the .sh script to your own machine/cluster. Maybe try launching the .py directly in an interactive shell to see whether the SLURM setup is the issue:

python full_learning.py \
    --train_dir ../mydatapath/dataNew/Full_train_pretune_mhcX_2.csv \
    --test_dir ../mydatapath/dataNew/VDJ_test_2.csv \
    --modelconfig configs/shallow.config.json \
    --save multiTCR_s_flex
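
Before launching, a quick check like the sketch below (just a suggestion, assuming NumPy, SciPy and PyTorch are the relevant packages in the environment the script uses, as the log above suggests) can confirm the installed versions and that CUDA is visible from an interactive session:

import numpy
import scipy
import torch

# Sanity-check the environment used to launch full_learning.py.
print("numpy:", numpy.__version__)   # the SciPy warning expects NumPy <1.25.0
print("scipy:", scipy.__version__)
print("torch:", torch.__version__, "CUDA available:", torch.cuda.is_available())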