JDACS4C-IMPROVE / Singularity

Singularity definitions that can be extended to support execution of community models.
MIT License

Test tCNNS #44

Closed: wilke closed this issue 1 year ago

wilke commented 1 year ago

Command: singularity exec --bind ${TEST_DIR}:/candle_data_dir build/tCNNS.sif train.sh 5 /candle_data_dir --epochs 1
Status: In progress
Depends on:

Output:

--epochs is not a file
CMD = python /usr/local/tCNNS-Project/tcnns_baseline_tensorflow.py --epochs 1
using CUDA_VISIBLE_DEVICES 5
using CANDLE_DATA_DIR /candle_data_dir
using CANDLE_CONFIG
running command python /usr/local/tCNNS-Project/candle_data_download.py
Importing candle utils for Keras
Downloading data from https://ftp.mcs.anl.gov/pub/candle/public/improve/model_curation_data/tCNNS/tcnns_data_processed.tar.gz
3112960/3456199 [==========================>...] - ETA: 0s
Unpacking file...
running command python /usr/local/tCNNS-Project/tcnns_baseline_tensorflow.py --epochs 1
Importing candle utils for Keras
Traceback (most recent call last):
  File "/usr/local/tCNNS-Project/tcnns_baseline_tensorflow.py", line 13, in <module>
    import improve_utils
ModuleNotFoundError: No module named 'improve_utils'

adpartin commented 1 year ago

@wilke I think for the upcoming release (end of Sep, 2023), we'll use original/legacy data from papers. Because improve_utils.py is used to load our benchmark datasets, it should not be a requirement for this release.
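One way to keep improve_utils from being a hard requirement is to import it lazily and fail only when the benchmark loaders are actually needed. The sketch below is an assumption about how that could look, not the change that was made to tcnns_baseline_tensorflow.py; `optional_import` and `require_improve_utils` are hypothetical helper names.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if that module is not installed.

    Intended for top-level module names. A ModuleNotFoundError raised for a
    different module (a missing dependency of the target) is re-raised so
    that real installation problems are not hidden.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        if exc.name != module_name:
            raise
        return None


# None when running with the original/legacy data and no IMPROVE tooling.
improve_utils = optional_import("improve_utils")


def require_improve_utils():
    """Hypothetical guard: call this only on code paths that load benchmark data."""
    if improve_utils is None:
        raise RuntimeError(
            "improve_utils is needed to load the benchmark datasets "
            "but is not installed in this image"
        )
    return improve_utils
```

With this pattern the script starts without improve_utils, and the error appears only if a benchmark-data code path is reached.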

wilke commented 1 year ago

OK, please update the code, and close the issue once that is done and train.sh 4 ./tmp is working. I will then update the image and retest. Thanks.
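The retest could be scripted so the same command is used each time. This is a minimal sketch that assembles the singularity exec call from the original command in this issue; `build_train_command` and `run_smoke_test` are hypothetical names, and the image path and data directory are whatever the caller passes in.

```python
import subprocess


def build_train_command(image, host_data_dir, gpu_id, extra_args=(), use_nv=False):
    """Assemble the argument list for running train.sh inside the image.

    The bind target /candle_data_dir and the argument order (GPU id, data
    directory, then model arguments) mirror the command in this issue.
    """
    container_data_dir = "/candle_data_dir"
    cmd = ["singularity", "exec"]
    if use_nv:
        # --nv exposes the host NVIDIA driver and GPUs to the container.
        cmd.append("--nv")
    cmd += [
        "--bind", f"{host_data_dir}:{container_data_dir}",
        image,
        "train.sh", str(gpu_id), container_data_dir,
    ]
    cmd += list(extra_args)
    return cmd


def run_smoke_test(image, host_data_dir, gpu_id):
    """Run one training epoch and report whether train.sh exited with status 0."""
    cmd = build_train_command(image, host_data_dir, gpu_id, ["--epochs", "1"])
    return subprocess.run(cmd, check=False).returncode == 0
```

Passing the command as a list avoids shell quoting issues with paths that contain spaces.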

rajeeja commented 1 year ago

HPO is working on lambda now. @jonesse3, how is the Polaris run going?

jonesse3 commented 1 year ago

@rajeeja I couldn't get it to run on Polaris yet.

(2022-09-08/IMPROVE) jonesse3@polaris-login-01:~/data_dir/tCNNS/Output/EXP003> cat run_01_001_0001/model.log 
2023-09-21 22:12:20 MODEL.SH: START
2023-09-21 22:12:20 MODEL.SH: MODEL_NAME: /home/jonesse3/improve/Singularity/images/tCNNS.sif
2023-09-21 22:12:20 MODEL.SH: RUNID: run_01_001_0001
2023-09-21 22:12:20 MODEL.SH: HOST: x3106c0s19b0n0
2023-09-21 22:12:20 MODEL.SH: ADLB_RANK_SELF: 0
2023-09-21 22:12:20 MODEL.SH: ADLB_RANK_OFFSET: 0
2023-09-21 22:12:20 MODEL.SH: CUDA DEVICE: 0
2023-09-21 22:12:20 MODEL.SH: MODEL_TYPE: SINGULARITY
2023-09-21 22:12:20 MODEL.SH: source_site(): sourcing /home/jonesse3/Supervisor/workflows/common/sh/langs-app-polaris.sh

2023-09-21 22:12:20 MODEL.SH: PARAMS:
  learning_rate   0.0021497338790694197 
  batch_size      64              
  num_epochs      20              

2023-09-21 22:12:20 MODEL.SH: USING PYTHON: /grand/CSC249ADOA01/public/sfw/polaris/Miniconda-2023-06-16/bin/python3
2023-09-21 22:12:20 MODEL.SH: VERSION: Python 3.9.12

APP_PYTHONPATH:

     1  /home/jonesse3/Supervisor/workflows/common/python
     2  /home/jonesse3/Supervisor/models/OneD
     3  /home/jonesse3/Supervisor/models/Random
     4  /home/jonesse3/Supervisor/models/Comparator
     5  /home/jonesse3/Supervisor/workflows/common/ext/EQ-Py
--

PYTHONPATH:
     1  /grand/CSC249ADOA01/public/sfw/polaris/swift-t/2023-08-31/turbine/py

     2  /home/jonesse3/Supervisor/workflows/common/python
     3  /home/jonesse3/Supervisor/models/OneD
     4  /home/jonesse3/Supervisor/models/Random
     5  /home/jonesse3/Supervisor/models/Comparator
     6  /home/jonesse3/Supervisor/workflows/common/ext/EQ-Py
     7  /home/jonesse3/Supervisor/workflows/common/ext/EQ-Py
     8  /home/jonesse3/Supervisor/workflows/common/python
--

LD_LIBRARY_PATH:
     1  /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/comm_libs/nvshmeme/lib
     2  /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/comm_libs/nccl/lib
     3  /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/math_libs/lib64
     4  /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/lib
     5  /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilersextras/qd/lib
     6  /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/cudaextras/CUPTI/lib64
     7  /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/cuda/lib64
     8  /soft/compilers/cudatoolkit/cuda-11.6.2/extras/CUPTI/lib64
     9  /soft/compilers/cudatoolkit/cuda-11.6.2/lib64
    10  /soft/libraries/trt/TensorRT-8.4.3.1.Linux.x86_64-gnu.cuda-11.6.cudnn8.4/lib
    11  /soft/libraries/nccl/nccl_2.14.3-1+cuda11.6_x86_64/lib
    12  /soft/libraries/cudnn/cudnn-11.6-linux-x64-v8.4.1.50/lib
    13  /opt/cray/pe/papi/6.0.0.14/lib64
    14  /opt/cray/libfabric/1.11.0.4.125/lib64
    15  /dbhome/db2cat/sqllib/lib64
    16  /dbhome/db2cat/sqllib/lib64/gskit
    17  /dbhome/db2cat/sqllib/lib32
    18  /lus/grand/projects/CSC249ADOA01/public/sfw/polaris/R-4.2.2/lib64/R/lib
--

PYTHONHOME=

2023-09-21 22:12:21 MODEL.SH: MODEL_CMD: singularity exec --nv --bind /home/jonesse3/data_dir:/candle_data_dir /home/jonesse3/improve/Singularity/images/tCNNS.sif train.sh 0 /candle_data_dir --learning_rate 0.0021497338790694197 --batch_size 64 --num_epochs 20 --experiment_id EXP003 --run_id run_01_001_0001

--learning_rate is not a file
CMD = python /usr/local/tCNNS-Project/tcnns_baseline_tensorflow.py --learning_rate 0.0021497338790694197 --batch_size 64 --num_epochs 20 --experiment_id EXP003 --run_id run_01_001_0001
using CUDA_VISIBLE_DEVICES 0
using CANDLE_DATA_DIR /candle_data_dir
using CANDLE_CONFIG 
running command python /usr/local/tCNNS-Project/candle_data_download.py
Importing candle utils for Keras
Unpacking file...
running command python /usr/local/tCNNS-Project/tcnns_baseline_tensorflow.py --learning_rate 0.0021497338790694197 --batch_size 64 --num_epochs 20 --experiment_id EXP003 --run_id run_01_001_0001
2023-09-21 22:12:30.390667: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2023-09-21 22:12:30.408679: F tensorflow/compiler/xla/parse_flags_from_env.cc:221] Unknown flags in XLA_FLAGS: --xla_gpu_force_compilation_parallelism=1  
Perhaps you meant to specify these on the TF_XLA_FLAGS envvar?
/usr/local/bin/train.sh: line 63: 57496 Aborted                 (core dumped) CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES} CANDLE_DATA_DIR=${CANDLE_DATA_DIR} $CMD

2023-09-21 22:12:31 MODEL.SH: SINGULARITY: EXIT CODE: 134

2023-09-21 22:12:31 MODEL.SH: MODEL ERROR! (CODE=134)
2023-09-21 22:12:31 MODEL.SH: ABORTING WORKFLOW (exit 1)

wilke commented 1 year ago

Polaris is not a target.

Try lambda and make sure train.sh works first. Then use the supervisor.
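For reference when reading the Polaris log above: exit code 134 is 128 + 6, which by shell convention means the process was killed by signal 6 (SIGABRT on Linux). That matches the "Aborted (core dumped)" line that follows TensorFlow rejecting the XLA_FLAGS value. The sketch below decodes such exit codes and builds an environment with XLA_FLAGS removed. It assumes the variable is inherited from the host environment, which the log does not confirm; `describe_exit_code` and `env_without` are hypothetical helpers.

```python
import os
import signal


def describe_exit_code(code):
    """Translate a shell-style exit code into a short description.

    Shells report a process killed by signal N as exit code 128 + N.
    """
    if code > 128:
        signum = code - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            return f"killed by unknown signal {signum}"
    return "success" if code == 0 else f"exited with status {code}"


def env_without(names, base=None):
    """Return a copy of the environment with the named variables removed."""
    env = dict(os.environ if base is None else base)
    for name in names:
        env.pop(name, None)
    return env


if __name__ == "__main__":
    print(describe_exit_code(134))
    # Assumption: XLA_FLAGS comes from the host shell. Pass clean_env as env=
    # to subprocess.run([...]) when launching singularity exec.
    clean_env = env_without(["XLA_FLAGS"])
```

Singularity passes the host environment into the container by default, so removing the variable on the host side before the launch is one option; `singularity exec --cleanenv` is another, at the cost of dropping every other host variable too.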

wilke commented 1 year ago

Barely made it, but it passed the basic tests.