wilke closed this issue 1 year ago
@wilke I think for the upcoming release (end of Sep, 2023), we'll use original/legacy data from papers. Because improve_utils.py is used to load our benchmark datasets, it should not be a requirement for this release.
OK, please update the code and close the issue once that is done and train.sh 4 ./tmp is working. I will update the image and retest. Thanks.
HPO is working on lambda now. @jonesse3, how is the Polaris run going?
@rajeeja I couldn't get it to run on Polaris yet.
(2022-09-08/IMPROVE) jonesse3@polaris-login-01:~/data_dir/tCNNS/Output/EXP003> cat run_01_001_0001/model.log
2023-09-21 22:12:20 MODEL.SH: START
2023-09-21 22:12:20 MODEL.SH: MODEL_NAME: /home/jonesse3/improve/Singularity/images/tCNNS.sif
2023-09-21 22:12:20 MODEL.SH: RUNID: run_01_001_0001
2023-09-21 22:12:20 MODEL.SH: HOST: x3106c0s19b0n0
2023-09-21 22:12:20 MODEL.SH: ADLB_RANK_SELF: 0
2023-09-21 22:12:20 MODEL.SH: ADLB_RANK_OFFSET: 0
2023-09-21 22:12:20 MODEL.SH: CUDA DEVICE: 0
2023-09-21 22:12:20 MODEL.SH: MODEL_TYPE: SINGULARITY
2023-09-21 22:12:20 MODEL.SH: source_site(): sourcing /home/jonesse3/Supervisor/workflows/common/sh/langs-app-polaris.sh
2023-09-21 22:12:20 MODEL.SH: PARAMS:
learning_rate 0.0021497338790694197
batch_size 64
num_epochs 20
2023-09-21 22:12:20 MODEL.SH: USING PYTHON: /grand/CSC249ADOA01/public/sfw/polaris/Miniconda-2023-06-16/bin/python3
2023-09-21 22:12:20 MODEL.SH: VERSION: Python 3.9.12
APP_PYTHONPATH:
1 /home/jonesse3/Supervisor/workflows/common/python
2 /home/jonesse3/Supervisor/models/OneD
3 /home/jonesse3/Supervisor/models/Random
4 /home/jonesse3/Supervisor/models/Comparator
5 /home/jonesse3/Supervisor/workflows/common/ext/EQ-Py
--
PYTHONPATH:
1 /grand/CSC249ADOA01/public/sfw/polaris/swift-t/2023-08-31/turbine/py
2 /home/jonesse3/Supervisor/workflows/common/python
3 /home/jonesse3/Supervisor/models/OneD
4 /home/jonesse3/Supervisor/models/Random
5 /home/jonesse3/Supervisor/models/Comparator
6 /home/jonesse3/Supervisor/workflows/common/ext/EQ-Py
7 /home/jonesse3/Supervisor/workflows/common/ext/EQ-Py
8 /home/jonesse3/Supervisor/workflows/common/python
--
LD_LIBRARY_PATH:
1 /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/comm_libs/nvshmeme/lib
2 /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/comm_libs/nccl/lib
3 /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/math_libs/lib64
4 /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/lib
5 /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilersextras/qd/lib
6 /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/cudaextras/CUPTI/lib64
7 /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/cuda/lib64
8 /soft/compilers/cudatoolkit/cuda-11.6.2/extras/CUPTI/lib64
9 /soft/compilers/cudatoolkit/cuda-11.6.2/lib64
10 /soft/libraries/trt/TensorRT-8.4.3.1.Linux.x86_64-gnu.cuda-11.6.cudnn8.4/lib
11 /soft/libraries/nccl/nccl_2.14.3-1+cuda11.6_x86_64/lib
12 /soft/libraries/cudnn/cudnn-11.6-linux-x64-v8.4.1.50/lib
13 /opt/cray/pe/papi/6.0.0.14/lib64
14 /opt/cray/libfabric/1.11.0.4.125/lib64
15 /dbhome/db2cat/sqllib/lib64
16 /dbhome/db2cat/sqllib/lib64/gskit
17 /dbhome/db2cat/sqllib/lib32
18 /lus/grand/projects/CSC249ADOA01/public/sfw/polaris/R-4.2.2/lib64/R/lib
--
PYTHONHOME=
2023-09-21 22:12:21 MODEL.SH: MODEL_CMD: singularity exec --nv --bind /home/jonesse3/data_dir:/candle_data_dir /home/jonesse3/improve/Singularity/images/tCNNS.sif train.sh 0 /candle_data_dir --learning_rate 0.0021497338790694197 --batch_size 64 --num_epochs 20 --experiment_id EXP003 --run_id run_01_001_0001
--learning_rate is not a file
CMD = python /usr/local/tCNNS-Project/tcnns_baseline_tensorflow.py --learning_rate 0.0021497338790694197 --batch_size 64 --num_epochs 20 --experiment_id EXP003 --run_id run_01_001_0001
using CUDA_VISIBLE_DEVICES 0
using CANDLE_DATA_DIR /candle_data_dir
using CANDLE_CONFIG
running command python /usr/local/tCNNS-Project/candle_data_download.py
Importing candle utils for Keras
Unpacking file...
running command python /usr/local/tCNNS-Project/tcnns_baseline_tensorflow.py --learning_rate 0.0021497338790694197 --batch_size 64 --num_epochs 20 --experiment_id EXP003 --run_id run_01_001_0001
2023-09-21 22:12:30.390667: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2023-09-21 22:12:30.408679: F tensorflow/compiler/xla/parse_flags_from_env.cc:221] Unknown flags in XLA_FLAGS: --xla_gpu_force_compilation_parallelism=1
Perhaps you meant to specify these on the TF_XLA_FLAGS envvar?
/usr/local/bin/train.sh: line 63: 57496 Aborted (core dumped) CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES} CANDLE_DATA_DIR=${CANDLE_DATA_DIR} $CMD
2023-09-21 22:12:31 MODEL.SH: SINGULARITY: EXIT CODE: 134
2023-09-21 22:12:31 MODEL.SH: MODEL ERROR! (CODE=134)
2023-09-21 22:12:31 MODEL.SH: ABORTING WORKFLOW (exit 1)
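A note on reading the log above: exit code 134 most likely follows the usual shell convention of 128 plus the signal number, which gives signal 6 and matches the "Aborted (core dumped)" line from train.sh. The sketch below shows only that arithmetic. The helper name `describe_exit_code` is hypothetical and is not part of Supervisor or train.sh.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit code (hypothetical helper, for illustration)."""
    if code == 0:
        return "success"
    if code > 128:
        # Shells report "terminated by signal N" as exit status 128 + N.
        signum = code - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            return f"killed by unknown signal {signum}"
    return f"exited with status {code}"


# Exit code reported by MODEL.SH in the log above.
print(describe_exit_code(134))
```

On Linux, signal 6 is the abort signal, which is consistent with TensorFlow's fatal (F) log line about the unknown XLA_FLAGS entry terminating the process.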
Polaris is not a target.
Try lambda and make sure train.sh works first. Then use the supervisor.
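When testing train.sh on its own as suggested, one thing worth ruling out is the XLA_FLAGS value seen in the Polaris log. The flag does not appear in MODEL_CMD, so it was presumably inherited from the host environment, and the TensorFlow build inside the image rejects it. The sketch below is one possible way to launch the container with that variable removed. The function names, the image path, and the data directory are assumptions for illustration, and this is not how Supervisor itself launches models.

```python
import os
import shlex


def scrubbed_env(drop=("XLA_FLAGS",), base=None):
    """Return a copy of the environment without the named variables."""
    env = dict(os.environ if base is None else base)
    for name in drop:
        env.pop(name, None)
    return env


def train_command(image, host_data_dir, device="0", extra_args=()):
    """Build the singularity command line as an argument list."""
    return [
        "singularity", "exec", "--nv",
        "--bind", f"{host_data_dir}:/candle_data_dir",
        image, "train.sh", str(device), "/candle_data_dir",
        *extra_args,
    ]


# Example values only: the image path and data directory are placeholders.
env = scrubbed_env(base={
    "PATH": "/usr/bin",
    "XLA_FLAGS": "--xla_gpu_force_compilation_parallelism=1",
})
cmd = train_command("build/tCNNS.sif", "/tmp/test_dir",
                    extra_args=["--epochs", "1"])

print(shlex.join(cmd))
print(sorted(env))
# To actually launch the run: subprocess.run(cmd, env=env, check=True)
```

The command is only printed here, so the sketch can be checked without Singularity installed. Whether removing XLA_FLAGS is enough to get tCNNS running on Polaris is untested.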
barely made it - passed basic tests
Command:
singularity exec --bind ${TEST_DIR}:/candle_data_dir build/tCNNS.sif train.sh 5 /candle_data_dir --epochs 1
Status: In progress
Depends on:
Output: