Segmentation fault while training

jairomrojas commented 5 years ago

Dear SparseEventID team, after a certain number of iterations in the training process I receive the error message "Segmentation fault (core dumped)". I am working on CentOS 7 on a cluster, and the command I use to train is:

python ~/Software_python23/SparseEventID/bin/resnet.py train --learning-rate 0.01 --file ~/sf_SBN_Wire_Pixel_Files/out_2D_train_rand.h5 --iterations 100 --compute-mode CPU

The output that appears at the end is:

Total number of trainable parameters in this network: 118218
./log/checkpoints/checkpoint
Returning none!
train Step 0 metrics: loss: 3.58, acc/label_neut: 0.0, acc/label_prot: 0.0, acc/label_cpi: 0.5, acc/label_npi: 0.0
train Step 1 metrics: loss: 3.58, acc/label_neut: 0.5, acc/label_prot: 0.0, acc/label_cpi: 0.0, acc/label_npi: 0.0 (0.074s / 0.031 IOs / 0.0042)
train Step 2 metrics: loss: 3.58, acc/label_neut: 0.0, acc/label_prot: 0.0, acc/label_cpi: 0.0, acc/label_npi: 0.0 (0.089s / 0.021 IOs / 0.0041)
train Step 3 metrics: loss: 3.42, acc/label_neut: 0.5, acc/label_prot: 0.5, acc/label_cpi: 0.5, acc/label_npi: 0.5 (0.25s / 0.1 IOs / 0.01)
train Step 4 metrics: loss: 3.58, acc/label_neut: 0.0, acc/label_prot: 0.0, acc/label_cpi: 0.0, acc/label_npi: 0.5 (0.23s / 0.13 IOs / 0.011)
train Step 5 metrics: loss: 3.58, acc/label_neut: 0.5, acc/label_prot: 0.5, acc/label_cpi: 0.0, acc/label_npi: 0.5 (0.13s / 0.024 IOs / 0.01)
train Step 6 metrics: loss: 3.54, acc/label_neut: 0.0, acc/label_prot: 0.5, acc/label_cpi: 1.0, acc/label_npi: 0.5 (0.16s / 0.024 IOs / 0.011)
train Step 7 metrics: loss: 3.57, acc/label_neut: 0.5, acc/label_prot: 0.0, acc/label_cpi: 0.0, acc/label_npi: 0.5 (0.15s / 0.024 IOs / 0.01)
train Step 8 metrics: loss: 3.49, acc/label_neut: 0.5, acc/label_prot: 0.5, acc/label_cpi: 0.5, acc/label_npi: 1.0 (0.15s / 0.024 IOs / 0.011)
train Step 9 metrics: loss: 3.58, acc/label_neut: 0.0, acc/label_prot: 0.0, acc/label_cpi: 0.5, acc/label_npi: 0.0 (0.089s / 0.024 IOs / 0.01)
train Step 10 metrics: loss: 3.58, acc/label_neut: 0.0, acc/label_prot: 0.5, acc/label_cpi: 0.0, acc/label_npi: 0.5 (0.087s / 0.021 IOs / 0.0088)
train Step 11 metrics: loss: 3.56, acc/label_neut: 0.5, acc/label_prot: 0.0, acc/label_cpi: 0.0, acc/label_npi: 0.5 (0.1s / 0.02 IOs / 0.009)
train Step 12 metrics: loss: 3.55, acc/label_neut: 0.0, acc/label_prot: 0.0, acc/label_cpi: 0.5, acc/label_npi: 0.5 (0.16s / 0.019 IOs / 0.0088)
train Step 13 metrics: loss: 3.58, acc/label_neut: 0.0, acc/label_prot: 0.0, acc/label_cpi: 0.0, acc/label_npi: 0.0 (0.057s / 0.018 IOs / 0.0085)
train Step 14 metrics: loss: 3.58, acc/label_neut: 0.0, acc/label_prot: 0.5, acc/label_cpi: 0.5, acc/label_npi: 0.0 (0.064s / 0.019 IOs / 0.0089)
train Step 15 metrics: loss: 3.58, acc/label_neut: 0.0, acc/label_prot: 0.0, acc/label_cpi: 0.0, acc/label_npi: 0.0 (0.3s / 0.18 IOs / 0.01)
train Step 16 metrics: loss: 3.58, acc/label_neut: 0.5, acc/label_prot: 0.5, acc/label_cpi: 0.0, acc/label_npi: 0.0 (0.051s / 0.018 IOs / 0.0084)
train Step 17 metrics: loss: 3.58, acc/label_neut: 0.5, acc/label_prot: 0.0, acc/label_cpi: 0.0, acc/label_npi: 0.0 (0.089s / 0.019 IOs / 0.0083)
train Step 18 metrics: loss: 3.46, acc/label_neut: 0.5, acc/label_prot: 0.0, acc/label_cpi: 0.0, acc/label_npi: 0.5 (0.33s / 0.18 IOs / 0.0082)
train Step 19 metrics: loss: 3.49, acc/label_neut: 0.5, acc/label_prot: 0.0, acc/label_cpi: 0.5, acc/label_npi: 0.5 (0.15s / 0.019 IOs / 0.0089)
train Step 20 metrics: loss: 3.58, acc/label_neut: 0.0, acc/label_prot: 0.5, acc/label_cpi: 0.5, acc/label_npi: 0.0 (0.12s / 0.019 IOs / 0.0091)
train Step 21 metrics: loss: 3.58, acc/label_neut: 0.0, acc/label_prot: 0.0, acc/label_cpi: 0.5, acc/label_npi: 0.0 (0.11s / 0.019 IOs / 0.0084)
train Step 22 metrics: loss: 3.76, acc/label_neut: 0.0, acc/label_prot: 0.5, acc/label_cpi: 1.0, acc/label_npi: 0.5 (0.18s / 0.02 IOs / 0.0085)
train Step 23 metrics: loss: 3.58, acc/label_neut: 0.0, acc/label_prot: 0.5, acc/label_cpi: 0.0, acc/label_npi: 0.0 (0.069s / 0.018 IOs / 0.0084)
Segmentation fault (core dumped)

coreyjadams commented 5 years ago

So, firstly, I think you want to say --compute-mode GPU. Otherwise everything runs on the CPU, which is not that fast.

Second, how frequently does this occur? Same time every run or does it happen after a variable number of iterations?

jairomrojas commented 5 years ago

Whenever I use the GPU, the "Segmentation fault (core dumped)" appears even before "train Step 0". With "--compute-mode CPU", the number of steps before it crashes varies from around 2 to around 30.

jairomrojas commented 5 years ago

I got the training to work on the GPU; the problem was related to dependencies and to the CUDA environment variables, which I had to modify manually. However, the core dump still appears after a variable number of training steps.
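For anyone comparing a working and a failing environment, a small snapshot of the CUDA-related variables can help. The sketch below is only illustrative: the thread does not say which variables were changed, so the names in `CUDA_ENV_VARS` are assumptions based on commonly used ones, and the helper names are hypothetical.

```python
import os

# Assumed variable names; the thread does not state which ones were modified.
CUDA_ENV_VARS = ("CUDA_HOME", "CUDA_VISIBLE_DEVICES", "LD_LIBRARY_PATH", "PATH")


def cuda_environment(environ=None, names=CUDA_ENV_VARS):
    """Return a dict of the requested variables (None when a variable is unset)."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name) for name in names}


def format_report(snapshot):
    """Render the snapshot as NAME=value lines that are easy to diff."""
    lines = []
    for name, value in snapshot.items():
        lines.append(f"{name}={value if value is not None else '<unset>'}")
    return "\n".join(lines)
```

Calling `print(format_report(cuda_environment()))` in both environments and diffing the two outputs shows which settings differ.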

jairomrojas commented 5 years ago

I tracked the error to "logits = self._net(minibatch_data['image'])" in "trainercore.py".
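One way to double-check which Python line is executing when the process dies is the standard library's `faulthandler` module, which prints the Python traceback when a fatal signal such as SIGSEGV arrives. This is a minimal sketch; the helper name `enable_crash_traceback` is hypothetical, and where it would be called inside SparseEventID is an assumption.

```python
import faulthandler
import sys


def enable_crash_traceback(stream=None):
    """Dump the Python traceback of every thread if a fatal signal is received.

    `stream` must be a real file with a file descriptor and must stay open
    for as long as the handler is enabled. Defaults to sys.stderr.
    """
    if stream is None:
        stream = sys.stderr
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


# Hypothetical usage: call this once at the top of the training script,
# before the network is built, so a later crash reports the Python frames.
# enable_crash_traceback()
```

The same behaviour is available without editing any code by starting the interpreter with `python -X faulthandler` or by setting the `PYTHONFAULTHANDLER` environment variable. The traceback only covers Python frames; the native frames inside the extension would need the core file and a debugger.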

The error "Segmentation fault (core dumped)" is still there; however, when I use "--minibatch-size 1" I can perform more training steps. The number of steps before the crash decreases as I increase this value, and with 16 or more the error message appears even before Step 0. I don't think it is related to memory, as the memory used is small compared to what I have available. In htop, VIRT reaches around 150 GB and RES around 2.5 GB before it dies, with RES increasing a bit when I increase the batch size.
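The batch-size dependence described above could be recorded systematically with a small sweep. The sketch below is an assumption-laden illustration rather than part of SparseEventID: the function names are hypothetical, and it relies only on the POSIX convention that `subprocess` reports a child killed by a signal as a negative return code.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Translate a subprocess return code into a short label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed by that signal.
        return signal.Signals(-returncode).name
    return f"exit {returncode}"


def sweep_minibatch_sizes(base_cmd, sizes, runner=subprocess.run):
    """Run base_cmd once per minibatch size and record how each run ended."""
    results = {}
    for size in sizes:
        # Append the flag used in this thread to the caller-supplied command.
        cmd = list(base_cmd) + ["--minibatch-size", str(size)]
        completed = runner(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        results[size] = describe_exit(completed.returncode)
    return results
```

Here `base_cmd` would be the training command from the first post split into a list of arguments, and `sizes` something like `[1, 2, 4, 8, 16]`. Repeating the sweep a few times would show whether the crash point is reproducible for a given size.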
