bioinfomaticsCSU / deepsignal

Detecting methylation using signal-level features from Nanopore sequencing reads
GNU General Public License v3.0

Warning and missing checkpoint with model training on GPU #41

Closed pterzian closed 4 years ago

pterzian commented 4 years ago

Hi Peng,

So I have a couple of new questions regarding some issues I had with model training on GPU.

2020-03-29 17:42:04.757305: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.87GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

So I guess it's a memory issue? Do you know if it affects computing speed or anything else?

The training process was stopped after 2 days (I work on a Slurm environment) during epoch 3. The issue is that I only have checkpoints for epochs 0 and 1. Do you think this is related to the memory issue?

Thanks a lot for your help!

Paul

PengNi commented 4 years ago

Hi Paul,

  1. The first warning doesn't affect the training result. It just means that if more GPU memory were available, the training process could be faster.

  2. The checkpoint numbering starts at 0, but the epoch counter in the log starts at 1, so checkpoint4 is the result of the fifth epoch. Sorry for the confusion.

Best, Peng

pterzian commented 4 years ago

Ok this I understand.

What I still don't understand, though, is why the training "finished" at epoch 5. As I was saying, I am using the default parameters (which is a maximum of 10 epochs, I believe). Is the training process able to stop itself even if it has not reached the maximum number of epochs? For instance, does it check the output statistics to test whether the model will or will not improve with more training?

Best, Paul

PengNi commented 4 years ago

Paul, we compare the valid accuracy of the current epoch with the accuracy of the last epoch. If the current epoch's accuracy is lower than the accuracy of the last epoch, we stop the process.
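
A minimal, runnable sketch of that stopping rule applied to a list of per-epoch valid accuracies (the values and names are illustrative, not the actual deepsignal training loop):

```python
# Sketch of the stopping rule: stop once the valid accuracy of the current
# epoch drops below the accuracy of the previous epoch.
valid_accuracies = [0.80, 0.82, 0.83, 0.81, 0.84]  # illustrative values

last_accuracy = float("-inf")
for epoch, accuracy in enumerate(valid_accuracies):
    if accuracy < last_accuracy:
        print(f"stopping after epoch {epoch}: {accuracy:.3f} < {last_accuracy:.3f}")
        break
    last_accuracy = accuracy
```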

Best, Peng

pterzian commented 4 years ago

Hi Peng,

Once again my training has completed, but it is missing the checkpoints for epochs 2, 3 and 4. It seems like it stopped saving after the second one. Is there any way to use a model without its checkpoint? I would like to call modifications with this model.

These are the last lines of stdout:

================ epoch 4 best accuracy: 0.822, best accuracy: 0.828
training finished, costs 269176.9 seconds..

I also have this message in stderr (but I guess it is more related to the end of training):

/usr/local/bioinfo/src/DeepSignal/deepsignal-0.1.6-gpu/deepsignal-0.1.6-gpu_venv/lib/python3.6/site-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

Thanks !

Paul

PengNi commented 4 years ago

Hi Paul, it seems that the training process has completed with no error. I think it is OK to use /path/to/bn_17.sn_360.epoch_4.ckpt (the files for the last trained epoch) as the model path. There is only one checkpoint file named "checkpoint", so I don't think any checkpoint is missing.

The precision warning is caused by a zero division during the calculation of precision in validation. It doesn't affect the training result.
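
For reference, that warning can be silenced with sklearn's zero_division argument; a minimal standalone example (not a change made inside deepsignal):

```python
from sklearn.metrics import precision_score

y_true = [1, 1, 0, 0]
y_pred = [0, 0, 0, 0]  # no positive predictions -> precision is ill-defined
print(precision_score(y_true, y_pred, zero_division=0))  # prints 0.0, no warning
```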

Best, Peng

pterzian commented 4 years ago

Sorry for not being clear: there is no bn_17.sn_360.epoch_4.ckpt in the model directory.

And this is the whole content of the "checkpoint" file:

model_checkpoint_path: "bn_17.sn_360.epoch_1.ckpt"
all_model_checkpoint_paths: "bn_17.sn_360.epoch_0.ckpt"
all_model_checkpoint_paths: "bn_17.sn_360.epoch_1.ckpt"

Best, Paul

PengNi commented 4 years ago

In this case, I think that during training, epochs 3, 4 and 5 all did not reach a better valid accuracy than epoch 2, so the model files of epochs 3, 4 and 5 were not saved.
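
In other words, a checkpoint is only written when an epoch improves on the best valid accuracy seen so far; a rough sketch of that behaviour (hypothetical helper and illustrative values, not the actual deepsignal code):

```python
# Sketch: only write a checkpoint when the valid accuracy improves on the
# best value seen so far (hypothetical helper, not the actual deepsignal code).
def save_checkpoint(epoch):
    print(f"saving bn_17.sn_360.epoch_{epoch}.ckpt")

valid_accuracies = [0.820, 0.828, 0.823, 0.826, 0.822]  # illustrative values

best_accuracy = float("-inf")
for epoch, accuracy in enumerate(valid_accuracies):
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        save_checkpoint(epoch)  # with these values, only epochs 0 and 1 are saved
```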

Best, Peng

pterzian commented 4 years ago

OK, thanks. I think I understand what happened looking at these lines:

================ epoch 0 best accuracy: 0.820, best accuracy: 0.820
================ epoch 1 best accuracy: 0.828, best accuracy: 0.828
================ epoch 2 best accuracy: 0.823, best accuracy: 0.828
================ epoch 3 best accuracy: 0.826, best accuracy: 0.828
================ epoch 4 best accuracy: 0.822, best accuracy: 0.828

I am just surprised the training didn't stop after epoch 2.

I am also surprised by this low accuracy; I would have thought that training on 20M samples would improve accuracy (0.828 with 20M samples vs 0.835 with 10M samples). I am thinking of doing some filtering on sample selection. Would you have any advice for this, or for improving training in general?

Best, Paul

PengNi commented 4 years ago

Hi Paul, it is because we set an argument --min_epoch_num (default 5) to prevent the training process from stopping too early.
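
Conceptually, the guard just skips the early-stop check until enough epochs have run; a minimal sketch of the combined condition (illustrative names and values, not the actual deepsignal code):

```python
# Sketch: the early-stop comparison only takes effect once at least
# min_epoch_num epochs have been trained.
min_epoch_num = 5
valid_accuracies = [0.820, 0.828, 0.823, 0.826, 0.822]  # illustrative values

last_accuracy = float("-inf")
for epoch, accuracy in enumerate(valid_accuracies):
    if epoch + 1 >= min_epoch_num and accuracy < last_accuracy:
        break  # stopping is only allowed from the min_epoch_num-th epoch on
    last_accuracy = accuracy
```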

I am not sure what your motif is or how you select your training samples. In our tests, keeping the kmer distribution the same between positive and negative samples may improve the accuracy. You can also try using a shorter --kmer_len: 15, 13, or even 11.
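
A rough sketch of one way to balance the kmer distribution between the two classes, by downsampling each kmer to the smaller of its positive/negative counts (the file names are placeholders, and the tab-separated column index for the kmer is an assumption to adjust to your own feature files):

```python
# Downsample positive/negative feature lines so each kmer appears equally
# often in both classes. Assumes tab-separated lines with the kmer in the
# 7th column (index 6); adjust kmer_col to your own files.
import random
from collections import defaultdict

def load_by_kmer(path, kmer_col=6):
    by_kmer = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            by_kmer[line.rstrip("\n").split("\t")[kmer_col]].append(line)
    return by_kmer

pos = load_by_kmer("pos_features.tsv")  # placeholder file names
neg = load_by_kmer("neg_features.tsv")

balanced = []
for kmer in set(pos) & set(neg):
    n = min(len(pos[kmer]), len(neg[kmer]))
    balanced += random.sample(pos[kmer], n) + random.sample(neg[kmer], n)

random.shuffle(balanced)
with open("balanced_features.tsv", "w") as out:
    out.writelines(balanced)
```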

Best, Peng

pterzian commented 4 years ago

Hi Peng, sorry for the late answer.

Actually I am trying to train CpG models for different animal species that are more or less close to human. For now I am using fully methylated/unmethylated datasets, but I plan to test the bisulfite approach.

What I would like to do now is make a small reproducible example. I selected the 10 most covered kmers and made a half methylated, half unmethylated training dataset of exactly 51200 samples and a 10k-sample validation dataset. Unfortunately I can't get more than 0.7 valid accuracy.

Would it be possible to train on an even smaller dataset (like only one kmer) and still get a high accuracy? Along the same lines, do you think we could try generating random features for training?

Also, could you tell me more about the distribution of kmers in your human HX1 training file of 20M samples? Did you try to train on as many kmers as possible, or did you control their coverage?

Lastly, I tried to plot the nucleotide signal distribution for some kmers, like the supplementary data boxplots of your publication. Could you confirm that these boxplots are made from the mean signal values that I can find in the extracted features files (8th field)?

By the way, some of the "side scripts" in deepsignal/scripts seem pretty useful; some documentation would be welcome! I feel evaluate_mods_call.py would be useful with my fully methylated/unmethylated data.

Sorry for posting so many questions in one message!

Best,

Paul

PengNi commented 4 years ago

Hi Paul,

Thank you for your interest in deepsignal.

In my opinion, the training dataset should include as many kmers as possible, as well as enough, balanced samples for each kmer, to get a better model. I don't think training on samples from only part of the kmers can give satisfactory performance on the other kmers.

We used randomly selected samples, drawn from all samples of fully methylated and unmethylated sites, to train the model for HX1. Using controlled samples of kmers may be a better idea, but we did not try it.

We used tombo to draw the boxplots in the supplementary. Using the mean signals from the feature file can give similar plots.
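
For example, a quick matplotlib sketch of per-position boxplots built from the 8th field (comma-separated per-base signal means); the file name, the kmer column index and the target kmer below are placeholders:

```python
# Per-position boxplots of the signal means stored in the 8th field of an
# extracted-features file, for one kmer of interest. File name, column
# indices and kmer are placeholders to adapt to your own data.
import matplotlib.pyplot as plt

target_kmer = "AAAAAAAACGAAAAAAA"  # placeholder 17-mer
per_position = []

with open("features.tsv") as fh:
    for line in fh:
        fields = line.rstrip("\n").split("\t")
        if fields[6] != target_kmer:                      # 7th field: kmer (assumed)
            continue
        means = [float(x) for x in fields[7].split(",")]  # 8th field: signal means
        if not per_position:
            per_position = [[] for _ in means]
        for i, v in enumerate(means):
            per_position[i].append(v)

if per_position:
    plt.boxplot(per_position, labels=list(target_kmer))
    plt.xlabel("position in kmer")
    plt.ylabel("mean signal")
    plt.savefig("kmer_signal_boxplots.png")
```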

"evaluate_mods_call.py" is used to validate. To calculate accuracy and other metrics, we first select both methylated and unmethylated sites with high confidence, based on bisulfite sequencing or methyltransferase-treated data. The selected sites have clear labels at read level, thus can be used to evaluate. the details are in section 2.1.2 and 2.2.7 of our paper. The input of this script are two files: one is the deepsignal result of all unmethylated sites, another is the deepsignal result of all methylated sites.

Sorry for the late response.

Best, Peng

pterzian commented 4 years ago

Thanks for your answer Peng,

I will try to better understand how the neural network works with these inputs (extracted features) so I can try to generate fake data for testing; I might open new issues with questions.

Meanwhile, and related to #3 and #22, I also have this thread problem, which is actually making CPU methylation calling very long (more than 4 days for big runs). Do you mind if I reopen an issue about it and do some tests with the code?

Best, Paul