Merck / deepbgc

BGC Detection and Classification Using Deep Learning
https://doi.org/10.1093/nar/gkz654
MIT License

some problems while training my model #16

Closed youngDouble closed 5 years ago

youngDouble commented 5 years ago

Hello! Recently I used deepbgc in my work, but I encountered some problems while training my model. My code:

nohup deepbgc train --model deepbgc.json --output DeepBGC_antigen_model.pkl --config PFAM2VEC pfam2vec.csv Oantigen.pfam.tsv GeneSwap_Negatives.pfam.tsv &

It starts running, but after a while it seems to hang. The end of the log:

Epoch 76/1000
  - 143s - loss: 0.0028 - acc: 0.9993 - precision: 0.8799 - recall: 0.8642 - auc_roc: 0.9911
Epoch 77/1000
  - 148s - loss: 0.0023 - acc: 0.9992 - precision: 0.8969 - recall: 0.8722 - auc_roc: 0.9913

It has been stuck in this state for a long time (about 2 days; my input file is only 2 MB, containing 515 BGCs). What can I do? Thank you!

prihoda commented 5 years ago

Hi @youngDouble,

so the log output is hanging at "Epoch 77"? Since you are running with nohup, are you sure that the process is still running? It could have crashed, for example due to insufficient memory. Can you try running the training again?
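A quick way to check is to look the process up by its command line and, if it is alive, read its memory counters (a Linux-only sketch; the pattern "deepbgc train" just matches the command you ran above):

```shell
# Look up the training process started with nohup (Linux).
pid=$(pgrep -f "deepbgc train" | head -n1)
if [ -n "$pid" ]; then
    echo "running (PID $pid)"
    # Peak and current memory of the process, in kB
    grep -E 'VmPeak|VmRSS' "/proc/$pid/status"
else
    echo "not running - check nohup.out for a crash (e.g. an OOM kill)"
fi
```

If the process is gone, the tail of nohup.out (or dmesg, for an out-of-memory kill) usually shows why.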

I would also suggest running with fewer epochs, e.g. 100, since the accuracy already looks quite high at epoch 77. Also, if you want to reserve part of the training data for validation, you can set "validation_size" in your JSON file. For example, "validation_size": 0.2 validates the model on 20% of your data. Both training and validation metrics will then be shown during training.
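Setting both options together in the training JSON might look like this (a sketch only: the "fit_params" section name is an assumption based on the stock deepbgc.json, so check where these keys live in your own copy; "num_epochs" and "validation_size" are the two settings discussed here):

```json
{
  "fit_params": {
    "num_epochs": 100,
    "validation_size": 0.2
  }
}
```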

prihoda commented 5 years ago

Also, please see the updated README section on training: https://github.com/Merck/deepbgc#train-deepbgc-on-your-own-data

You can upgrade DeepBGC to version 0.1.10 to be able to provide your trained model for detection and classification like so:

deepbgc pipeline \
    mySequence.fa \
    --detector path/to/myDetector.pkl \
    --classifier path/to/myClassifier.pkl

youngDouble commented 5 years ago

Thank you for your reply. As you said, it was a problem of insufficient memory. After adjusting the parameters (validation_size and num_epochs), it works. The training process seems to use a lot of memory (more than 256 GB), so I have submitted the job to the cluster.