amaiya / ktrain

ktrain is a Python library that makes deep learning and AI more accessible and easier to apply
Apache License 2.0

Reloading learner from saved predictor model files #162

Closed rcmcabral closed 4 years ago

rcmcabral commented 4 years ago

Hi. Thanks for the awesome tool! I managed to get it to work as described in the tutorials. However, I'm using BERT with a huge dataset so one epoch takes hours. On top of that, I'm using Google Colab which has time limits for GPU use. Because of this, I was hoping to save the model, reload and then call learner.fit_onecycle again to continue the training for some more epochs.

I have successfully saved the predictor files from a few epochs and I can reload them to make predictions. What I'm hoping to do now is get the learner class from them, but looking at the source code, there's no way to do this outright. I then tried loading the model file itself and building the learner by calling ktrain.get_learner() again, but ktrain.load_model() throws this error:

Unknown layer: TokenEmbedding

I've also thought about going through the entire process again up to building the model as prescribed then setting weights and getting learner:

model = text.text_classifier("bert", train_data = (xTrain, yTrain), preproc = preproc)
#Set model weights here using model.load_weights(pathToCheckpointFile)
learner = ktrain.get_learner(model, train_data = (xTrain, yTrain), batch_size = 12)

This feels kinda hackish though since I'm not using the saved model files. Will this have the same effect or am I missing something from the source code in building the learner from the predictor?

amaiya commented 4 years ago

Thanks for your comments. To re-create the Learner in order to continue training when logging back into Google Colab, you need the model and the training data at minimum, as ktrain inspects both to automate things for ease of use.

When you call predictor.save, it saves the model (in addition to the Preprocessor instance). So, assuming you saved the Predictor instance at the end of the initial training session, you can re-create the Learner instance as follows:

import ktrain
predictor = ktrain.load_predictor('/path_to_saved_predictor')
learner = ktrain.get_learner(predictor.model, train_data = (xTrain, yTrain), batch_size = 12)
# continue training here (e.g., learner.fit_onecycle)

Note that there are also the methods learner.save_model and learner.load_model, but these are intended for saving and reloading models during interactive training (so you can go back to an earlier model if you end up overfitting).
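The predictor-based round trip described above can be wrapped in two small helpers. This is only a sketch: the function names save_session and resume_session are hypothetical, the existence check on the path is an added convenience rather than anything ktrain requires, and the ktrain calls are the ones already shown in this thread (get_predictor, predictor.save, load_predictor, get_learner).

```python
import os


def save_session(learner, preproc, path):
    """Save the trained model and its Preprocessor at the end of a session."""
    import ktrain  # imported lazily so this module loads without ktrain installed

    predictor = ktrain.get_predictor(learner.model, preproc)
    predictor.save(path)
    return predictor


def resume_session(path, train_data, val_data=None, batch_size=12):
    """Rebuild a Learner from a saved Predictor so training can continue."""
    if not os.path.exists(path):
        # Fail early with a clear message instead of a deep loading error.
        raise FileNotFoundError(f"No saved predictor found at {path!r}")

    import ktrain  # lazy import, see above

    predictor = ktrain.load_predictor(path)
    # The training data must still be supplied: it is not stored with the predictor.
    return ktrain.get_learner(
        predictor.model,
        train_data=train_data,
        val_data=val_data,
        batch_size=batch_size,
    )
```

In a new Colab session you would reload the training data first, call resume_session with the saved path and that data, and then call learner.fit_onecycle on the returned learner as before.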

Hope this helps.

P.S. The ktrain.load_model function (as opposed to learner.load_model) is actually a reference to the load_model function in Keras (which is used internally) and will probably be removed from ktrain namespace in future versions of ktrain.

You might also consider using DistilBert, which often has nearly the same performance but half the parameters. It is available through either the text_classifier API or the Transformer API.

rcmcabral commented 4 years ago

Thanks for the prompt reply @amaiya ! Works like a charm! I guess I got so lost looking for a load_model function that would accept the saved tf_model.h5 file that I didn't notice predictor.model is exposed.

Also, thanks for the suggestions! Will look into them after!

Looking forward to future, leaner versions. Thank you for your work!

msclar commented 4 years ago

I am training a model using fasttext in Colab following this response but I'm getting an error. I saved the predictor like this:

predictor = ktrain.get_predictor(learner.model, preproc)
predictor.save(path)

Loaded it again:

predictor = ktrain.load_predictor(path)
learner = ktrain.get_learner(predictor.model, train_data = train, val_data = val, batch_size = 12)

And when I tried to train another cycle it failed. More precisely, upon running learner.fit_onecycle(2e-4, 1, class_weight=class_weights) I got the following error:

begin training using onecycle policy with max lr of 0.0002...
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-54-93fc30f9a17a> in <module>()
----> 1 learner.fit_onecycle(2e-4, 1, class_weight=class_weights)

18 frames
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py in wrapper(*args, **kwargs)
    235       except Exception as e:  # pylint:disable=broad-except
    236         if hasattr(e, 'ag_error_metadata'):
--> 237           raise e.ag_error_metadata.to_exception(e)
    238         else:
    239           raise

ValueError: in converted code:

    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py:677 map_fn
        batch_size=None)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py:2410 _standardize_tensors
        exception_prefix='input')
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_utils.py:510 standardize_input_data
        'for each key in: ' + str(names))

    ValueError: No data provided for "embedding_4_input". Need data for each key in: ['embedding_4_input']

Am I failing to follow the response correctly? What could the error be? Thanks in advance!

amaiya commented 4 years ago

Hi @msclar I suspect that it is being caused by the way you're loading your data. The fasttext model is not a pretrained model like BERT or DistilBert. As a result, instead of using a preset vocabulary, it learns the vocabulary from the training data. The embedding layers of the fasttext model are configured based on this original learned vocabulary. If you reload a different training set from scratch, it will have a new vocabulary which will confuse the embedding layer.

If you follow the same steps you provided above but continue training using the same original training set (or a training set preprocessed using the same tokenizer learned from the original training set), the error does not occur. It will also be avoided if you use a pretrained model like BERT or DistilBert.
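To make the vocabulary point concrete, here is a toy illustration in plain Python. It is not ktrain's tokenizer or the fasttext model: build_vocab and encode are hypothetical helpers that simply assign token ids in order of first appearance, which is enough to show why ids learned from one training set do not line up with an embedding layer sized for another.

```python
def build_vocab(texts):
    """Assign each distinct token an id in order of first appearance.

    Id 0 is reserved for tokens that are not in the vocabulary.
    """
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def encode(text, vocab):
    """Map a sentence to a list of token ids, using 0 for unknown tokens."""
    return [vocab.get(token, 0) for token in text.lower().split()]


original_train = ["the movie was great", "the plot was dull"]
reloaded_train = ["dull plot", "great movie", "the acting was fine"]

vocab_a = build_vocab(original_train)  # what the embedding layer was built for
vocab_b = build_vocab(reloaded_train)  # what a fresh preprocessing run learns

sentence = "the movie was dull"
ids_a = encode(sentence, vocab_a)
ids_b = encode(sentence, vocab_b)
print(ids_a)  # [1, 2, 3, 6]
print(ids_b)  # [5, 4, 7, 1]

# An embedding sized for the original vocabulary has rows 0..len(vocab_a).
embedding_rows = len(vocab_a) + 1
print(max(ids_b) >= embedding_rows)  # True: id 7 has no row in a 7-row table
```

The same sentence maps to different ids under the two vocabularies, and one id falls outside the range the original embedding table covers, which is why the training data should be preprocessed with the originally learned tokenizer.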

msclar commented 4 years ago

I was caught by the downsides of using a notebook!! I did load the dataset fixing the random_state to get the same training set, but the variable train was referring to a dataset processed with DistilBERT in another part of my notebook.

Thank you for the swift reply, I only realized the bug because of your reply!

sathish331977 commented 4 years ago

While training for NER, my system errored at the 10th epoch and training stopped. No specific error was written in the logs. I have enabled checkpointing and have the hdf5 files written for each epoch. I tried to load the 10th epoch file using the following line of code:

learner = ktrain.load_model('../models/checkpoints/weights-10.hdf5')

I received the following error:

Traceback (most recent call last):
  File "/home/user1234/miniconda3/envs/ktrain/lib/python3.7/site-packages/tensorflow_core/python/keras/saving/hdf5_format.py", line 165, in load_model_from_hdf5
    raise ValueError('No model found in config file.')
ValueError: No model found in config file.

While the checkpoints are written, they do not include the "model_config" that the above code is trying to load, resulting in the error. Is there a way to resume training from where it last stopped? There is no final model file written and I don't have the .preproc file.

amaiya commented 4 years ago

@sathish331977 : The checkpoint_folder argument saves only the weights of the model after each epoch, so use load_weights:

# recreate model from scratch
model = txt.sequence_tagger(...)
# load checkpoint weights into model
model.load_weights('../models/checkpoints/weights-10.hdf5')
# recreate learner
learner = ktrain.get_learner(model, ...)
# continue training here
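
When a run dies partway through, the first step is finding which checkpoint to pass to load_weights. The sketch below picks the highest-numbered file in a checkpoint folder. It assumes file names of the form weights-10.hdf5, as in the path above, optionally followed by a loss value such as weights-10-0.35.hdf5; the helper name latest_checkpoint is hypothetical and your folder may use a different naming scheme.

```python
import re
from pathlib import Path

# Assumed naming scheme: weights-<epoch>.hdf5 or weights-<epoch>-<loss>.hdf5
CHECKPOINT_PATTERN = re.compile(r"weights-(\d+)(?:-\d+(?:\.\d+)?)?\.hdf5")


def latest_checkpoint(folder):
    """Return (epoch, path) for the highest-numbered weights file, or None."""
    found = []
    for path in Path(folder).iterdir():
        match = CHECKPOINT_PATTERN.fullmatch(path.name)
        if match:
            # Compare epochs as integers so that epoch 10 ranks above epoch 9.
            found.append((int(match.group(1)), path))
    return max(found, key=lambda item: item[0], default=None)
```

The returned path can then be passed to model.load_weights after the model has been recreated, as in the steps above.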

sirisha-8 commented 4 years ago

@amaiya @sathish331977 Hi,

Could you please look into this issue.

learner = ktrain.get_learner(predictor.model, train_data=trn, val_data=val, batch_size=128)
learner.fit(0.005, 1, cycle_len=20, checkpoint_folder='training_data_new/after_30')

With this, I am getting the following error:

`/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py in wrapper(*args, **kwargs)
    971       except Exception as e:  # pylint:disable=broad-except
    972         if hasattr(e, "ag_error_metadata"):
--> 973           raise e.ag_error_metadata.to_exception(e)
    974         else:
    975           raise

AssertionError: in user code:

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:806 train_function  *
    return step_function(self, iterator)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:796 step_function  **
    outputs = model.distribute_strategy.run(run_step, args=(data,))
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:1211 run
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2585 call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2945 _call_for_each_replica
    return fn(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:789 run_step  **
    outputs = model.train_step(data)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:747 train_step
    y_pred = self(x, training=True)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py:985 __call__
    outputs = call_fn(inputs, *args, **kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/functional.py:386 call
    inputs, training=training, mask=mask)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/functional.py:517 _run_internal_graph
    assert x_id in tensor_dict, 'Could not compute output ' + str(x)

AssertionError: Could not compute output Tensor("dense_1/truediv:0", shape=(None, None, 66), dtype=float32)`

I am loading the model using ktrain.load_predictor, then passing the already-trained predictor.model to get the learner object. When I continue training, I get the error above. Could you please look into this?

amaiya commented 4 years ago

@sirisha-8: You haven't provided enough information, as it's not clear what model you're using, what task you're performing, or what TensorFlow version you're using, etc. I've tested this on my end with transformers-based text classification and everything works. If you still have trouble, please open a new issue with more details including a self-contained reproducible example, if possible.
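For anyone opening a new issue, a short script can collect the version details asked for here. This is a generic sketch using only the Python standard library; the function name environment_report and the default package list are assumptions chosen for illustration, and you can pass whichever package names are relevant to your setup.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("ktrain", "tensorflow", "transformers")):
    """Return a plain-text summary of Python, platform, and package versions."""
    lines = [
        f"python: {sys.version.split()[0]}",
        f"platform: {platform.platform()}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            version = "not installed"
        lines.append(f"{name}: {version}")
    return "\n".join(lines)


print(environment_report())
```

Pasting this output alongside the model type, the task, and a minimal notebook makes the problem much easier to reproduce.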

sirisha-8 commented 4 years ago

@amaiya Sorry for the lack of info. I am using the BioBERT model:

model = txt.sequence_tagger('bilstm-bert', preproc, bert_model='monologg/biobert_v1.1_pubmed')

The task I am performing is NER. The TensorFlow version is 2.1.0. I have already trained for 30 epochs using learner.fit and saved the predictor as well. Now, after loading the model using ktrain.load_predictor and continuing training with learner.fit for a further 20 epochs, I get the above error. I'm not sure where I went wrong. Could you please look into this issue? I am happy to provide further details.

amaiya commented 4 years ago

@sirisha-8 I wasn't able to reproduce this issue and everything works fine for me. You'll need to provide a self-contained, reproducible example on Google Colab.