amaiya / ktrain

ktrain is a Python library that makes deep learning and AI more accessible and easier to apply
Apache License 2.0

How to convert old ktrain model (pre 0.21.4) to the latest ktrain without retraining #480

Closed RAbraham closed 1 year ago

RAbraham commented 1 year ago

Hi, we have an old model trained with ktrain (0.21.4). We have the .h5 file and the .preproc file. However, we no longer have the data we used to train it (and perhaps not even the training code). :(

We have now been tasked with upgrading all our libraries, including ktrain, to resolve security vulnerabilities. Is there a way to convert the above files to be compatible with the latest ktrain?

For additional context, the original model was trained with ktrain 0.21.4 + TF 2.2. I was able to bring it up to ktrain 0.25.4 and TF 2.11, but some dependencies like scikit-learn and numpy still have security vulnerabilities, so I may have to move to the latest ktrain if possible.

I did have a look at this advice: https://github.com/amaiya/ktrain/commit/e98f8ac8e147090936ab2062bc0869bd0590c144#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5

But I could not find the text mentioned. Perhaps the first question is: the file appears to be in a binary format that shows up weirdly in a normal text editor, so how do I edit it properly? Secondly, what I have is pasted below, which seems different from the instructions above (my file may be much older). Let me know if I can edit this and make things work.

�cktrain.text.preprocessor
Transformer
q )�q}q(X   cq]q(X   dateqX   originqX   otherqX   summaryqX   titleq eX   maxlenq
K@X   langqX   enqX
   multilabelq
�X   preprocess_train_calledq�X
   label_encoderqcsklearn.preprocessing.label
LabelEncoder
q)�q}q(X   classes_qcnumpy.core.multiarray
_reconstruct
qcnumpy
ndarray
qK �qCbq�qRq(KK�qcnumpy
dtype
qX   U7q���qRq(KX   <qNNNKKKtq b�C�d   a   t   e               o   r   i   g   i   n       o   t   h   e   r           s   u   m   m   a   r   y   t   i   t   l   e           q!tq"bX   _sklearn_versionq#X   0.21.3q$ubX
   model_nameq%X   distilbert-base-uncasedq&X   nameq'X
   distilbertq(X   configq)NX
   model_typeq*ctransformers.modeling_tf_distilbert
TFDistilBertForSequenceClassification
q+X   tokenizer_typeq,ctransformers.tokenization_distilbert
DistilBertTokenizer
q-X   tok_dctq.NX   max_featuresq/M'X   ngram_rangeq0KX
   batch_sizeq1NX   use_with_learnerq2�ub.

Appreciate any advice you can offer. Thanks!

amaiya commented 1 year ago

Hi @RAbraham : Yes, you should be able to edit it to successfully upgrade to the latest versions of ktrain and transformers. I would recommend an editor like vim or emacs, which can handle the binary content and makes it easier to locate what needs to be changed. When v0.26 of ktrain was released, a number of users successfully edited/converted older preproc files this way to work with newer versions of transformers and ktrain.
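Before editing, it can help to see exactly which module paths the file references. Here is a minimal sketch using Python's pickletools, which lists every module/class reference in the pickle without needing any of them to be importable; the file path tmp/predictor.preproc is a placeholder.

```python
import pickletools

# List every module/class reference (GLOBAL opcode) in the pickle without
# unpickling it, so none of the referenced modules need to exist yet.
with open("tmp/predictor.preproc", "rb") as f:  # placeholder path
    for opcode, arg, pos in pickletools.genops(f):
        if opcode.name == "GLOBAL":
            # e.g. "transformers.tokenization_distilbert DistilBertTokenizer"
            print(pos, arg)
```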

I will close this issue, but if you have problems, please don't hesitate to respond in this thread.

RAbraham commented 1 year ago

Hi @amaiya, thank you very much for your quick response. I will use vim. I was looking at the instructions and saw:

   - change `transformers.configuration_distilbert` to `transformers.models.distilbert.configuration_distilbert`
   - change `transformers.modeling_tf_auto` to `transformers.models.auto.modeling_tf_auto`
   - change `transformers.tokenization_auto` to `transformers.models.auto.tokenization_auto`

I don't see the text mentioned above in my preproc file (I've also pasted its contents above for your reference). Do I need to change anything in that case?

amaiya commented 1 year ago

That's probably because a later version of ktrain switched to TFAutoModel and AutoTokenizer, so those instructions reference module paths your older file doesn't use. But the process is the same: the overall goal is to translate the existing module locations in your .preproc file to the new module locations in the current version of transformers, and things should just work. So transformers.tokenization_distilbert in your file would be changed to transformers.models.distilbert.tokenization_distilbert (or possibly transformers.models.distilbert.tokenization_distilbert_fast):

(base) user@machine:$ pwd
/home/user/mambaforge/lib/python3.9/site-packages/transformers/models/distilbert
(base) user@machine:$ ls
configuration_distilbert.py  __init__.py  modeling_distilbert.py  modeling_flax_distilbert.py  modeling_tf_distilbert.py  __pycache__  tokenization_distilbert_fast.py  tokenization_distilbert.py
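As an alternative to hand-editing the binary file, the same translation can be done programmatically with a custom pickle.Unpickler that remaps module paths at load time and then re-saves the object. This is a sketch, not something ktrain provides: the RENAMES mapping below is an assumption based on the paths visible in the pasted dump (the sklearn.preprocessing.label module was renamed to sklearn.preprocessing._label in scikit-learn 0.22), so verify each entry against your own file and installed versions.

```python
import pickle

# Assumed old -> new module paths; adjust to what your .preproc file
# actually references (see the pickletools listing above).
RENAMES = {
    "transformers.modeling_tf_distilbert":
        "transformers.models.distilbert.modeling_tf_distilbert",
    "transformers.tokenization_distilbert":
        "transformers.models.distilbert.tokenization_distilbert",
    "sklearn.preprocessing.label": "sklearn.preprocessing._label",
}

class RenamingUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Redirect old module paths to their new locations before lookup.
        return super().find_class(RENAMES.get(module, module), name)

with open("tmp/predictor.preproc", "rb") as f:  # placeholder path
    preproc = RenamingUnpickler(f).load()

# Re-save so the file stores the new module paths going forward.
with open("tmp/predictor.preproc", "wb") as f:
    pickle.dump(preproc, f)
```

After rewriting the file, loading the predictor with ktrain.load_predictor on the directory containing the .h5 and .preproc files is a quick end-to-end check that the conversion worked.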
RAbraham commented 1 year ago

I'll try it out and let you know. Thanks!