MolecularAI / aizynthtrain

Tools to train synthesis prediction models
Apache License 2.0

Several tensorflow-related issues #14

Open JeremyMolineau opened 4 months ago

JeremyMolineau commented 4 months ago

Hello,

I'm having a problem during model training (step 5 in USAGE), at the split_data step: the file uspto_split_indices.npz was not found. After tracing through the Python files, I had difficulty understanding how this file is generated and why.

(screenshot: npz_file_missing)

During the previous step, which generates the final reaction and reaction-template files (step 4 in USAGE), I followed reaction_selection.py in a notebook to understand the process. Are there similar Python files for understanding the model-training process (step 5)? Thanks for your help.

SGenheden commented 4 months ago

Hello.

This file should have been created by the first step of the modelling pipeline; it contains the indices of the data for the training, validation and test sets.

The easiest approach in these situations is usually to run that part of the pipeline in isolation.

Once you have activated the conda environment, you can run this step with

python -m aizynthtrain.modelling.expansion_policy.create_template_lib expansion_model_pipeline_config.yml

where expansion_model_pipeline_config.yml is the name of the config file.

Try that and see if the split file is generated or if you get another error message.

JeremyMolineau commented 4 months ago

Hello, the .npz file has been created and the split_data step has finished.

I'm encountering a new problem during the model_training step, with this output:

ValueError: The filepath provided must end in .keras (Keras model format). Received: filepath=uspto_keras_model.hdf5

I don't understand this, because in configs.py the file is specified as training_checkpoint: str = "keras_model.hdf5". I tried running the step with the command you proposed and got the same error:

python -m aizynthtrain.modelling.expansion_policy.training expansion_model_pipeline_config.yml

SGenheden commented 4 months ago

This seems to be an issue with a new Tensorflow version. I will try to find where this is coming from.

But you could try changing the filename in the config file and see if that helps. It doesn't have to end with ".hdf5" (that is just what we used historically), so if TF now wants it to end with ".keras" you can change the config file accordingly.
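For illustration, such an override might look like the fragment below. The key name follows configs.py; the exact YAML layout of expansion_model_pipeline_config.yml is an assumption, so treat this as a sketch rather than a verified config:

```yaml
# Hypothetical fragment of expansion_model_pipeline_config.yml:
# point the training checkpoint at a .keras file so newer Keras accepts it
training_checkpoint: uspto_keras_model.keras
```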

JeremyMolineau commented 4 months ago

It works with ".keras" instead of ".hdf5". I then encountered a small problem with three parameters in keras_utils.py passed to fit() (max_queue_size=20, workers=1 and use_multiprocessing=False): they were not recognized as parameters. The model runs without them. It seems that these parameters are specific to fit_generator and not to fit.
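One version-portable way to handle arguments that newer Keras dropped from fit() is to filter keyword arguments against the callable's signature before the call. The helper below (filter_supported_kwargs is a hypothetical name, not part of aizynthtrain, and the tiny fit() is a stand-in so the sketch runs without TensorFlow) assumes the target function does not silently swallow unknown arguments via **kwargs:

```python
import inspect

def filter_supported_kwargs(func, kwargs):
    """Keep only the kwargs that func's signature actually accepts."""
    params = inspect.signature(func).parameters
    # If func takes **kwargs, everything is accepted as-is.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {k: v for k, v in kwargs.items() if k in params}

# Stand-in for Model.fit in a Keras version that dropped the old arguments:
def fit(x, y=None, epochs=1):
    return {"epochs": epochs}

legacy_kwargs = {
    "epochs": 5,
    "max_queue_size": 20,          # no longer accepted by fit()
    "workers": 1,                  # no longer accepted
    "use_multiprocessing": False,  # no longer accepted
}
print(fit(None, **filter_supported_kwargs(fit, legacy_kwargs)))  # {'epochs': 5}
```

This keeps one call site working across TF/Keras versions instead of maintaining two code paths.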

SGenheden commented 4 months ago

Glad you got it working.

It is clear that the Tensorflow code needs to be updated to support the latest version, so thank you for your patience and for reporting these issues.

JeremyMolineau commented 4 months ago

Thank you for your prompt reply.

I'm encountering an error when converting the Keras model to ONNX:

TypeError: Unable to reconstruct an instance of 'partial' because the class lacks a from_config() method. Full object config: {'module': 'functools', 'class_name': 'partial', 'config': 'top10_acc', 'registered_name': 'partial'}

I've tried changing the way top10_acc and top50_acc are registered, but haven't found a way to load the model with keras.models.load_model().
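For reference, one workaround for this class of error (a sketch, not verified against this pipeline) is to replace the functools.partial metrics with ordinary named functions, which Keras can serialize and look up by name; in Keras 3 you would typically also decorate them with keras.saving.register_keras_serializable. The toy metric below is a stand-in for tf.keras.metrics.top_k_categorical_accuracy so the example runs without TensorFlow:

```python
def top_k_hit(y_true_index, scores, k=5):
    """Toy stand-in for top_k_categorical_accuracy: 1.0 if the true
    class index is among the k highest-scoring entries, else 0.0."""
    top_k = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    return 1.0 if y_true_index in top_k else 0.0

# Instead of functools.partial(top_k_hit, k=10) -- which Keras 3 cannot
# reconstruct, since partial has no from_config() -- define named wrappers:
def top10_acc(y_true_index, scores):
    return top_k_hit(y_true_index, scores, k=10)

def top50_acc(y_true_index, scores):
    return top_k_hit(y_true_index, scores, k=50)

print(top10_acc(0, [0.9, 0.05, 0.03, 0.02]))  # 1.0: index 0 is in the top 10
```

Because the wrappers have real __name__ attributes and plain-function semantics, they round-trip through save/load via custom_objects without any from_config machinery.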

SGenheden commented 4 months ago

Hard to help with this last one unless you post a more specific error trace or share the trained model. This appears to be a Tensorflow issue, caused by changed functionality on the Tensorflow end.

JeremyMolineau commented 3 months ago

You'll find the error message in the image below. What is the best way to share the model I've trained on the USPTO data?

Thank you for your help.

(screenshot: onnx_convert_failed)

SGenheden commented 3 months ago

I am curious as to what versions of tensorflow and keras you have in your environment. According to the latest poetry.lock file you ought to have 2.8.1 for tensorflow and 2.8.0 for keras. I could not reproduce your error in my environment.

You can reproduce the behavior in the pipeline with

import functools

import tensorflow as tf
from tensorflow.keras.models import load_model
from tensorflow.keras.metrics import top_k_categorical_accuracy

# Named partials for the custom top-k accuracy metrics used during training
top10_acc = functools.partial(top_k_categorical_accuracy, k=10)
top10_acc.__name__ = "top10_acc"

top50_acc = functools.partial(top_k_categorical_accuracy, k=50)
top50_acc.__name__ = "top50_acc"

# Map the custom names stored in the saved model back to Python objects
custom_objects = {
    "top10_acc": top10_acc,
    "top50_acc": top50_acc,
    "tf": tf,
}
model = load_model(
    path_to_hdf5,  # path to the trained model file
    custom_objects=custom_objects,
)

and this worked for me with tensorflow==2.8.1, but not with the latest version.

But then I realized that you changed the file format, and I am wondering if this is causing the issues.

How big is your model file? You could share it via e.g. Dropbox.

JeremyMolineau commented 3 months ago

You are right, the versions of tensorflow (2.16.2) & keras (3.0.5) in my environment are different from the poetry.lock file. This explains many of the problems I encountered in training a new model.

I encountered difficulties installing tensorflow from poetry, so the installation of tensorflow followed #13 instead.
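For anyone hitting the same mismatch, a pip-style pin matching the versions quoted in this thread would look like the fragment below (a sketch; poetry with the repository's lock file remains the supported install route):

```text
# requirements.txt-style pins matching the versions cited in this thread
tensorflow==2.8.1
keras==2.8.0
```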

SBC-ICOA commented 3 months ago

Hello,

To follow up on what Jeremy wrote, I reinstalled everything from scratch on my machine and, with Jeremy, was able to apply the whole process from downloading the USPTO data to generating the models uspto_expansion.onnx and uspto_ringbreaker_expansion.onnx. Applying the modifications about siblings was helpful (issue #17). I have the same versions as you for tensorflow (2.8.1) and keras (2.8.0).

But I have a few questions.

During the installation process I saw the following message several times:

DEPRECATION: pytorch-lightning 1.8.3.post2 has a non-standard dependency specifier torch>=1.9.*.

In addition, when I first used aizynthtrain.pipelines.expansion_model_pipeline I got a message about tensorflow:

TypeError: Descriptors cannot be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

Downgrading protobuf from 5.26 to 3.20.2 solved this problem.

Restarting the pipeline, I got a new message:

W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2024-04-05 17:13:16.883078: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

I think the script is not able to find CUDA 11 on my machine because CUDA 12 is installed. The pipeline finally runs, but using the CPU instead of the GPU. That doesn't really matter, because it takes less than a day to get the models, and I will not have to retrain a model every day!

In my opinion, the problems are linked to the tensorflow version. To solve my problems I had to downgrade some libraries. I've seen that another solution might be to update tensorflow, but in that case I'll probably get the same problems as Jeremy.

Do you plan to integrate the latest versions of tensorflow and keras, although I suppose it could mean a lot of work to adapt your code?

Regards

SGenheden commented 3 months ago

Thanks for the feedback and your willingness to live with a few bumps.

It has made us realize that we need to substantially refactor the code, most likely moving the tensorflow models to pytorch, which is easier to maintain.