MolecularAI / aizynthtrain

Tools to train synthesis prediction models
Apache License 2.0

ValueError: The filepath provided must end in `.keras` (Keras model format) #22

Open cespos opened 4 days ago

cespos commented 4 days ago

Hi!

I have been trying to use AiZynthTrain to train AiZynthFinder on my own reactions and reaction templates. I have already mapped and cleaned the reaction and template files with my own protocols, and my goal is to retrain AiZynthFinder without running any additional cleaning or preparation steps.

I have used the expansion pipeline with the following config file:

expansion_model_pipeline:
  python_kernel: aizynthtrain
  file_prefix: test
  nbatches: 200
  training_fraction: 0.9
  random_seed: 1689
  selected_ids_path: "lookup_templates.json"

And I got the following errors during training:

2024-09-16 13:44:07.814 [1726483142464316/model_training/206 (pid 3123416)] Task is starting.
2024-09-16 13:44:08.591 [1726483142464316/model_training/206 (pid 3123416)] 2024-09-16 13:44:08.591729: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-16 13:44:08.604 [1726483142464316/model_training/206 (pid 3123416)] 2024-09-16 13:44:08.604123: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-16 13:44:08.607 [1726483142464316/model_training/206 (pid 3123416)] 2024-09-16 13:44:08.607848: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-16 13:44:13.767 [1726483142464316/model_training/206 (pid 3123416)] <flow ExpansionModelFlow step model_training> failed:
2024-09-16 13:44:13.873 [1726483142464316/model_training/206 (pid 3123416)] Internal error
2024-09-16 13:44:13.875 [1726483142464316/model_training/206 (pid 3123416)] Traceback (most recent call last):
2024-09-16 13:44:13.875 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/metaflow/cli.py", line 1134, in main
2024-09-16 13:44:13.875 [1726483142464316/model_training/206 (pid 3123416)] start(auto_envvar_prefix="METAFLOW", obj=state)
2024-09-16 13:44:13.875 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/metaflow/tracing/__init__.py", line 27, in wrapper_func
2024-09-16 13:44:13.875 [1726483142464316/model_training/206 (pid 3123416)] return func(args, kwargs)
2024-09-16 13:44:14.668 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 829, in __call__
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] return self.main(args, kwargs)
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 782, in main
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] rv = self.invoke(ctx)
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 1259, in invoke
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] return _process_result(sub_ctx.command.invoke(sub_ctx))
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 1066, in invoke
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] return ctx.invoke(self.callback, ctx.params)
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 610, in invoke
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] return callback(args, kwargs)
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/metaflow/_vendor/click/decorators.py", line 21, in new_func
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] return f(get_current_context(), args, kwargs)
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/metaflow/cli.py", line 468, in step
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] task.run_step(
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/metaflow/task.py", line 650, in run_step
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] self._exec_step_function(step_func)
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/metaflow/task.py", line 62, in _exec_step_function
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] step_function()
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/aizynthtrain/pipelines/expansion_model_pipeline.py", line 83, in model_training
2024-09-16 13:44:14.670 [1726483142464316/model_training/206 (pid 3123416)] training_runner([self.config_path])
2024-09-16 13:44:14.670 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/aizynthtrain/modelling/expansion_policy/training.py", line 83, in main
2024-09-16 13:44:14.670 [1726483142464316/model_training/206 (pid 3123416)] callbacks = setup_callbacks(
2024-09-16 13:44:14.670 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/aizynthtrain/utils/keras_utils.py", line 76, in setup_callbacks
2024-09-16 13:44:14.670 [1726483142464316/model_training/206 (pid 3123416)] checkpoint = ModelCheckpoint(
2024-09-16 13:44:14.670 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/keras/src/callbacks/model_checkpoint.py", line 191, in __init__
2024-09-16 13:44:14.670 [1726483142464316/model_training/206 (pid 3123416)] raise ValueError(
2024-09-16 13:44:14.670 [1726483142464316/model_training/206 (pid 3123416)] ValueError: The filepath provided must end in `.keras` (Keras model format). Received: filepath=test_keras_model.hdf5
2024-09-16 13:44:14.670 [1726483142464316/model_training/206 (pid 3123416)]
2024-09-16 13:44:14.674 [1726483142464316/model_training/206 (pid 3123416)] Task failed.
2024-09-16 13:44:14.679 Workflow failed.
2024-09-16 13:44:14.679 Terminating 0 active tasks...
2024-09-16 13:44:14.679 Flushing logs...
    Step failure:
    Step model_training (task-id 206) failed.

where the final error is:

2024-09-16 13:44:14.670 [1726483142464316/model_training/206 (pid 3123416)] ValueError: The filepath provided must end in `.keras` (Keras model format). Received: filepath=test_keras_model.hdf5
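
For context, the `keras/src/callbacks/model_checkpoint.py` path in the traceback suggests a Keras 3 installation, where `ModelCheckpoint` appears to reject a full-model file path ending in `.hdf5`. One possible workaround is to choose the checkpoint suffix from the installed Keras major version. The sketch below is only an illustration: `checkpoint_path` is a hypothetical helper that is not part of aizynthtrain, and the `_keras_model` naming is inferred from the `test_keras_model.hdf5` path in the error message.

```python
def checkpoint_path(prefix: str, keras_version: str) -> str:
    """Return a checkpoint file name that the given Keras version should accept.

    Hypothetical helper (not part of aizynthtrain): assumes Keras 3 and later
    require the `.keras` suffix for full-model checkpoints, while Keras 2
    still accepts the legacy `.hdf5` suffix.
    """
    major = int(keras_version.split(".")[0])
    suffix = ".keras" if major >= 3 else ".hdf5"
    return f"{prefix}_keras_model{suffix}"


# Example usage with the installed version (requires keras to be importable):
#   import keras
#   path = checkpoint_path("test", keras.__version__)
```

With `file_prefix: test` this would give `test_keras_model.keras` on Keras 3 and keep `test_keras_model.hdf5` on Keras 2, so downstream steps that expect the `.hdf5` name would still need to be checked.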

Many thanks!

Carmen

cespos commented 2 days ago

I fixed it by installing specific keras and tensorflow versions:

pip install keras==2.8.0
pip install tensorflow==2.8.0
pip install tensorboard==2.8.0
pip install tensorflow-serving-api==2.8.0

To prevent this issue from occurring in the future, these pinned versions could be added to the pyproject.toml.
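
Until the versions are pinned in the package itself, a quick check of the active environment can catch a mismatch before a long pipeline run. This is a minimal sketch using only the standard library; `PINS` and `find_mismatches` are hypothetical names, and the version numbers are simply the ones from the `pip install` commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions taken from the pip commands above; adjust if other pins are used.
PINS = {
    "keras": "2.8.0",
    "tensorflow": "2.8.0",
    "tensorboard": "2.8.0",
    "tensorflow-serving-api": "2.8.0",
}


def find_mismatches(pins, get_version=version):
    """Return {package: installed_version_or_None} for every pin not satisfied."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


if __name__ == "__main__":
    for name, installed in find_mismatches(PINS).items():
        print(f"{name}: expected {PINS[name]}, found {installed}")
```

Running the script inside the `aizynthtrain` conda environment prints one line per package whose installed version differs from the pin, and nothing if everything matches.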

However, I now get another error during validation:

FileNotFoundError: [Errno 2] No such file or directory: 'testing_template_library.csv'

Even though I did not configure the validation pipeline, it still seems to run.

Best, Carmen