Closed LyuboKotop closed 6 days ago
Hello,
My best guess is that the validation step broke down for batch 38 and hence it did not produce the necessary CSV file. You should have some record of the error message from this step in the log, but it is easy to miss and often hard to interpret.
It is a bit cumbersome to debug this, but the easiest is to try to run the validation step manually, but only for batch 38.
First, figure out the batch start and end indices. The easiest is likely to add a print-statement in aizynthtrain/pipelines/template_pipeline.py at line 57, printing idx, start, and end.
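As a sketch, the added print could look like the following. The variable names idx, start, and end are taken from the description above; the surrounding loop is invented here purely for illustration and is not the actual template_pipeline.py code:

```python
# Hypothetical sketch: in template_pipeline.py, idx, start and end come
# from the batch-splitting loop; this loop is invented for illustration.
nbatches, nrows = 4, 10
for idx in range(nbatches):
    start = idx * nrows // nbatches
    end = (idx + 1) * nrows // nbatches
    print(f"Batch {idx}: start={start}, end={end}")
```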
Second, run
python -m rxnutils.pipeline.runner --pipeline aizynthtrain/pipelines/data/reaction_validation_pipeline.yaml --data imported_data.py --output temp.csv --max-workers 1 --batch START END --no-intermediates
where imported_data.py is the CSV produced in the step before, and "START" and "END" are the batch start and end indices that you got from the print-statement.
This should give you a clear error message.
So I added the print statement and ran the pipeline with:
python -m aizynthtrain.pipelines.template_pipeline run --config template_pipeline_config.yml --max-workers 32 --max-num-splits 200
Last print came as Start: 38, End: 39.
So I then ran:
python -m rxnutils.pipeline.runner --pipeline reaction_validation_pipeline.yaml --data imported_data.py --output temp.csv --max-workers 1 --batch 38 39 --no-intermediates
But got error:
FileNotFoundError: [Errno 2] No such file or directory: 'imported_data.py'.
I am a bit unsure where the "imported_data.py" file should come from.
This should be generated from the previous step.
But I gave you the wrong standard name: it should be called imported_reactions.csv, and it should be available in your folder.
Still cannot find this file:
FileNotFoundError: [Errno 2] No such file or directory: 'imported_reactions.csv'
Are you running the pipeline from this folder?
When you ran python -m aizynthtrain.pipelines.template_pipeline, did you run it in this folder?
That pipeline should have taken your custom data and imported it into a format that is compatible with the rest of the pipeline. And this import should have created a CSV file that is by default called imported_reactions.csv.
So if this file is not created, I am questioning how the pipeline could be run at all, and how you are importing your custom data. Please provide more details.
I run the pipeline inside the folder that contains the template_pipeline_config file (aizynthtrain/configs/uspto), which is not this folder.
I tried running the pipeline in the folder you suggested (aizynthtrain/pipelines/data), but I got an error since the template_pipeline_config.yml file was not in this folder. Therefore, I copied the config file into the folder and reran the pipeline.
This time the pipeline ran longer than usual but led to a new error:
2024-08-28 16:39:26.210 [1724863064098243/reaction_selection/205 (pid 217306)]
Ok. This error is easier to understand:
It comes from trying to execute this line
2024-08-28 16:39:26.881 [1724863064098243/reaction_selection/205 (pid 217306)] ----> 1 info = data["id"].str.extract(r"_P(?P<product_no>\d)$", expand=False)
The code assumes that all of your IDs in the ID-column are strings. If they are not, this will lead to an error.
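A small self-contained example of the failure mode described above (the regex is the one from the traceback; the dummy DataFrame and column values are invented for illustration):

```python
import pandas as pd

# Numeric IDs: the pandas .str accessor only works on string values,
# so str.extract raises an AttributeError here.
data = pd.DataFrame({"id": [101, 102]})
try:
    data["id"].str.extract(r"_P(?P<product_no>\d)$", expand=False)
except AttributeError as exc:
    print(f"fails: {exc}")

# Fix: cast the ID column to strings before extracting.
data["id"] = data["id"].astype(str)
info = data["id"].str.extract(r"_P(?P<product_no>\d)$", expand=False)
print(info.isna().all())  # these dummy IDs have no "_P<digit>" suffix
```

Casting with astype(str) (or making sure the IDs are strings in the input CSV) avoids the error.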
I am curious that your earlier error appears to have vanished when you re-ran the pipeline.
My intention with the previous comment was that you should not execute the full template-extraction pipeline but rather run python -m rxnutils.pipeline.runner to check an individual step in the pipeline.
I ran the rxnutils.pipeline.runner in the aizynthtrain/pipelines/data folder.
I originally ran the full template-extraction pipeline in the aizynthtrain/configs/uspto folder, which led to the original error (validated_reactions.csv), while now, when I run the full template pipeline in the aizynthtrain/pipelines/data folder, I get the ID error.
Ok, so I changed the values in my "ID" column to be strings, which got rid of the str error, and the pipeline was able to run a bit further (I ran it inside the aizynthtrain/pipelines/data folder).
A new error has occurred now:
2024-08-29 10:25:55.398 [1724927051903642/template_extraction_join/242 (pid 311757)]
It seems that a new csv is missing now.
So what happens if you run the suggested command
python -m rxnutils.pipeline.runner --pipeline aizynthtrain/pipelines/data/reaction_validation_pipeline.yaml --data imported_reactions.csv --output temp.csv --max-workers 1 --batch START END --no-intermediates
in the same folder?
(aizynthtrain) kotop@DESKTOP-S2DEI0D:~/aizynthtrain/aizynthtrain/pipelines/data$ python -m rxnutils.pipeline.runner --pipeline reaction_validation_pipeline.yaml --data imported_reactions.csv --output temp.csv --max-workers 1 --batch 38 39 --no-intermediates
Running isotope_info (extract and remove isotope information from reactions)
Running remove_unsanitizable (removing molecules that is not sanitizable by RDKit)
Running reagents2reactants (putting all reagents to reactants)
Running reactants2reagents (putting all non-reacting reactants as reagents)
Running remove_extra_atom_mapping (removing atom maps in reactants and reagents not in products)
Running neutralize_molecules (neutralize molecules using RDKit neutralizer)
Running remove_unsanitizable (removing molecules that is not sanitizable by RDKit)
Running remove_unchanged_products (Remove unchanged products)
Running count_components (counting reactants, reagents, products and mapped versions of these)
Running pseudo_reaction_hash (calculate hash based on InChI key of components)
Running count_elements (calculate the occurence of elements in the reactants)
Running productsize (number of heavy atoms in product)
Running product_atommapping_stats (count number of number of unmapped and widow product atoms)
Running hasunmappedradicalatom (detect if there is an unmapped radical in the reaction SMILES)
Running unsanitizablereactants (detect if there is unsanitizable reactants)
Running maxrings (maximum number of rings)
Running ringnumberchange (ring change based on number of rings)
Running ringbondmade (ring change based on ring bond made)
Running ringmadesize (largest ring made)
Running cgr_created (flag if a CGR can be created for the reaction)
Running cgr_dynamic_bonds (number of dynamic bonds in the CGR)
Seems to be working fine. This is for START END 38 39.
It is a bit worrying that it failed on another batch this time. I also see that you have very small batches. How many data points do you have?
I remember that we had some issues when the number of data points was low compared to the number of batches requested, and I thought we fixed this. But it is perhaps worth trying to set the number of batches to something like 50. You can do that in the yaml-file you are providing to the pipeline:
file_prefix: SOMETHING
nbatches: 50
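To illustrate why a high nbatches can bite with a small dataset, here is a sketch; the actual splitting logic in rxnutils may differ, and this just assumes an even np.array_split-style split:

```python
import numpy as np

# 1000 reactions split over 200 batches: only 5 rows per batch.
# If a step filters out every row of such a tiny batch, that batch
# may never write its output CSV, and a later combine step can fail.
n_rows, nbatches = 1000, 200
batches = np.array_split(np.arange(n_rows), nbatches)
print({len(b) for b in batches})
```

With 20 or 50 batches each batch holds substantially more rows, which matches the observation below that a lower nbatches made the pipeline go through.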
So my dataset consists of about 1000 reactions. Maybe the dataset is too small?
It should work. But let's try with 20 or 50 batches. The default of 200 comes from my dataset size which is on the order of millions.
Ok, 50 batches did not work. But it seems that 20 batches did the trick:
2024-08-29 13:10:52.661 [1724937000866286/template_validation_join/69 (pid 323145)] RxnSmilesClean ... TemplateGivesOtherReactants
2024-08-29 13:10:52.662 [1724937000866286/template_validation_join/69 (pid 323145)] 0 [O:1]=[C:2]([N:3]1[CH2:4][CH2:5][C:6]2([CH2:7]... ... False
2024-08-29 13:10:52.662 [1724937000866286/template_validation_join/69 (pid 323145)] 1 [CH3:1][CH2:2][CH2:3][CH2:4][CH2:5][CH2:6][c:1... ... False
2024-08-29 13:10:52.663 [1724937000866286/template_validation_join/69 (pid 323145)] 0 [CH3:1]N:2[CH2:4][CH2:5][CH2:6][C:7... ... False
2024-08-29 13:10:52.663 [1724937000866286/template_validation_join/69 (pid 323145)] 1 [CH3:1]C:2[O:4][c:5]1[cH:6][cH:7][cH... ... False
2024-08-29 13:10:52.664 [1724937000866286/template_validation_join/69 (pid 323145)] 0 [O:1]=[C:2]([NH:3][CH2:4][CH2:5][CH2:6][c:17]1... ... False
2024-08-29 13:10:52.664 [1724937000866286/template_validation_join/69 (pid 323145)]
2024-08-29 13:10:52.664 [1724937000866286/template_validation_join/69 (pid 323145)] [5 rows x 13 columns]
2024-08-29 13:10:52.665 [1724937000866286/template_validation_join/69 (pid 323145)] LYUBOMIR: Successfully wrote batch file: reaction_templates_validated.csv
2024-08-29 13:10:53.085 [1724937000866286/template_validation_join/69 (pid 323145)] LYUBOMIR: Going out of combine_csv
2024-08-29 14:10:53.086 [1724937000866286/template_validation_join/69 (pid 323145)] Task finished successfully.
2024-08-29 14:10:53.091 [1724937000866286/template_selection/70 (pid 323211)] Task is starting.
Executing: 100%|██████████| 16/16 [00:03<00:00, 4.46cell/s]
2024-08-29 14:11:00.249 [1724937000866286/template_selection/70 (pid 323211)] Task finished successfully.
2024-08-29 14:11:00.254 [1724937000866286/end/71 (pid 323333)] Task is starting.
2024-08-29 13:11:02.759 [1724937000866286/end/71 (pid 323333)] Report on extracted reaction is located here: reaction_selection_report.html
2024-08-29 13:11:03.146 [1724937000866286/end/71 (pid 323333)] Report on extracted templates is located here: template_selection_report.html
2024-08-29 14:11:03.148 [1724937000866286/end/71 (pid 323333)] Task finished successfully.
2024-08-29 14:11:03.148 Done!
Ok. Annoying error, hard to debug. How many reactions were left from the selection? And how many templates were produced in the end? Are these acceptable numbers?
Total number of extracted reactions = 950
Total number of extracted templates = 35 (1.00%)
This sounds alright to me. The percentage of extracted unique templates is about what we have for USPTO or our internal data.
Ok, so it was just the nbatches (I ran it in the original folder and it worked, so it doesn't matter which folder as long as it has the config.yml).
The most annoying part is that I remember I tried to reduce nbatches myself in the config file, but maybe I didn't go as low as 20 :'(
Would I need to set nbatches to 20 for the expansion pipeline too?
I tried running the expansion model pipeline inside the folder where the template pipeline files were generated and got the following error:
2024-08-29 18:52:00.820 [1724957504598758/create_template_metadata/2 (pid 331309)]
I removed the
routes_to_exclude:
statement from the expansion model config yml file, which led to another batch error, so I added the nbatches: 20 statement here too. After doing this, there was another error (FYI: I was able to generate the uspto_keras_model.hdf5 trained Keras model and the uspto_unique_templates.csv.gz template library for AiZynthFinder despite the error):
2024-08-29 19:08:13.773 [1724958364809134/model_validation/28 (pid 344216)] Traceback (most recent call last):
2024-08-29 19:08:15.269 [1724958364809134/model_validation/28 (pid 344216)] File "/home/kotop/miniconda3/envs/aizynthtrain/bin/aizynthcli", line 8, in
Maybe something to do with the model evaluation statements inside the config file (possibly stock_for_finding: stock_for_eval_find.hdf5)? I am not entirely sure what purpose those serve, maybe I do not need them for my particular dataset and should omit them?
Yes, if you do
expansion_model_evaluation:
file_prefix: uspto
stock_for_finding:
target_smiles:
stock_for_recovery:
reference_routes:
it will not do the multistep evaluation.
Otherwise you can download the files from here: https://zenodo.org/records/7341155
2024-09-03 09:39:10.590 [1725356268009387/model_validation/9 (pid 349115)]
It seems that they can't be left blank. They can be completely omitted though.
I can't really find the evaluation files in the link you provided. Are they named differently? This is what I am after:
stock_for_finding: stock_for_eval_find.hdf5
target_smiles: smiles_for_eval.txt
stock_for_recovery: stock_for_eval_recov.txt
reference_routes: routes_for_eval.json
I only managed to find the ref_routes.
Ok, my bad. Try to leave them as empty strings:
expansion_model_evaluation:
file_prefix: uspto
stock_for_finding: ""
target_smiles: ""
stock_for_recovery: ""
reference_routes: ""
or I have attached 3 of the files here routes_for_eval.json stock_for_eval_recov.txt smiles_for_eval.txt
The routes_for_eval.json and stock_for_eval_recov.txt files are subsets of PaRoutes
smiles_for_eval.txt is a subset of ChEMBL
The stock for finding is the ZINC stock that can be downloaded from here: https://figshare.com/articles/dataset/AiZynthFinder_a_fast_robust_and_flexible_open-source_software_for_retrosynthetic_planning/12334577?file=23086469
Ok thanks.
What would be the downside of leaving them as empty strings? I guess we wouldn't be able to evaluate the performance properly? Although, I believe that even if I use the files you provided, the evaluation will still not be relevant, since my model has been trained on a specific type of reactions which would probably not be present in your files anyway. I guess the only solution would be to prepare custom evaluation files myself with relevant reactions.
Also, as a conclusion for this thread, do you know why higher nbatches values lead to improper execution of the pipelines in terms of missing validated_reactions csv files?
Regarding evaluation: this was introduced as a quick way to assess your models, to give you an indication when you do frequent retraining. If you are just doing a one-off training, I would suggest evaluating on a larger dataset "manually".
Regarding the batch number issue: I don't have a good answer now; we will have to look into this.
Hey, I am currently trying to train a model with my own dataset. So I pre-processed my data with the rxnutils and rxnmapper pipelines and I got my data_mapped.csv file that I would use for training.
I then updated the template_pipeline_config.yml file with the path to my dataset and ran the aizynthtrain template pipeline with: python -m aizynthtrain.pipelines.template_pipeline run --config template_pipeline_config.yml --max-workers 32 --max-num-splits 200
It seems that aizynthtrain works correctly, as I get multiple outputs with "Running..." and "Task finished successfully", but I always end up with the error "FileNotFoundError: [Errno 2] No such file or directory: 'validated_reactions.csv.38'".
I also tried running the template pipeline on the USPTO data from the tutorial (even putting it in the same directory as the template_pipeline config file) and it still comes up with the same error.
I looked into it, and I believe the issue might come from where aizynthtrain calls rxnutils, specifically in the batch_utils.py file within the combine_csv_batches function. After adding some print statements, it seems that we get inside the combine_csv_batches function, but none of the print statements within the _write_csv function produce any output. I believe the _write_csv function does not get called for some reason.
I would greatly appreciate it if you could help out with this one.
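For reference, the combine step described in this post conceptually does something like the following. This is a simplified sketch, not the actual rxnutils implementation; the function name is borrowed from the post, but the body here is an assumption:

```python
import pandas as pd

def combine_csv_batches(filename: str, nbatches: int) -> None:
    # Read every per-batch file ("<filename>.0", "<filename>.1", ...)
    # and concatenate them into the final CSV.  If one batch step
    # failed or never wrote its file, pd.read_csv raises the kind of
    # FileNotFoundError ('validated_reactions.csv.38') reported above.
    frames = [pd.read_csv(f"{filename}.{idx}") for idx in range(nbatches)]
    pd.concat(frames, ignore_index=True).to_csv(filename, index=False)
```

Under this reading, the missing-file error is a downstream symptom: the real failure is whatever stopped batch 38 from producing its output in the first place.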