DS4SD / MolGrapher

MolGrapher: Graph-based Visual Recognition of Chemical Structures
https://arxiv.org/abs/2308.12234
MIT License

Accuracy issues #8

Closed (sincelover closed this issue 4 months ago)

sincelover commented 4 months ago

Hi, dear developers! Thank you very much for sharing the code!

Is the provided model already trained? The accuracy of the predictions obtained so far does not meet expectations: the current model only achieves about 67% accuracy on the uspto-10k dataset. Do I need to train it again?

lucas-morin commented 4 months ago

There seems to be a problem in the evaluation. How did you run the evaluation?

Best, Lucas

sincelover commented 4 months ago

I ran evaluate_molecular_recognition.py on the first thousand samples of uspto-10k and obtained the following results. I got similar results after predicting with run.sh (3k samples) and comparing by hand using your compute_molecule_prediction_quality function.

{"uspto-10k": {"keypoint_detector_model": "kd_model.ckpt", "graph_classifier_model": "gc_stereo_model.ckpt", "number_input_images": 1000, "number_processed_images": 1000, "molecular_precision": 0.669, "detected_error_rate": 0.0997}}

lucas-morin commented 4 months ago

Running the evaluation script using gc_no_stereo_model, for the first 1000 images of uspto-10k, I get an accuracy of 92.3%. I ran the script with the following parameters:

precompute = False
evaluate = True
test_time_augmentation = False
static_validation = True
clean_only = False
filtered_evaluation = False
filtered_no_charges_evaluation = False
preprocess = True

I computed the scores as defined here: https://github.com/DS4SD/MolGrapher/blob/4fd606b87084e5b8ed18b35b27c70ce824ea6c2a/molgrapher/scripts/evaluate/evaluate_molecular_recognition.py#L460

For information, here is the complete output of the script: eval_output.txt

How did you download and configure the benchmark? Could there be an error here?

sincelover commented 4 months ago

I used your parameters above and got the following output. It does seem to be a bit different from your output, and many messages like the following appear, so please help me look at it.

Recursive molecule creation
The number of connection point between the predicted abbreviation node and the associated sub-molecule (COH) mismatch
Spelling correction problem: multiple abbreviation candidates for Cs have the same score: ['Cy', 'CN', 'Ms', 'Ts', 'CO', 'CF', 'CsO', 'CH', 'C']. CN is chosen

Here is the complete output of the script: eval_output .txt

lucas-morin commented 4 months ago

With this information, I do not see what the problem is.

Here are the images and molfiles I used in the test above: uspto-10k_(1k_subset).txt. I suggest checking that you are using the same samples.
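As a rough illustration of that check, the sketch below compares two collections of file names by their stems and lists the samples that appear on only one side. The directory and file names are hypothetical, and the helper names `sample_ids` and `compare_samples` are not part of MolGrapher; they only show one way to do the comparison.

```python
from pathlib import Path


def sample_ids(paths):
    """Return the set of file stems (names without directory or extension)."""
    return {Path(p).stem for p in paths}


def compare_samples(mine, reference):
    """Report which sample ids appear in only one of the two collections."""
    mine_ids, ref_ids = sample_ids(mine), sample_ids(reference)
    return {
        "only_mine": sorted(mine_ids - ref_ids),
        "only_reference": sorted(ref_ids - mine_ids),
        "shared": len(mine_ids & ref_ids),
    }


# Hypothetical file names, used only to demonstrate the comparison.
report = compare_samples(
    ["images/US001.png", "images/US002.png", "images/US004.png"],
    ["molfiles/US001.mol", "molfiles/US002.mol", "molfiles/US003.mol"],
)
print(report)
```

In practice the two lists could come from `Path(...).glob("*")` over the local benchmark folder and from the names listed in the attached subset file, assuming both use the same stems.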

To debug, you may also use the visualization available in the evaluation script: https://github.com/DS4SD/MolGrapher/blob/4fd606b87084e5b8ed18b35b27c70ce824ea6c2a/molgrapher/scripts/evaluate/evaluate_molecular_recognition.py#L57 https://github.com/DS4SD/MolGrapher/blob/4fd606b87084e5b8ed18b35b27c70ce824ea6c2a/molgrapher/scripts/evaluate/evaluate_molecular_recognition.py#L56 (For that, you may need to create additional folders in ./data/.)

Best, Lucas

sincelover commented 4 months ago

Thank you very much for the suggestion. I compared a portion of the images to the molfiles and did find a mismatch. The likely reason is a loss of information when the molfiles were stored. After re-downloading the benchmark, I got 90% accuracy.
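For anyone hitting the same problem, a minimal sketch of an automated sanity check is given below. It assumes V2000 molfiles, where the fourth line is the counts line (atom count in columns 1-3, bond count in columns 4-6) and the file ends with an `M  END` line. The helper name `check_molfile_text` is hypothetical and not part of MolGrapher. It can only flag truncated or damaged files; it cannot tell whether an intact molfile actually matches its image.

```python
def check_molfile_text(text):
    """Return a description of the problem found, or None if the file looks intact.

    Assumes the V2000 molfile layout: three header lines, a counts line,
    the atom block, the bond block, then an 'M  END' terminator.
    """
    lines = text.splitlines()
    if len(lines) < 4:
        return "too short to contain a counts line"
    end_indices = [i for i, line in enumerate(lines) if line.rstrip() == "M  END"]
    if not end_indices:
        return "no 'M  END' terminator"
    counts = lines[3]
    try:
        n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    except ValueError:
        return "unreadable counts line"
    # Lines between the counts line and 'M  END' must hold all atoms and bonds.
    if end_indices[0] - 4 < n_atoms + n_bonds:
        return "fewer atom/bond lines than the counts line declares"
    return None


# Hypothetical two-atom molecule, used only to demonstrate the checks.
good = "\n".join([
    "", "  example", "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0",
    "    1.5000    0.0000    0.0000 O   0  0",
    "  1  2  1  0",
    "M  END",
])
truncated = "\n".join(good.splitlines()[:-2])  # bond line and terminator lost

print(check_molfile_text(good))
print(check_molfile_text(truncated))
```

Running such a check over every molfile in the local copy of the benchmark is a quick first pass before comparing images and molfiles by eye.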

Finally, a special thank you for answering all my questions in detail so that I could successfully get this result.

lucas-morin commented 4 months ago

You're welcome!