Dear Vinh,
Thanks for your interest! It is not normal to get NaNs; it means that the coordinate values exploded during the generation process.
It didn't happen in my own runs; here is the log from eval_stdout.txt:
INFO - run_lib.py - 2023-05-25 18:18:38,203 - model size: 105.1MB
INFO - run_lib.py - 2023-05-25 18:19:16,077 - Sampling -- ckpt: 30
INFO - run_lib.py - 2023-05-25 18:20:23,442 - model size: 105.1MB
INFO - run_lib.py - 2023-05-25 18:21:01,266 - Sampling -- ckpt: 30
INFO - run_lib.py - 2023-05-25 18:40:03,915 - Number of molecules: 10000
INFO - run_lib.py - 2023-05-25 18:40:03,915 - 3D atom stability: 0.9929, mol stability: 0.9368, validity: 0.9680, complete: 0.9665, unique & valid: 0.9396, novelty: 0.8898
INFO - run_lib.py - 2023-05-25 18:40:22,904 - 3D FCD: 0.9627, SNN: 0.4897, Frag: 0.9782, Scaf: 0.7942, IntDiv: 0.9161
INFO - run_lib.py - 2023-05-25 18:40:36,460 - 2D atom stability: 0.9988, mol stability: 0.9867, validity: 0.9896, complete: 0.9892, unique & valid: 0.9603, novelty: 0.9003
INFO - run_lib.py - 2023-05-25 18:40:54,812 - 2D FCD: 0.1494, SNN: 0.5168, Frag: 0.9871, Scaf: 0.9252, IntDiv: 0.9189
INFO - run_lib.py - 2023-05-25 18:40:54,812 - Mean QED: 0.4626, MCF: 0.5450, SAS: 4.4780, logP: 0.0721, MW: 124.3350
I don't know whether this is due to the precision of different GPUs or to some random factor; at least I can get normal results with different seeds.
Maybe you could check that the checkpoint is loaded correctly, or first try to generate a few samples with a smaller batch size, e.g. --config.eval.batch_size 100 --config.eval.num_samples 100.
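Purely as an illustration (this is generic PyTorch, not code from this repo, and the checkpoint filename is an assumption), a quick sanity check could look like:

```python
import torch

# Hypothetical checkpoint path -- substitute the actual file under your workdir.
ckpt = torch.load('exp_uncond/vpsde_qm9_jodo/checkpoints/checkpoint_30.pth',
                  map_location='cpu')

# 1) Confirm the loaded weights themselves are finite.
state = ckpt['model'] if isinstance(ckpt, dict) and 'model' in ckpt else ckpt
bad = [k for k, v in state.items()
       if torch.is_tensor(v) and not torch.isfinite(v).all()]
print('non-finite parameters:', bad)  # should be an empty list

# 2) After sampling a small batch, check the generated coordinates directly,
#    e.g. for a position tensor `pos` of shape (num_samples, num_atoms, 3):
# assert torch.isfinite(pos).all(), 'NaN/Inf in generated coordinates'
```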
Best, Han
Dear Han,
Thank you very much for the quick response!
I still have the same problem even when the number of samples is 1, 10, or 100; it happens for all of the generated samples.
It could also be due to a mismatch between my installed library versions and the recommended ones, so I will try setting up the environment again.
Best, Vinh
Dear Vinh,
In that case, I think you can try training a small model on QM9 first, running it through the evaluation, and checking whether you get reasonable samples.
Best, Han
For anyone facing the same issue:
It is a problem with using different versions of PyG. It turns out that line 172 in models/layers.py,
extra_inf_heads[extra_inf_heads==0.] = -float('inf')
is not numerically stable with newer PyG versions, so I changed it to
extra_inf_heads[extra_inf_heads==0.] = -1e10
and this solved the issue.
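To see why this matters, here is a minimal standalone sketch (plain PyTorch, not the actual JODO attention code): when an entire row of attention logits gets masked to -inf, the softmax over that row becomes 0/0 and returns NaN, while a large finite negative value keeps it well defined.

```python
import torch

# Toy attention logits for one query over 4 keys; suppose the masking step
# hits every entry in the row (all entries equal to 0. before masking).
logits = torch.zeros(1, 4)

masked_inf = logits.masked_fill(logits == 0., float('-inf'))
masked_big = logits.masked_fill(logits == 0., -1e10)

print(torch.softmax(masked_inf, dim=-1))  # tensor([[nan, nan, nan, nan]])
print(torch.softmax(masked_big, dim=-1))  # tensor([[0.25, 0.25, 0.25, 0.25]])
```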
@FilippNikitin Thank you for locating this issue! Could you get comparable performance after the modification? I am considering updating the relevant code directly in the repo.
Hi @GRAPH-0,
Yes, I got results very similar to those stated in the paper. Moreover, I also trained the model on my own data and did not run into any other numerical issues.
Thank you for your research :)
@FilippNikitin Great honor to hear that!
Dear author,
Thank you for releasing the code! However, when I try to run the evaluation with the released checkpoint, the console prints 'warning: detected nan, resetting output to zero'.
Is this a bug, or is it normal to get NaN values like that?
Here is the reproducible script:
python main.py --config configs/vpsde_qm9_uncond_jodo.py --mode eval --workdir exp_uncond/vpsde_qm9_jodo --config.eval.ckpts '30' --config.eval.batch_size 2500 --config.sampling.steps 1000
Best, Vinh