FAIR-Chem / fairchem

FAIR Chemistry's library of machine learning methods for chemistry
https://opencatalystproject.org/
Other
724 stars 239 forks source link

(OTF) Normalization and element references #715

Open lbluque opened 1 month ago

lbluque commented 1 month ago

This PR enables (on the fly) fitting and estimation of normalization values and element references

dataset:
  train:
    tranforms:
      normalizer:
        fit:
          targets:
            - energy
          batch_size: 32
          num_batches: 1000
      element_references:
        fit:
          targets:
            - energy
          batch_size: 32
          num_batches: 1000

TODO:

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 90.07353% with 27 lines in your changes missing coverage. Please review.

Files Coverage Δ
src/fairchem/core/modules/transforms.py 55.17% <100.00%> (ø)
src/fairchem/core/datasets/ase_datasets.py 85.84% <87.50%> (-0.21%) :arrow_down:
src/fairchem/core/trainers/base_trainer.py 86.09% <93.10%> (+0.62%) :arrow_up:
src/fairchem/core/trainers/ocp_trainer.py 67.45% <89.47%> (+1.13%) :arrow_up:
src/fairchem/core/common/distutils.py 31.89% <62.50%> (+1.89%) :arrow_up:
...m/core/modules/normalization/element_references.py 94.87% <94.87%> (ø)
...fairchem/core/modules/normalization/_load_utils.py 81.08% <81.08%> (ø)
.../fairchem/core/modules/normalization/normalizer.py 91.30% <91.30%> (ø)
misko commented 1 week ago

@lbluque This is a massive lift! This PR is awesome! 💯 💯 💯 I am bookmarking this and striving for this kind of clarity and test coverage! ❤️ LGTM!

zulissimeta commented 3 days ago

Can you clarify what the procedure is to update / re-reference linear references for a pre-trained model? Run the script, load the resulting config, and update the pre-trained model's config?

lbluque commented 3 days ago

Can you clarify what the procedure is to update / re-reference linear references for a pre-trained model? Run the script, load the resulting config, and update the pre-trained model's config?

If you already have linear references fit. Then you can simply pass those as an npz file as follows:

      element_references:
        energy:
          file: /path/to/linref.npz

And make sure you remove the lin_ref keyword in the dataset section.

If you want to use normalization values you already computed you can keep that the same as in previous configs.

Any checkpoint that you save using this branch will then have the new linref and normalization modules.

lbluque commented 1 day ago

Overall looks good — going to be a nice addition. A couple high level comments:

  • It might be nice to offload some of code added to the trainer to utils or other files, to keep the trainers clean/easy to read.
  • Do you know roughly the largest number of samples that can be processed OTF without hitting memory issues? Would be nice to to include an estimate.
  • Are you planning to include a convergence plot of the OTF numbers (lin ref and norm) on a given dataset?

Thanks for the careful review @wood-b !

Have a look at the changes when you get a chance and let me know what you think!