(OTF) Normalization and element references

lbluque commented 1 month ago

This PR enables on-the-fly (OTF) fitting and estimation of normalization values and element references.

Normalizers and LinearReference modules are trainer attributes.
This also cleans up the use of linear references, which previously lived inside datasets: they are now saved as part of the checkpoint, so there is no need to insert them into checkpoints after training for testing/inference.
Snuck in a fix for reading ASE datasets from a list of paths.
normalization values and/or linear references can be estimated at runtime before training, e.g. with a config like this:

dataset:
  train:
    transforms:
      normalizer:
        fit:
          targets:
            - energy
          batch_size: 32
          num_batches: 1000
      element_references:
        fit:
          targets:
            - energy
          batch_size: 32
          num_batches: 1000
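
To make the `element_references` fit in the config above more concrete, here is a minimal sketch of a per-element least-squares fit on a capped number of samples. It is not the code in this PR: `fit_element_references`, `solve_linear_system`, and the toy data are hypothetical names and values, and the cap is assumed to be `batch_size * num_batches`.

```python
from collections import Counter
from itertools import islice


def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(n):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [x - factor * y for x, y in zip(m[row], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_element_references(samples, max_samples):
    """Least-squares fit of one reference energy per element.

    samples: iterable of (atomic_numbers, energy) pairs.
    max_samples: cap on samples used (assumed batch_size * num_batches).
    """
    used = list(islice(samples, max_samples))
    elements = sorted({z for numbers, _ in used for z in numbers})
    index = {z: i for i, z in enumerate(elements)}
    n = len(elements)
    # Accumulate the normal equations (X^T X) c = X^T y, where each row
    # of X counts how many atoms of each element a structure contains.
    xtx = [[0.0] * n for _ in range(n)]
    xty = [0.0] * n
    for numbers, energy in used:
        counts = Counter(numbers)
        for zi, ci in counts.items():
            xty[index[zi]] += ci * energy
            for zj, cj in counts.items():
                xtx[index[zi]][index[zj]] += ci * cj
    # Fails with a division by zero if the compositions are rank deficient.
    coeffs = solve_linear_system(xtx, xty)
    return dict(zip(elements, coeffs))


# Toy data built from made-up references H = -1.0 and O = -4.0.
toy = [([1, 1, 8], -6.0), ([1, 1], -2.0), ([8, 8], -8.0), ([1, 8], -5.0)]
references = fit_element_references(toy, max_samples=32 * 1000)
```

The result maps atomic number to a fitted reference energy. A real fit would likely use a library least-squares solver rather than hand-rolled elimination.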

added scripts to fit linear references and/or normalizers using the train dataset in a standard config (with fitting directive as specified above), i.e.

python src/fairchem/core/scripts/fit_references.py --config path/to/config.yml
python src/fairchem/core/scripts/fit_normalizers.py --config path/to/config.yml --linref-path path/energy_linref.pt

linear references can also be passed as a file in the dataset/transforms block (for example a file fit with the script above, or a legacy npz file):
```
  element_references:
    energy:
      file: /path/to/file.pt/or/npz
```
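
For readers new to linear references, a rough sketch of what applying them amounts to: the summed per-element reference is subtracted from the target and added back on prediction. The function names are hypothetical, and the layout (a sequence indexed by atomic number) is an assumption about the file contents, not something this PR specifies here.

```python
def apply_element_references(energy, atomic_numbers, references):
    """Return the energy with the per-element reference total removed."""
    return energy - sum(references[z] for z in atomic_numbers)


def undo_element_references(referenced_energy, atomic_numbers, references):
    """Add the reference total back, e.g. to report a total energy."""
    return referenced_energy + sum(references[z] for z in atomic_numbers)


# Made-up reference values indexed by atomic number (index 0 unused).
references = [0.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -4.0]
water = [1, 1, 8]
referenced = apply_element_references(-6.5, water, references)
restored = undo_element_references(referenced, water, references)
```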
normalization values can also be passed from a file for many targets (the script above generates a dict with targets and normalizers):
```
  normalizer:
    file: norms.pt
```
or they can be passed by individual files (an npz or state_dict.pt with "mean" and "std")
```
  normalizer:
    energy:
      file: energy_norms.pt  # or .npz
```
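
As an illustration of what the "mean" and "std" entries are used for, here is a minimal sketch with plain dicts standing in for the loaded files. The helper names and numbers are made up; only the "mean"/"std" keys and the per-target dict come from the description above.

```python
def normalize(value, stats):
    """Shift and scale a raw target value with stored statistics."""
    return (value - stats["mean"]) / stats["std"]


def denormalize(value, stats):
    """Map a normalized value back to the original scale."""
    return value * stats["std"] + stats["mean"]


# Stand-in for a multi-target file: one entry per target name.
norms = {"energy": {"mean": -0.5, "std": 2.0}}

scaled = normalize(3.5, norms["energy"])
recovered = denormalize(scaled, norms["energy"])
```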
using lin_ref for linear references inside datasets is still enabled for backwards compatibility.

TODO:

[x] Make sure that otf_fit does not refit on resubmission
[x] Write unit-tests
[x] Add option to run fit normalizers/element references and save

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 90.07353% with 27 lines in your changes missing coverage. Please review.

| Files | Coverage Δ | |
|---|---|---|
| src/fairchem/core/modules/transforms.py | `55.17% <100.00%> (ø)` | |
| src/fairchem/core/datasets/ase_datasets.py | `85.84% <87.50%> (-0.21%)` | :arrow_down: |
| src/fairchem/core/trainers/base_trainer.py | `86.09% <93.10%> (+0.62%)` | :arrow_up: |
| src/fairchem/core/trainers/ocp_trainer.py | `67.45% <89.47%> (+1.13%)` | :arrow_up: |
| src/fairchem/core/common/distutils.py | `31.89% <62.50%> (+1.89%)` | :arrow_up: |
| ...m/core/modules/normalization/element_references.py | `94.87% <94.87%> (ø)` | |
| ...fairchem/core/modules/normalization/_load_utils.py | `81.08% <81.08%> (ø)` | |
| .../fairchem/core/modules/normalization/normalizer.py | `91.30% <91.30%> (ø)` | |

misko commented 1 week ago

@lbluque This is a massive lift! This PR is awesome! 💯 💯 💯 I am bookmarking this and striving for this kind of clarity and test coverage! ❤️ LGTM!

zulissimeta commented 3 days ago

Can you clarify what the procedure is to update / re-reference linear references for a pre-trained model? Run the script, load the resulting config, and update the pre-trained model's config?

lbluque commented 3 days ago

> Can you clarify what the procedure is to update / re-reference linear references for a pre-trained model? Run the script, load the resulting config, and update the pre-trained model's config?

If you already have linear references fit, you can simply pass them as an npz file as follows:

      element_references:
        energy:
          file: /path/to/linref.npz

And make sure you remove the lin_ref keyword in the dataset section.

If you want to use normalization values you already computed, you can keep that part the same as in previous configs.

Any checkpoint that you save using this branch will then have the new linref and normalization modules.
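
The config change described above could be scripted roughly as follows. This is only a sketch on a plain dict: `migrate_lin_ref` is a hypothetical helper, and the location of `lin_ref` under `dataset.train` is an assumption about the old config layout.

```python
import copy


def migrate_lin_ref(config, linref_file, split="train", target="energy"):
    """Return a copy of config using element_references instead of lin_ref."""
    new_config = copy.deepcopy(config)
    dataset = new_config["dataset"][split]
    # Drop the legacy keyword if it is present (assumed location).
    dataset.pop("lin_ref", None)
    # Point the transforms block at the existing reference file.
    transforms = dataset.setdefault("transforms", {})
    refs = transforms.setdefault("element_references", {})
    refs[target] = {"file": linref_file}
    return new_config


# Made-up old-style config for illustration.
old_config = {
    "dataset": {
        "train": {
            "src": "data/train",
            "lin_ref": "old_linref.npz",
            "transforms": {"normalizer": {"energy": {"file": "energy_norms.pt"}}},
        }
    }
}
new_config = migrate_lin_ref(old_config, "/path/to/linref.npz")
```

The normalizer entry is left untouched, matching the note that existing normalization settings can stay as they are.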

lbluque commented 1 day ago

> Overall looks good, going to be a nice addition. A couple of high-level comments:
>
> It might be nice to offload some of the code added to the trainer to utils or other files, to keep the trainers clean/easy to read.
>
> Do you know roughly the largest number of samples that can be processed OTF without hitting memory issues? Would be nice to include an estimate.
>
> Are you planning to include a convergence plot of the OTF numbers (lin ref and norm) on a given dataset?

Thanks for the careful review @wood-b!

I offloaded much of the code to helper functions outside of the trainer now.
Depends on the amount of memory available on CPU, which is probably a lot for most users. My guess is the runtime when using many batches in very large datasets will be the blocker before memory issues.
Working on getting some convergence plots for some of our datasets, but I don't think this PR should wait until I have those.

Have a look at the changes when you get a chance and let me know what you think!
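
On the memory and convergence questions above, one way to look at both is a streaming estimate that records the running mean and standard deviation after every batch. The sketch below uses Welford's algorithm, which is a swapped-in technique for illustration and not necessarily how this PR computes the values; `RunningStats`, `convergence_trace`, and the synthetic data are all hypothetical.

```python
import math
import random


class RunningStats:
    """Welford's online mean/variance; memory use is constant."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def std(self):
        """Sample standard deviation (0.0 until two values are seen)."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))


def convergence_trace(batches):
    """Return (samples_seen, mean, std) after each batch."""
    stats = RunningStats()
    trace = []
    for batch in batches:
        for value in batch:
            stats.update(value)
        trace.append((stats.count, stats.mean, stats.std))
    return trace


# Synthetic "energies": 200 batches of 32 draws from N(-3, 2).
rng = random.Random(0)
batches = [[rng.gauss(-3.0, 2.0) for _ in range(32)] for _ in range(200)]
trace = convergence_trace(batches)
```

Plotting the second and third columns of `trace` against the first gives a simple convergence curve, and since only three numbers are kept per target, runtime rather than memory is the likely limit.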

FAIR-Chem / fairchem

(OTF) Normalization and element references #715
