[ENH] Neat and automated transfer learning with OPTIMADE API for auto-adjusted problem-specific ML model generation on the fly

amkrajewski commented 7 months ago

As the title says, this new addition to the core pySIPFENN functionalities connects it to the OPTIMADE API to enable rapid adjustment of the models to any specific dataset described by an OPTIMADE query (or multiple queries). Most of the functions are neatly hidden behind a high-level API, and the default values should work well for datasets of between 100 and 10,000 data points.

You can now simply:

```python
from pysipfenn import Calculator, OPTIMADEAdjuster
c = Calculator(autoLoad=False)
c.loadModels("SIPFENN_Krajewski2022_NN30")
ma = OPTIMADEAdjuster(c, "SIPFENN_Krajewski2022_NN30", useClearML=True, device='mps') # MPS is for Apple M1 GPU

ma.fetchAndFeturize(
    'elements HAS "Hf" AND elements HAS "Mo" AND NOT elements HAS ANY "O","C","F","Cl","S"',
    parallelWorkers=4)
ma.adjust()

ma.plotStarting() # See the starting performance
ma.plotAdjusted() # See the adjusted performance
```

Or, to perform a hyperparameter search, replace the ma.adjust() call with:

```python
ma.matrixHyperParameterSearch()
ma.adjust(learningRate=0.0001, optimizer='AdamW', weightDecay=1e-05, epochs=37)
```

All model usage works as before with the Calculator class. To modify the adjusted model or export it for later use, use the dedicated classes in the modelExporters submodule.
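For readers wondering what a matrix-style hyperparameter search involves, here is a generic, minimal sketch of the idea in plain Python. It is not pySIPFENN's implementation: `grid_search` and `toy_validation_loss` are hypothetical names, and the toy loss only stands in for an actual fine-tuning run. The sketch tries every combination of learning rate, optimizer, and weight decay and keeps the one with the lowest loss.

```python
from itertools import product


def toy_validation_loss(learning_rate, optimizer, weight_decay):
    """Hypothetical stand-in for one fine-tuning run; returns a made-up loss."""
    # Assumption for illustration: the best settings are lr=1e-4, AdamW, wd=1e-5.
    optimizer_penalty = {"Adam": 0.02, "AdamW": 0.0, "SGD": 0.05}[optimizer]
    return (
        abs(learning_rate - 1e-4) * 100
        + abs(weight_decay - 1e-5) * 1000
        + optimizer_penalty
    )


def grid_search(evaluate, learning_rates, optimizers, weight_decays):
    """Evaluate every combination and return (best_settings, best_loss)."""
    best_settings, best_loss = None, float("inf")
    for lr, opt, wd in product(learning_rates, optimizers, weight_decays):
        loss = evaluate(lr, opt, wd)
        if loss < best_loss:
            best_settings = {"learningRate": lr, "optimizer": opt, "weightDecay": wd}
            best_loss = loss
    return best_settings, best_loss


best, loss = grid_search(
    toy_validation_loss,
    learning_rates=[1e-3, 1e-4, 1e-5],
    optimizers=["Adam", "AdamW", "SGD"],
    weight_decays=[1e-4, 1e-5, 1e-6],
)
print(best, loss)
```

The number of training runs is the product of the list lengths (27 in this sketch), so the total cost grows quickly with each added option and with the number of epochs per run.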

amkrajewski commented 7 months ago

Notes:

It is feature-complete.
I'm still working on the testing suite.
@rdamaral will add a neat tutorial at a future date.

codecov[bot] commented 7 months ago

Codecov Report

Attention: Patch coverage is 89.20188%, with 46 lines in your changes missing coverage. Please review.

Project coverage is 93.58%. Comparing base (74a31dd) to head (4c12c21). Report is 4 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| pysipfenn/core/modelAdjusters.py | 87.15% | 46 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main      #16      +/-   ##
==========================================
- Coverage   94.84%   93.58%   -1.27%
==========================================
  Files          17       19       +2
  Lines        1999     2432     +433
==========================================
+ Hits         1896     2276     +380
- Misses        103      156      +53
```


amkrajewski commented 7 months ago

Hi @jwsiegel2510 and @rdamaral, everything is complete and the tests are passing. It's ready to be reviewed!

amkrajewski commented 7 months ago

Hi @jwsiegel2510 and @rdamaral, I was hoping to pull it later today to align with the manuscript posting on arXiv.

ricardonpa commented 7 months ago

Hi Adam,

I've reviewed the documentation, tested the main functions, and they are working well. I also did not encounter any issues when installing this branch version in a new conda environment (Python 3.10).

Just a couple of comments:

Running the hyperparameter search step takes a long time on a CPU, so I couldn't finish testing. As discussed, please consider changing the default epochs to a lower value.
When testing OPTIMADE providers:

- OQMD returned TypeError: can only concatenate str (not "int") to str
- JARVIS returned ValidationError: 1 validation error for StructureResource.
- AFLOW returned Error: Provider ...: ('Connection aborted.', ... ))

PS: Both OQMD and JARVIS were run using the targetPath values mentioned in the documentation. Other than these, MP and Alexandria were also tested and did not raise any errors.
I also tried providing the wrong targetPath for a provider, and it raises: 'ValueError: not enough values to unpack (expected 4, got 0)'. Do you think targetPath could be fetched automatically from each provider's endpoint when defining OPTIMADEAdjuster? I'm considering this because even the current default values may eventually break, for instance, if MP decides to change their endpoint (again). Another alternative would be to have targetPath default to () rather than MP's formation energy path.

amkrajewski commented 7 months ago

Hi @rdamaral ! Thanks for the insightful comments :)

The default number of epochs for fine-tuning was reduced to 20, with documentation discussing this and mentioning that on a GPU (even a laptop one) 100 may be preferred.
The OQMD server is down, and JARVIS seems to have issues filtering. I will ask about that at the developer meeting tomorrow.
I've added a bunch of assertions that should catch unexpected user inputs and display useful messages on what went wrong. The property data paths are provider-specific and cannot be inferred a priori.
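To illustrate why a wrong targetPath fails and how an explicit check can give a clearer message, here is a minimal sketch in plain Python. It is not pySIPFENN's code: `resolve_target_path` and the sample `entry` dictionary are hypothetical, and the entry only mimics the general shape of an OPTIMADE structure entry (an `attributes` mapping holding provider-prefixed properties) with a made-up value.

```python
def resolve_target_path(entry, target_path):
    """Walk a nested dict along target_path, failing with a readable message."""
    if not target_path:
        raise ValueError(
            "targetPath is empty; provide the keys leading to the target property."
        )
    current = entry
    for depth, key in enumerate(target_path):
        if not isinstance(current, dict) or key not in current:
            walked = " -> ".join(target_path[:depth]) or "<entry root>"
            available = sorted(current) if isinstance(current, dict) else []
            raise KeyError(
                f"Key '{key}' not found under {walked}. Available keys: {available}"
            )
        current = current[key]
    return current


# Hypothetical entry; the energy value is invented for illustration only.
entry = {
    "id": "example-1",
    "attributes": {
        "elements": ["Hf", "Mo"],
        "_alexandria_formation_energy_per_atom": -0.123,
    },
}

energy = resolve_target_path(
    entry, ["attributes", "_alexandria_formation_energy_per_atom"]
)
print(energy)

try:
    resolve_target_path(entry, ["attributes", "formation_energy"])
except KeyError as error:
    # The message names the missing key and lists what the provider did return.
    print(error)
```

Listing the keys that are actually present is one way to help a user find the provider-specific property name without guessing.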

amkrajewski commented 7 months ago

I also added new functionality that allows you to override the provider and use a custom endpoint. For example:

```python
ma = pysipfenn.OPTIMADEAdjuster(
    c,
    model="SIPFENN_Krajewski2022_NN30",
    endpointOverride=["https://alexandria.icams.rub.de/pbesol"],
    targetPath=['attributes', '_alexandria_formation_energy_per_atom']
)

ma.fetchAndFeturize(
    'elements HAS "Hf" AND elements HAS "Mo" AND elements HAS "Zr"',
    parallelWorkers=2
)
```
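For context on what such a query looks like at the HTTP level, here is a small standard-library sketch that composes an element filter and the corresponding structures URL without sending any request. The helper names `build_element_filter` and `build_structures_url` are hypothetical and not part of pySIPFENN. The `/v1/structures` path and the `filter`, `response_fields`, and `page_limit` parameters follow the OPTIMADE specification's conventions, but whether a given provider serves exactly this URL is an assumption.

```python
from urllib.parse import urlencode


def build_element_filter(required, excluded=()):
    """Compose an OPTIMADE filter string from required and excluded elements."""
    clauses = [f'elements HAS "{element}"' for element in required]
    if excluded:
        quoted = ",".join(f'"{element}"' for element in excluded)
        clauses.append(f"NOT elements HAS ANY {quoted}")
    return " AND ".join(clauses)


def build_structures_url(base_url, optimade_filter, response_fields=(), page_limit=None):
    """Build a structures query URL; assumes the endpoint exposes /v1/structures."""
    params = {"filter": optimade_filter}
    if response_fields:
        params["response_fields"] = ",".join(response_fields)
    if page_limit is not None:
        params["page_limit"] = page_limit
    # urlencode percent-encodes the quotes and turns spaces into '+'.
    return f"{base_url.rstrip('/')}/v1/structures?{urlencode(params)}"


query = build_element_filter(["Hf", "Mo", "Zr"])
url = build_structures_url(
    "https://alexandria.icams.rub.de/pbesol",
    query,
    response_fields=["_alexandria_formation_energy_per_atom"],
    page_limit=10,
)
print(query)
print(url)
```

Passing `excluded=["O", "C", "F", "Cl", "S"]` to `build_element_filter` adds a `NOT elements HAS ANY ...` clause to the filter.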

ricardonpa commented 7 months ago

Nice. The endpointOverride input is very interesting from the user’s perspective, especially in the event of changes or new additions to OPTIMADE. 👍

PhasesResearchLab / pySIPFENN

[ENH] Neat and automated transfer learning with OPTIMADE API for auto-adjusted problem-specific ML model generation on the fly #16
