exalearn / electrolyte-design

Workflow tools for electrolyte design project

Understand why active learning gets worse with more data #15

Closed WardLT closed 3 years ago

WardLT commented 3 years ago

We find that the performance of our active learning agent gets worse as we retrain the models. The chart below shows that we find fewer high-performing molecules with strategies where we update the model (`update` and `train`) than with a strategy where we never update the MPNNs (`no-retrain`).

[chart: high-performing molecules found by the `update`, `train`, and `no-retrain` strategies]
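One plausible reading of how the three strategies differ, as a hedged sketch only; the exact semantics of `update` and `train` live in this repo, and `build_mpnn` and `data` here are illustrative stand-ins:

```python
# Hypothetical sketch of the retraining strategies; not the repo's code
import tensorflow as tf

def retrain(model: tf.keras.Model, data, strategy: str, build_mpnn=None):
    """Return the model to use for the next active-learning round."""
    if strategy == 'no-retrain':
        return model                        # never touch the original MPNN
    if strategy == 'train':
        model = build_mpnn()                # restart from random weights
    model.fit(*data, epochs=64, verbose=0)  # 'update' warm-starts instead
    return model
```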

A list of hypotheses:

Potential solutions:

pythonpanda2 commented 3 years ago

@WardLT Something to consider going forward. https://github.com/uncertainty-toolbox/uncertainty-toolbox

We could simply pass our predictions, standard deviations, and labels to their API and let the toolbox figure out which calibration method would work best.
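If it helps, a minimal sketch of that wrapping; the arrays are synthetic stand-ins for our ensemble means, standard deviations, and labels, and the two calls are what I understand to be the toolbox's documented entry points:

```python
import numpy as np
import uncertainty_toolbox as uct

# Synthetic placeholders for MPNN ensemble outputs and true labels
rng = np.random.default_rng(0)
y_true = rng.normal(size=256)
pred_mean = y_true + rng.normal(scale=0.1, size=256)
pred_std = np.full(256, 0.3)

# Score accuracy, calibration, and sharpness in one call
metrics = uct.get_all_metrics(pred_mean, pred_std, y_true)

# Fit a scalar std recalibrator, then apply it to future predictions
recalibrator = uct.recalibration.get_std_recalibrator(pred_mean, pred_std, y_true)
pred_std_recal = recalibrator(pred_std)
```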

WardLT commented 3 years ago

Part of the puzzle: without bootstrap sampling when updating the models, and with only 4 replicas in the ensemble, our uncertainties are much worse after retraining.

[figure: calibration of the 4-member ensemble's uncertainties after retraining without bootstrap sampling]
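For context, a minimal sketch of how the ensemble uncertainties are typically computed from the replicas; `models` is a hypothetical list of the 4 MPNNs, and with that few members the standard deviation is a noisy estimate to begin with:

```python
import numpy as np

def ensemble_predict(models, inputs):
    """Return per-molecule ensemble mean and standard deviation."""
    preds = np.stack([np.ravel(m.predict(inputs)) for m in models])
    return preds.mean(axis=0), preds.std(axis=0)
```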

WardLT commented 3 years ago

The de-calibration is lessened if we use bootstrap sampling when creating the training set before updating the model.

[figure: uncertainty calibration when the training set is bootstrap-sampled before each update]
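For reference, a sketch of the bootstrap sampling step, with `train_records` standing in for the real training data:

```python
import numpy as np

rng = np.random.default_rng(1)
train_records = list(range(1000))  # placeholder for molecule records

def bootstrap_sample(records, rng):
    """Draw len(records) items with replacement for one ensemble member."""
    idx = rng.integers(0, len(records), size=len(records))
    return [records[i] for i in idx]

# Each replica trains on its own resampling, which keeps the ensemble
# diverse across retraining rounds
replica_sets = [bootstrap_sample(train_records, rng) for _ in range(4)]
```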

WardLT commented 3 years ago

We see a similar, slight degradation with the 16 bootstrapped models.

[figure: uncertainty calibration with 16 bootstrapped models]

WardLT commented 3 years ago

Training with more epochs (here, 512) can make the problem worse.

[figure: uncertainty calibration after retraining for 512 epochs]

WardLT commented 3 years ago

Resetting the optimizer's weights (i.e., its accumulated state) does seem to help. This is back to using 64 epochs to retrain the model.

Using random initial weights seems to work just as well in terms of the uncertainties.

[figure: uncertainty calibration with the optimizer reset and with random initial weights]
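A minimal sketch of the two fixes, assuming a compiled Keras model; `build_mpnn` is a hypothetical factory for a fresh, untrained network:

```python
import tensorflow as tf

def reset_optimizer(model: tf.keras.Model, lr: float = 1e-3) -> None:
    """Drop the optimizer state (e.g., Adam moment estimates) by
    recompiling with a brand-new optimizer before further training."""
    model.compile(optimizer=tf.keras.optimizers.Adam(lr), loss=model.loss)

def reset_weights(build_mpnn) -> tf.keras.Model:
    """Restart from random initial weights by rebuilding the model."""
    model = build_mpnn()
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss='mean_squared_error')
    return model
```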

WardLT commented 3 years ago

It was a bug 😆 See: 73f0579