Following your recs @mcloughlin2, I updated the code so that all pred_vs_actual plots call the _from_df() function to graph each axis. However, the stds are always None for live pipes, according to perf_data.get_pred_values(). I think it's due to how the data is passed into perf_data.accumulate_preds(), but I didn't trace this any further. So when pp.plot_pred_vs_actual(regr_pipe, error_bars=True) is called, it just plots with no error bars because stds is None.
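For reference, here's a rough sketch (not the actual AMPL code) of the guard the shared plotting helper would need; the (ids, pred_vals, stds) return shape and the variable names like actual_vals are assumptions based on the discussion above:

```python
import matplotlib.pyplot as plt

def _plot_with_optional_error_bars(perf_data, actual_vals, error_bars=False):
    """Illustrative sketch only: plot predicted vs. actual values, drawing
    error bars only when uncertainty estimates are actually available."""
    # Assumed return shape, based on the discussion above.
    ids, pred_vals, stds = perf_data.get_pred_values()
    fig, ax = plt.subplots()
    if error_bars and stds is not None:
        # Uncertainty estimates exist, so draw them as error bars.
        ax.errorbar(actual_vals, pred_vals, yerr=stds, fmt='o', alpha=0.6)
    else:
        # For live pipes stds comes back as None, so fall back to a plain scatter.
        ax.scatter(actual_vals, pred_vals, alpha=0.6)
    ax.set_xlabel('Actual')
    ax.set_ylabel('Predicted')
    return ax
```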
I also changed uncertainty to error_bars in _from_file() and from pipe; they are necessarily different in from_df(). threshold is implemented in all 3 versions of the function.

Oh right, I forgot that uncertainties aren't computed during training, so they don't get stored in the PerfData structures. So to get error bars in plot_pred_vs_actual, we'd have to run predictions from the live pipe. That's certainly doable, and would be a little faster than simply calling plot_pred_vs_actual_from_file on the just-saved model, since the model is already loaded.
It's weird, though...we've been thinking all along that 'uncertainty' is a parameter of the model training process, when really it only comes into play at prediction time. Does it change anything about the training process (other than forcing you to include dropouts in every layer)? I guess I'll have to look at the DeepChem code to find out...
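For what it's worth, the standard Monte Carlo dropout scheme (which I believe is what DeepChem's uncertainty support builds on) really is prediction-only: training just needs dropout layers present, and the uncertainty itself comes from repeating stochastic forward passes at predict time. A generic Keras-style sketch, not DeepChem's actual API:

```python
import numpy as np

def mc_dropout_predict(model, X, n_passes=30):
    """Generic Monte Carlo dropout sketch (not DeepChem's implementation):
    uncertainty is computed purely at prediction time."""
    # Keras models accept training=True at call time, which keeps dropout
    # active during inference; other frameworks have analogous switches.
    preds = np.stack([model(X, training=True).numpy() for _ in range(n_passes)])
    # Mean over passes is the prediction; std over passes is the uncertainty
    # (i.e., the error bar) for each sample.
    return preds.mean(axis=0), preds.std(axis=0)
```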
I updated plot_pred_vs_actual to call plot_pred_vs_actual_from_file to get the predictions. I first tried calling pipe.predict_full_dataset directly, but it modifies the pipeline object in place and things got pretty confusing pretty fast.
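Roughly what that delegation looks like (a sketch only; the attribute and parameter names here, such as pipe.params.output_dir, are assumptions rather than the exact AMPL signatures):

```python
def plot_pred_vs_actual(pipe, error_bars=False, threshold=None):
    """Sketch of the delegation described above, not the actual implementation."""
    # Calling pipe.predict_full_dataset() mutates the pipeline object in place,
    # so instead reuse the file-based plotting routine on the model that the
    # training run just saved.
    model_dir = pipe.params.output_dir  # assumed location of the saved model
    plot_pred_vs_actual_from_file(model_dir, error_bars=error_bars,
                                  threshold=threshold)
```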
I think of uncertainty as influencing model selection or HPO but not directly influencing the training process.
Will merge this into 1.6.2. If more changes are needed, please continue the work in the next release.