Just as a temporary local fix, I am adding a catch-all except in the DOE loop and breaking to return the incomplete results:
```python
while k_iteration < limit:
    # Get the next recommendations and corresponding measurements
    try:
        measured = campaign.recommend(batch_size=batch_size)
    except Exception:  # catch-all: bail out and return the incomplete results
        break
```
Hi @brandon-holt 👋🏼 One part that can be easily answered: your suggestion of providing access to partial simulation results sounds absolutely reasonable, and I think I can confidently say that we'll incorporate some appropriate mechanism into the refactored module (so far, we simply haven't had the need for it because the simulations always succeeded). The small challenge I see here is that a clean handling requires more than just returning the incomplete dataframe (your workaround) or passing through the exception (current logic) because:
- Also, the mechanism needs to be compatible with all simulation layers we offer (i.e., simulating a single campaign vs. simulating multiple campaigns, etc.).

However, I think I already have some good ideas how this can be accomplished.
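To make the idea concrete, here is a rough sketch of what such a mechanism could look like when driving the loop manually. This is not the built-in simulation code; `lookup_fn` is a placeholder for however target values get attached to the recommendations:

```python
# Rough sketch, not the library's simulation routine: collect the results of
# each iteration and return whatever has accumulated if the model fit fails.
import pandas as pd


def simulate_with_partial_results(campaign, lookup_fn, batch_size, limit):
    collected = []
    for k_iteration in range(limit):
        try:
            recommended = campaign.recommend(batch_size=batch_size)
        except Exception as err:  # e.g. "All attempts to fit the model have failed."
            print(f"Stopping early at iteration {k_iteration}: {err}")
            break
        measured = lookup_fn(recommended)    # attach target values (placeholder)
        campaign.add_measurements(measured)  # feed the data back to the campaign
        collected.append(measured.assign(iteration=k_iteration))
    return pd.concat(collected, ignore_index=True) if collected else pd.DataFrame()
```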
That said, I've nothing against providing a quick workaround to unblock you, as long as the changes do not cause backward compatibility issues later. Let me draft a quick PR and see what my colleagues think about it 👍🏼 Will tag you there.
Now, the more worrisome part. So far, I haven't experienced any of the problems you describe. While I can offer to try debugging/investigating the botorch internals if we can come up with a minimal reproducing example, I would only do that as a last resort and first see if we can get a better understanding of what's going on.
So here are a few things we should consider first:

- Set `allow_recommending_already_measured` to `False`, which all our "pure" recommenders support. That way, we know for certain that no duplicates can appear in the training data throughout the simulation (a minimal configuration sketch follows this list).
- Have a look at the `comp_df` of your searchspace to see if there is anything suspicious ...
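For completeness, a minimal configuration sketch for the first point; the class names follow recent BayBE releases and may differ in your version, and `searchspace`/`objective` are assumed to already be defined:

```python
# Minimal sketch (class names may differ between BayBE versions): wrap a pure
# recommender with the flag set inside a meta recommender and hand it to the
# campaign, so that already measured points are never suggested again.
from baybe import Campaign
from baybe.recommenders import (
    RandomRecommender,
    SequentialGreedyRecommender,
    TwoPhaseMetaRecommender,
)

recommender = TwoPhaseMetaRecommender(
    initial_recommender=RandomRecommender(),
    recommender=SequentialGreedyRecommender(
        allow_recommending_already_measured=False,  # the flag discussed above
    ),
)

campaign = Campaign(
    searchspace=searchspace,  # assumed to exist already
    objective=objective,      # assumed to exist already
    recommender=recommender,
)
```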
Heyo! These are good points, I will look into them and let you know what I find!!
@AdrianSosic Okay, so after some quick and dirty initial testing, it looks like `allow_recommending_already_measured` actually causes it to fail sooner, which is potentially interesting! But I am running more extensive tests to make meaningful comparisons, which should be done in a day or so, and I'll let you know what I find.
In the meantime, I'm looking into the features in my `comp_df` for each parameter in my search space to see if any features are highly correlated. Attaching here if you're curious!
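In case it's useful to others, here is a small generic pandas helper for that kind of check; it is not part of BayBE, and the threshold is an arbitrary choice:

```python
import numpy as np
import pandas as pd


def find_correlated_features(comp_df: pd.DataFrame, threshold: float = 0.95) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = comp_df.corr().abs()
    # Keep only the upper triangle so each pair is reported once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return upper.stack().loc[lambda s: s > threshold].sort_values(ascending=False)
```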
You mean it fails sooner if you set the attribute to `False`? That's indeed a bit surprising. Curious what's going on here... 🤔 I'll let you first finish your tests and then we can have a look 👍🏼
@AdrianSosic Yes, so it definitely appears that setting `allow_recommending_already_measured=False` causes the model to reach a failure point sooner. In my tests with my dataset, which would take ~1000 iterations to test every datapoint, the model fails in:

| Setting | Result |
|---|---|
| `allow_recommending_already_measured=True` | >200 iterations |
| `allow_recommending_already_measured=False` | >100 iterations |
This could make sense because when the model isn't allowed to pick 'repeated' measurements, it is more likely to reach the model-breaking outliers/datapoints faster. However, our hypothesis was that the 'repeated' measurements that have the same features but disparate target values were in fact the ones that were breaking the model.
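A direct way to test that hypothesis would be to pull exactly those groups out of the measurement data, along these lines (a generic pandas sketch; the column names are placeholders for your parameter and target columns):

```python
import pandas as pd


def find_conflicting_duplicates(
    measurements: pd.DataFrame, feature_cols: list[str], target_col: str
) -> pd.DataFrame:
    """Rows that share identical feature values but have differing target values."""
    n_unique_targets = measurements.groupby(feature_cols)[target_col].transform("nunique")
    return measurements[n_unique_targets > 1].sort_values(feature_cols)
```

Dropping or aggregating those rows (e.g., averaging their targets) before fitting would then show whether they are really the culprit.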
@AdrianSosic Hey, just adding a repro for you in case it helps. Just a heads up: running it as is will take ~200-300 GB of RAM. If that's problematic, you could bump `percent_discretize` up to 20-50 to reduce the amount of memory required. Just note that I think there may be a point above which the issue disappears, so keep that in mind if you go to higher discretization factors.
The only concern is that by replacing my molecules' SMILES with random ones, it may change what solves the issue, but after running some initial tests on my end, the behavior looks similar to the results shown in the table above.
Hi @brandon-holt, thanks for sharing. I would like to have a look but need to postpone this until next week. Currently, we are a bit swamped with open PRs and features to be merged + need to release 0.9.0 asap, which will keep me busy for a while. Will let you know once I've had a chance to look 🙃
@AdrianSosic No worries, thanks for the heads up!
Hi! I was wondering if it would be possible to add a feature where simulations will still return the results compiled up to the point of an error?
The situation I'm running into when running on larger datasets is a botorch error to the tune of:

`All attempts to fit the model have failed.`
I am in the process of troubleshooting what about the dataset is causing the failure, but in the meantime it would be nice to see the results up to that point, which should include dozens of batches of experiments.
Also, if you have any experience with what might be causing an error like this, that would be helpful!
Referring to this comment in a botorch thread (https://github.com/pytorch/botorch/issues/1226#issuecomment-1213539656), I initially wondered if this could be my issue, but baybe should prevent it from being a problem since it identifies duplicate parameter values and randomly picks one.
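For what it's worth, the simpler flavor of that botorch issue (exact duplicate parameter configurations in the training data) is easy to check directly on the collected measurements. A generic sketch, assuming `measurements` holds the data and `param_cols` lists the parameter columns:

```python
import pandas as pd


def count_duplicate_configurations(measurements: pd.DataFrame, param_cols: list[str]) -> int:
    """Number of rows whose parameter values exactly duplicate an earlier row."""
    return int(measurements.duplicated(subset=param_cols).sum())
```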