jmf1sh closed 2 years ago
After some investigation, the issue comes from the way RunSession._start_trials deals with trials that have been started but take a long time to become known to the trial datastore. The current code basically gives the trial 5000 ms to generate its first sample (the default timeout for TrialDatastoreClient.retrieve_trials) and then starts listening to it no matter what. If a trial takes longer, the above exception is raised.
To sidestep the issue, simply increase the default timeout in TrialDatastoreClient.retrieve_trials.
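For illustration only, here is a minimal sketch of the general idea behind a more robust wait: instead of a single fixed timeout followed by listening unconditionally, keep polling until every started trial is actually reported, and fail with a clear error once an overall deadline passes. This is not the project's code. `wait_until_known`, `fetch_known_ids`, and both timing parameters are hypothetical names, and `fetch_known_ids` stands in for whatever datastore query is really used.

```python
import asyncio
import time


async def wait_until_known(fetch_known_ids, trial_ids, timeout_s=30.0, poll_interval_s=0.5):
    """Poll until every trial in trial_ids is reported as known, or raise on deadline.

    fetch_known_ids is a hypothetical coroutine: given a set of trial ids, it
    returns the subset that the datastore currently knows about.
    """
    pending = set(trial_ids)
    deadline = time.monotonic() + timeout_s

    while pending:
        # Ask only about the trials we are still waiting for.
        known = await fetch_known_ids(set(pending))
        pending -= set(known)
        if not pending:
            break
        if time.monotonic() >= deadline:
            # Fail loudly and name the missing trials instead of listening blindly.
            raise TimeoutError(
                f"trials still unknown after {timeout_s}s: {sorted(pending)}"
            )
        await asyncio.sleep(poll_interval_s)
```

A caller would await this before starting to listen to the trials, so that a slow trial causes either a longer wait or an explicit timeout naming the missing trial ids.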
The proper fix would solve two issues:
Traceback
This seems to be a race condition or an eviction issue in the trial datastore, and it is exacerbated by increasing the number of parallel trials.