@lawrence-chillrud I think your solution of dealing with the experiment files, rather than with the object returned by `tuner.experiment`, is preferable. The files are persistent and can be revisited at any time, whereas the object is transient.
I did some investigation and wanted to mention that it's also possible to achieve this using the object. The `scope="all"` parameter of `ray.tune.ResultGrid.get_best_result` returns the single best epoch across all trials, not just the last epoch's result. It is also possible to iterate over the `Result`s in the `ResultGrid` object, calling `Result.best_checkpoints` on each to build a longer list (see the sketch below). Makes me feel a little better about Ray.
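A minimal sketch of that object-based route, assuming Ray 2.x, a `Tuner` already configured elsewhere, and a placeholder metric name (`balanced_accuracy`); the `scope="all"` keyword is used as described above:

```python
# `tuner` is the ray.tune.Tuner configured elsewhere for the experiment.
results = tuner.fit()  # ResultGrid (in-memory, transient)

# Single best epoch across all trials, not just each trial's last report.
best = results.get_best_result(metric="balanced_accuracy", mode="max", scope="all")
print(best.metrics["balanced_accuracy"], best.checkpoint)

# Or walk every trial's retained checkpoints to build a longer list.
all_best = []
for result in results:
    # Result.best_checkpoints is a list of (Checkpoint, metrics-dict) pairs;
    # it is only populated when CheckpointConfig keeps scored checkpoints.
    all_best.extend(result.best_checkpoints)

all_best.sort(key=lambda pair: pair[1].get("balanced_accuracy", float("-inf")),
              reverse=True)
```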
Should the outputs include the checkpoint directories? This information is not contained in the `results.json` files, but it can be inferred from the sorting index where you call `final_df.sort_values`.
Related to this, I don't see a way to report the best epoch of a trial in the reporter. I think it's hardwired to report the current trial state for running trials and the last epoch for completed trials. For the MNIST example, the last epoch is the best for most trials because of the stopping criteria, but that may not be true for all projects.
I think we can live with this if we have our own analysis tools. The top-k analysis can be run during an experiment, so for long-running experiments we can still monitor progress (see the sketch below).
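A rough sketch of that kind of file-based analysis, safe to run while trials are still going because it only reads the persisted per-trial results files; the experiment path, results-file pattern, and metric name are placeholders rather than the project's actual layout:

```python
import glob
import json
import os

import pandas as pd


def best_epoch_per_trial(experiment_dir: str, metric: str) -> pd.DataFrame:
    """Scan each trial's results file and keep the best epoch reported so far."""
    rows = []
    for path in glob.glob(os.path.join(experiment_dir, "*", "result*.json")):
        with open(path) as f:
            # Each line is one epoch's report (newline-delimited JSON).
            records = [json.loads(line) for line in f if line.strip()]
        if not records:
            continue  # trial has not reported anything yet
        best = max(records, key=lambda r: r.get(metric, float("-inf")))
        rows.append({
            "trial_dir": os.path.dirname(path),
            metric: best.get(metric),
            "training_iteration": best.get("training_iteration"),
        })
    if not rows:
        return pd.DataFrame()
    return pd.DataFrame(rows).sort_values(metric, ascending=False)


# Can be re-run periodically to monitor a long-running experiment.
print(best_epoch_per_trial(os.path.expanduser("~/ray_results/mnist_experiment"),
                           "balanced_accuracy").head(10))
```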
> Should the outputs include the checkpoint directories? This information is not contained in the `results.json` files, but it can be inferred from the sorting index where you call `final_df.sort_values`.
Included the `checkpoint_path`s of the trials in the `final_df` returned by the function -- see the `checkpoint_path` column.
Note: if the `metric` specified to `get_top_k_trials` is not the same as the metric that was passed to the initial Ray Tune experiment (i.e., the metric Ray Tune is optimizing for), then many (or possibly all) of the checkpoints for those trials will not exist, and `None` will be reported rather than a `checkpoint_path`. E.g., if Ray Tune was told to optimize for balanced accuracy, but the user then passes AUC as the metric for `get_top_k_trials` to sort by, trials could be returned in the `final_df` that have no saved checkpoint. However, if in this example the user passes in balanced accuracy as the metric, then a `checkpoint_path` exists for every trial (provided `drop_dups=True`) and will be returned accordingly.
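A hypothetical usage sketch of the behaviour described above; apart from `metric` and `drop_dups`, the argument names here (the experiment path and `k`) are assumptions, not necessarily the function's real signature:

```python
# Sort by the same metric Ray Tune optimized for, so every returned trial
# should carry a checkpoint_path (given drop_dups=True).
final_df = get_top_k_trials(
    "~/ray_results/mnist_experiment",  # assumed: path to the experiment
    metric="balanced_accuracy",        # matches the metric Ray Tune optimized
    k=10,                              # assumed: number of trials to return
    drop_dups=True,
)

# Sorting by a different metric (e.g. AUC) can return trials whose best epoch
# was never checkpointed; those rows report checkpoint_path as None.
missing = final_df[final_df["checkpoint_path"].isna()]
print(f"{len(missing)} of the returned trials have no saved checkpoint")
```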
Wrote a function to handle the issue that Ray Tune's experiment reports only show the last-epoch performance rather than the best. For the details, please see the function documentation!