Open schrobot opened 1 year ago
I did a bit more digging, and found 11 runs in total from my Tune job that suffered from this issue. All of them were terminated after the 1st validation iteration. The runs had differing amounts of metadata recorded in the respective `seq.data.meta_tree`; some had just `first_step` but no `last`, some had only `dtype` and `version`.
So it seems that something with the interaction between Ray -> PyTorch Lightning -> Aim Logger -> Aim Remote Tracking is breaking down when a Trial is terminated.
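For anyone wanting to check their own runs for the same pattern, here is a minimal sketch of the metadata check described above. It assumes each sequence's `meta_tree` has already been read into a plain dict keyed by run hash. The helper names are hypothetical, and the key list only covers the keys mentioned in this thread, which may not be everything Aim writes.

```python
from typing import Any, Dict, List, Mapping

# Keys mentioned in this thread. Assumption: this may not be the full set Aim records.
EXPECTED_KEYS = ("version", "dtype", "first_step", "last_step", "last")


def missing_meta_keys(meta: Mapping[str, Any]) -> List[str]:
    """Return the expected keys that are absent from one sequence's metadata."""
    return [key for key in EXPECTED_KEYS if key not in meta]


def find_incomplete(metas: Mapping[str, Mapping[str, Any]]) -> Dict[str, List[str]]:
    """Map run hash -> missing keys, listing only runs with incomplete metadata."""
    report: Dict[str, List[str]] = {}
    for run_hash, meta in metas.items():
        missing = missing_meta_keys(meta)
        if missing:
            report[run_hash] = missing
    return report


if __name__ == "__main__":
    # Hypothetical metadata mirroring the two broken shapes described above.
    example = {
        "healthy": {"version": 1, "dtype": "float", "first_step": 0, "last_step": 0, "last": 0.5},
        "partial": {"version": 1, "dtype": "float", "first_step": 0},
        "minimal": {"version": 1, "dtype": "float"},
    }
    for run_hash, missing in find_incomplete(example).items():
        print(run_hash, "is missing", missing)
```

Running this over all trials of a Tune job would show whether every broken run stopped at the same point in the write sequence or at different ones.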
PR #2562 addresses the data inconsistency when reading the metric sequences. A full fix (preventing such cases during write operations) requires more time to implement, and should be done in a separate PR.
🐛 Bug
We're using PyTorch Lightning and Ray Tune, and the remote tracking server for Aim. We're on version 3.15.2. We started encountering an issue where the Metrics Explorer fails to load (hangs on "Searching over runs") for some runs (i.e. if I query for `run.hash == '<hash of broken run>'`, or if I run any query that includes a broken run). Looking at the stack trace in the Aim UI logs, it reports:
In the Run Details -> Metrics page, the run does show some of the metrics. One of the metrics fails to load its chart (spinning wheel). This is the metric whose sequence has the above issue.
This Run belonged to a Ray Tune Trial that was terminated after 1 validation run. The validation metric is the last metric that should have been reported to Aim. From Ray's side, the trial was terminated successfully, and the validation metric was reported to Ray.
When I manually query for the `Sequence` object for this run's metric via the SDK, I see
so Aim received the metric value, but for some reason did not mark the last step.
It appears the issue with loading the run stems from calling `.sample(...)` on this `SequenceV2Data`, which internally has no `steps`: running `list(seq.data.steps.values())` yields `[]`.
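To illustrate the failure mode, here is a minimal sketch of a read-side guard that returns an empty result when a sequence has no steps instead of attempting to sample it. This is not Aim's implementation and is not meant to describe what any particular fix does. The function `sample_steps` and its dict-based input are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def sample_steps(steps: Dict[int, float], num_points: int) -> List[Tuple[int, float]]:
    """Return up to num_points evenly spaced (step, value) pairs.

    An empty sequence yields an empty list rather than raising or hanging.
    """
    if num_points <= 0 or not steps:
        return []
    ordered = sorted(steps.items())
    total = len(ordered)
    if total <= num_points:
        return ordered
    if num_points == 1:
        return [ordered[-1]]
    # Integer arithmetic keeps the first and last points and avoids duplicate indices.
    indices = [i * (total - 1) // (num_points - 1) for i in range(num_points)]
    return [ordered[i] for i in indices]


if __name__ == "__main__":
    print(sample_steps({}, 50))        # broken run: metadata exists but no steps
    print(sample_steps({0: 0.5}, 50))  # healthy run with a single validation step
```

A guard like this only hides the symptom on the read path. The write path would still need to record the step and `last_step` consistently when a trial is terminated.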
To reproduce
Don't have a consistent repro yet.
Expected behavior
`last_step` is correctly tracked on the Aim side in this setting.
Environment
Additional context