aimhubio / aim

Aim 💫 — An easy-to-use & supercharged open-source experiment tracker.
https://aimstack.io
Apache License 2.0

Metrics explorer fails to load due to some run metrics not having `last_step` in metadata #2554

Open schrobot opened 1 year ago

schrobot commented 1 year ago

🐛 Bug

We're using PyTorch Lightning and Ray Tune with Aim's remote tracking server, on Aim version 3.15.2. We started encountering an issue where the Metrics Explorer fails to load for some runs (it hangs on "Searching over runs") — e.g. if I query for `run.hash == '<hash of broken run>'`, or run any query whose results include a broken run. The stack trace in the Aim UI logs reports:

File ".../.venv/lib/python3.10/site-packages/aim/sdk/sequence.py", line 193, in numpy
    last_step = self.meta_tree['last_step']
  File "aim/storage/treeview.py", line 51, in aim.storage.treeview.TreeView.__getitem__
  File "aim/storage/containertreeview.py", line 74, in aim.storage.containertreeview.ContainerTreeView.collect
KeyError: "No key ('last_step',) is present."

In the Run Details -> Metrics page, the run does show some of its metrics, but one metric's chart fails to load (spinning wheel). That is the metric whose sequence raises the error above.

This Run belonged to a Ray Tune Trial that was terminated after 1 validation run. The validation metric is the last metric that should have been reported to Aim. From Ray's side, the trial terminated successfully, and the validation metric was reported to Ray.

When I manually query for the Sequence object for this run's metric, via the SDK, I see

seq
---
<Sequence#938752977989852897 name=`loss` context=`<Context#5190394695475244853 {'subset': 'val'}>` run=`<Run#-2273509748222130010 name=b13c0b51dc9f43c3832c42b5 repo=<... read_only=None>>`>

dict(seq.data.meta_tree)
---
{'dtype': 'float',
 'first_step': 1023,
 'last': 0.13889867067337036,
 'version': 2}

so Aim received the metric value but, for some reason, did not record the last step.

It appears the failure to load the run stems from calling `.sample(...)` on this `SequenceV2Data`, which internally has no steps: `list(seq.data.steps.values())` yields `[]`.
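Until the write path is fixed, a read-side workaround can be sketched as below. This is a hypothetical helper of my own, not part of Aim's API: treat the sequence's `meta_tree` as a plain mapping and derive a usable last step from whatever data did land, instead of indexing `'last_step'` directly and raising `KeyError`.

```python
# Hypothetical workaround sketch (not Aim's API): derive a usable last step
# when the sequence metadata was written incompletely.

def last_step_or_fallback(meta_tree, steps):
    """Return meta_tree['last_step'] if present; otherwise fall back to the
    largest recorded step, then to 'first_step', then to None."""
    if 'last_step' in meta_tree:
        return meta_tree['last_step']
    if steps:  # some step data did land; use the largest observed step
        return max(steps)
    return meta_tree.get('first_step')  # may still be None in the worst case

# Metadata from the broken sequence shown above in this report:
broken_meta = {'dtype': 'float', 'first_step': 1023,
               'last': 0.13889867067337036, 'version': 2}

print(last_step_or_fallback(broken_meta, []))  # 1023 (falls back to first_step)
```

For the run in this report the fallback recovers `first_step` (1023), which is at least enough to avoid the hard `KeyError` in `sequence.py`; whether that value is semantically correct depends on how much of the sequence was actually flushed.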

To reproduce

I don't have a consistent repro yet.


schrobot commented 1 year ago

I did a bit more digging, and found 11 runs in total from my Tune job that suffered from this issue. All of them were terminated after the 1st validation iteration. The runs had differing amounts of metadata recorded in their respective `seq.data.meta_tree`: some had just `first_step` but no `last`, some had only `dtype` and `version`.
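The variants of partial metadata can be triaged with a small check. A sketch, with the caveat that the expected-key set below is inferred from the healthy and broken sequences in this thread, not from Aim's documented schema:

```python
# Hypothetical triage helper: report which metadata keys a sequence's
# meta_tree is missing. EXPECTED_META_KEYS is an assumption inferred from
# this thread, not Aim's documented schema.
EXPECTED_META_KEYS = {'dtype', 'version', 'first_step', 'last_step', 'last'}

def missing_meta_keys(meta_tree):
    """Return the sorted list of expected keys absent from meta_tree."""
    return sorted(EXPECTED_META_KEYS - meta_tree.keys())

# The three variants observed across the 11 broken runs:
print(missing_meta_keys({'dtype': 'float', 'first_step': 1023,
                         'last': 0.138, 'version': 2}))  # ['last_step']
print(missing_meta_keys({'dtype': 'float', 'first_step': 1023,
                         'version': 2}))                 # ['last', 'last_step']
print(missing_meta_keys({'dtype': 'float', 'version': 2}))
# ['first_step', 'last', 'last_step']
```

Running a check like this over `dict(seq.data.meta_tree)` for each metric sequence would let you enumerate every affected run before querying it in the Metrics Explorer.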

So it seems that something in the interaction between Ray -> PyTorch Lightning -> Aim Logger -> Aim Remote Tracking breaks down when a Trial is terminated.

alberttorosyan commented 1 year ago

PR #2562 addresses the data inconsistency when reading metric sequences. A full fix (preventing such cases during write operations) requires more time to implement and should be done in a separate PR.