aimhubio / aim

Aim 💫 — An easy-to-use & supercharged open-source experiment tracker.
https://aimstack.io
Apache License 2.0

Metrics explorer fails to load due to some run metrics not having `last_step` in metadata #2554

Open schrobot opened 1 year ago

schrobot commented 1 year ago

🐛 Bug

We're using PyTorch Lightning and Ray Tune with Aim's remote tracking server, on Aim version 3.15.2. We started encountering an issue where the Metrics Explorer fails to load for some runs (it hangs on "Searching over runs") — e.g. if I query for `run.hash == '<hash of broken run>'`, or run any query whose results include a broken run. The stack trace in the Aim UI logs reports:

File ".../.venv/lib/python3.10/site-packages/aim/sdk/sequence.py", line 193, in numpy
    last_step = self.meta_tree['last_step']
  File "aim/storage/treeview.py", line 51, in aim.storage.treeview.TreeView.__getitem__
  File "aim/storage/containertreeview.py", line 74, in aim.storage.containertreeview.ContainerTreeView.collect
KeyError: "No key ('last_step',) is present."

In the Run Details -> Metrics page, the run does show some of its metrics, but one metric's chart fails to load (spinning wheel). That is the metric whose sequence raises the error above.

This Run belonged to a Ray Tune Trial that was terminated after 1 validation run. The validation metric is the last metric that should have been reported to Aim. From Ray's side, the trial terminated successfully, and the validation metric was reported to Ray.

When I manually query for the Sequence object for this run's metric, via the SDK, I see

seq
---
<Sequence#938752977989852897 name=`loss` context=`<Context#5190394695475244853 {'subset': 'val'}>` run=`<Run#-2273509748222130010 name=b13c0b51dc9f43c3832c42b5 repo=<... read_only=None>>`>

dict(seq.data.meta_tree)
---
{'dtype': 'float',
 'first_step': 1023,
 'last': 0.13889867067337036,
 'version': 2}

so Aim received the metric value but, for some reason, did not record the last step.

It appears the failure to load the run stems from calling `.sample(...)` on this `SequenceV2Data`, which internally has no steps: `list(seq.data.steps.values())` yields `[]`.
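Until the write path is fixed, a read-side workaround can be sketched as below. This is a hypothetical helper of my own, not part of Aim's API: treat the sequence's `meta_tree` as a plain mapping and derive a usable last step from whatever data did land, instead of indexing `'last_step'` directly and raising `KeyError`.

```python
# Hypothetical workaround sketch (not Aim's API): derive a usable last step
# when the sequence metadata was written incompletely.

def last_step_or_fallback(meta_tree, steps):
    """Return meta_tree['last_step'] if present; otherwise fall back to the
    largest recorded step, then to 'first_step', then to None."""
    if 'last_step' in meta_tree:
        return meta_tree['last_step']
    if steps:  # some step data did land; use the largest observed step
        return max(steps)
    return meta_tree.get('first_step')  # may still be None in the worst case

# Metadata from the broken sequence shown above in this report:
broken_meta = {'dtype': 'float', 'first_step': 1023,
               'last': 0.13889867067337036, 'version': 2}

print(last_step_or_fallback(broken_meta, []))  # 1023 (falls back to first_step)
```

For the run in this report the fallback recovers `first_step` (1023), which is at least enough to avoid the hard `KeyError` in `sequence.py`; whether that value is semantically correct depends on how much of the sequence was actually flushed.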

To reproduce

I don't have a consistent repro yet.


schrobot commented 1 year ago

I did a bit more digging, and found 11 runs in total from my Tune job that suffered from this issue. All of them were terminated after the 1st validation iteration. The runs had differing amounts of metadata recorded in their respective `seq.data.meta_tree`: some had just `first_step` but no `last`, some had only `dtype` and `version`.
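The variants of partial metadata can be triaged with a small check. A sketch, with the caveat that the expected-key set below is inferred from the healthy and broken sequences in this thread, not from Aim's documented schema:

```python
# Hypothetical triage helper: report which metadata keys a sequence's
# meta_tree is missing. EXPECTED_META_KEYS is an assumption inferred from
# this thread, not Aim's documented schema.
EXPECTED_META_KEYS = {'dtype', 'version', 'first_step', 'last_step', 'last'}

def missing_meta_keys(meta_tree):
    """Return the sorted list of expected keys absent from meta_tree."""
    return sorted(EXPECTED_META_KEYS - meta_tree.keys())

# The three variants observed across the 11 broken runs:
print(missing_meta_keys({'dtype': 'float', 'first_step': 1023,
                         'last': 0.138, 'version': 2}))  # ['last_step']
print(missing_meta_keys({'dtype': 'float', 'first_step': 1023,
                         'version': 2}))                 # ['last', 'last_step']
print(missing_meta_keys({'dtype': 'float', 'version': 2}))
# ['first_step', 'last', 'last_step']
```

Running a check like this over `dict(seq.data.meta_tree)` for each metric sequence would let you enumerate every affected run before querying it in the Metrics Explorer.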

So it seems that something in the interaction between Ray -> PyTorch Lightning -> Aim Logger -> Aim Remote Tracking breaks down when a Trial is terminated.

alberttorosyan commented 1 year ago

PR #2562 addresses the data inconsistency when reading metric sequences. A full fix (preventing such cases during write operations) requires more time to implement and should be done in a separate PR.