LLNL / merlin

Machine Learning for HPC Workflows
MIT License
118 stars, 26 forks

[BUG] new “merlin status” on old workflows #469

Closed: lucpeterson closed this issue 2 months ago

lucpeterson commented 5 months ago

Bug Report

Describe the bug: If you run the new “status” command on a workflow that was run with an older version of merlin, it appears to hang while it searches for the status files.

To Reproduce:

  1. Run a workflow with an older version of merlin
  2. Call merlin status on that workflow with the newer version

Expected behavior: Some kind of warning, a direction to use the old command, and a graceful exit.
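
A minimal sketch of that expected behavior, not merlin's actual implementation. It assumes status files have a fixed name (`MERLIN_STATUS.json` is used here as an assumption) and that the search can be bounded by a directory count; the function names and the `max_dirs` limit are hypothetical.

```python
import os
import sys

# Assumed status file name; the real name used by merlin may differ.
STATUS_FILENAME = "MERLIN_STATUS.json"


def find_status_files(workspace, max_dirs=10000):
    """Collect status files under workspace, visiting at most max_dirs directories."""
    found = []
    for visited, (root, _dirs, files) in enumerate(os.walk(workspace), start=1):
        if STATUS_FILENAME in files:
            found.append(os.path.join(root, STATUS_FILENAME))
        if visited >= max_dirs:
            break  # bound the search so it cannot appear to hang
    return found


def check_status_supported(workspace):
    """Return (ok, message) describing whether status files were found."""
    files = find_status_files(workspace)
    if not files:
        return False, (
            f"No status files found in {workspace}. This workflow may have been "
            "run with an older version of merlin; use the older status command instead."
        )
    return True, f"Found {len(files)} status file(s)."


def status_or_exit(workspace):
    """Print a warning and exit with a non-zero code if no status files exist."""
    ok, message = check_status_supported(workspace)
    if not ok:
        print(f"WARNING: {message}", file=sys.stderr)
        sys.exit(1)
    return message
```

Calling `status_or_exit` on an old workspace would then print the warning and exit instead of searching indefinitely.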

bgunnar5 commented 5 months ago

So that I can keep track of them all, I'm adding on to this thread with a few additional bugs with the status command that I've encountered:

  1. When a step uses only one sample, the status file is still "condensed", but the file is then removed, so you can't view the status of that step at all.
  2. A step is expected to have just a single worker, but several workers can process a single step (for example, when a step is restarted or when multiple workers process the same step). When multiple workers process a step, the statuses cannot be condensed properly, which can cause a workflow to fail.
  3. For dry runs, the name has a typo when the progress bar is rendered.
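
One way to handle the first two items is to condense per-sample entries into a single record per step that stores workers as a list rather than a single value. This is only a sketch: the entry layout (`step`, `sample`, `worker`, `status` keys) and the function name are assumptions, not merlin's real status format.

```python
def condense_statuses(entries):
    """Condense per-sample entries into one record per step.

    Workers are stored as a sorted list, so a step processed by several
    workers (e.g. after a restart) condenses without conflict, and a step
    with a single sample still produces a record.
    """
    condensed = {}
    for entry in entries:
        step = condensed.setdefault(entry["step"], {"workers": set(), "statuses": {}})
        step["workers"].add(entry["worker"])
        # A later entry for the same sample overwrites the earlier one.
        step["statuses"][entry["sample"]] = entry["status"]
    return {
        name: {"workers": sorted(data["workers"]), "statuses": data["statuses"]}
        for name, data in condensed.items()
    }
```

For example, two entries for the same step from workers `w1` and `w2` condense into one record whose `workers` field is `["w1", "w2"]`.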

For detailed status, I need to add the export MANPAGER="less -r" call behind the scenes so users don't have to set it themselves.
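
A rough sketch of setting this behind the scenes, assuming the pager is launched as a child process that receives an environment dictionary. The helper name `pager_env` is hypothetical, and keeping a user's existing `MANPAGER` is a design choice of this sketch, not something decided in this thread.

```python
import os


def pager_env(base_env=None):
    """Return a copy of the environment with MANPAGER defaulting to 'less -r'."""
    env = dict(os.environ if base_env is None else base_env)
    # Only set the default if the user has not chosen a pager already.
    env.setdefault("MANPAGER", "less -r")
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when the pager is started, so the user's shell is left untouched.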

bgunnar5 commented 5 months ago

@lucpeterson I'm looking into the bug you mentioned now, but I'm unable to reproduce it. When I run a study with merlin v1.11.1 and then try to check the status with merlin v1.12.0, it just shows empty progress bars: [image]

Which study were you running that caused this? Maybe it only happens in certain cases.

lucpeterson commented 2 months ago

This was not actually a problem with status, but rather a server timeout delay that made it look like it was hanging.