TUM-DAML / seml

SEML: Slurm Experiment Management Library

List database and print fail trace #114

Closed dfuchsgruber closed 1 year ago

dfuchsgruber commented 1 year ago

What does this implement/fix?

This PR adds two convenience functions to SEML:

Listing

seml list [pattern] lists all collections in the database that match an (optional) regex pattern. It also prints the state counts for all experiments in each collection. Note that, in contrast to seml {collection} status, it does not verify that experiments with state RUNNING are actually still running and have not died or been interrupted. Skipping this check keeps the command fast.

Example output:

seml list .*_pneuma
WARNING: Status of RUNNING experiments may not reflect if they have died or been canceled. Use `seml ... status` instead.
+------------+--------+---------+---------+--------+--------+-------------+-----------+-------+
| Collection | STAGED | PENDING | RUNNING | FAILED | KILLED | INTERRUPTED | COMPLETED | Total |
+------------+--------+---------+---------+--------+--------+-------------+-----------+-------+
| gat_pneuma |      0 |       0 |       0 |      7 |      0 |           0 |        65 |    72 |
| gcn_pneuma |      0 |       0 |       2 |      1 |      0 |           0 |        45 |    48 |
| hpo_pneuma |      0 |       0 |       0 |     24 |     49 |           0 |        47 |   120 |
+------------+--------+---------+---------+--------+--------+-------------+-----------+-------+
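
A minimal sketch of the counting step behind such a listing, not the code from this PR. It assumes the collection contents are already available as an in-memory mapping from collection name to experiment states (in SEML they would come from MongoDB), and that the pattern is applied with re.search; the names count_states and STATES are hypothetical.

```python
import re
from collections import Counter

# State columns in the order shown in the example table above.
STATES = ["STAGED", "PENDING", "RUNNING", "FAILED",
          "KILLED", "INTERRUPTED", "COMPLETED"]


def count_states(collections, pattern=".*"):
    """Count experiment states per collection.

    `collections` is a hypothetical stand-in for the database: a dict that
    maps a collection name to an iterable of state strings.
    """
    regex = re.compile(pattern)
    rows = {}
    for name in sorted(collections):
        # Assumption: the pattern may match anywhere in the name.
        if not regex.search(name):
            continue
        counts = Counter(collections[name])
        row = {state: counts.get(state, 0) for state in STATES}
        # Total only covers the known states listed in STATES.
        row["Total"] = sum(row.values())
        rows[name] = row
    return rows
```

Because the function only looks at the stored state strings, it shares the caveat from the warning above: a RUNNING count may include experiments that have since died.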

Fail Trace Printing

seml {collection} print-fail-trace prints the fail trace of all experiments in a "failed state" (FAILED, KILLED, INTERRUPTED). This way, you don't have to use a MongoDB client every time you want to inspect error logs.

Example output (shortened, to only show the output format):

seml gat_pneuma print-fail-trace
***** Experiment ID 1, status: FAILED, slurm array-id, task-id: 7921342-0 *****
        Traceback (most recent call last):
          File "/nfs/staff-ssd/fuchsgru/miniconda3/envs/trajectory_gnn/lib/python3.10/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 92, in _call_target
    ...
TypeError("new() received an invalid combination of arguments - got (str, int), but expected one of:\n * (*, torch.device device)\n      didn't match because some of the arguments have invalid types: (\x1b[31;1mstr\x1b[0m, \x1b[31;1mint\x1b[0m)\n * (torch.Storage storage)\n * (Tensor other)\n * (tuple of ints size, *, torch.device device)\n * (object data, *, torch.device device)\n")
full_key: model.layers0

***** Experiment ID 35, status: FAILED, slurm array-id, task-id: 7930611-2 *****
        Traceback (most recent call last):
          File "/nfs/homedirs/fuchsgru/seml/seml/hydra.py", line 98, in decorator
    result = func(cfg)
          ...
        torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.75 GiB (GPU 0; 39.59 GiB total capacity; 37.05 GiB already allocated; 1.21 GiB free; 37.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

...
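
The output format above could be produced along these lines. This is a sketch under assumptions, not the PR's implementation: the document keys ('_id', 'status', 'slurm' with 'array_id' and 'task_id', and 'fail_trace' as a list of lines) are assumed for illustration, and format_fail_traces is a hypothetical helper that works on plain dicts instead of querying MongoDB.

```python
# States treated as "failed" in the description above.
FAILED_STATES = ("FAILED", "KILLED", "INTERRUPTED")


def format_fail_traces(experiments):
    """Return the fail traces of all failed experiments as one string.

    `experiments` is an iterable of dicts; the key names used here are
    assumptions for this sketch, not a confirmed SEML schema.
    """
    blocks = []
    for exp in experiments:
        if exp.get("status") not in FAILED_STATES:
            continue  # skip experiments that did not fail
        slurm = exp.get("slurm", {})
        header = (
            f"***** Experiment ID {exp['_id']}, status: {exp['status']}, "
            f"slurm array-id, task-id: "
            f"{slurm.get('array_id')}-{slurm.get('task_id')} *****"
        )
        # The trace is assumed to be stored as a list of text lines.
        trace = "".join(exp.get("fail_trace", []))
        blocks.append(header + "\n" + trace)
    return "\n".join(blocks)
```

Returning a string instead of printing directly keeps the formatting easy to test; a caller would simply print the result.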
n-gao commented 1 year ago

Thanks a lot for the great contribution! I have not yet had the time to fully review, but here are a few things I noticed. For seml list:

For print-fail-trace:

dfuchsgruber commented 1 year ago

Thanks for the feedback, I hope I addressed your points. Remaining remarks:

  • Should we just use pandas for pretty printing since we already have it as a requirement?

I implemented it this way now and removed the prettytable dependency, but IMHO the result is visually less appealing. A matter of taste in the end. It now looks like this:

seml list .*pneuma
WARNING: Status of RUNNING experiments may not reflect if they have died or been canceled. Use `seml ... status` instead.
                STAGED  PENDING  RUNNING  FAILED  KILLED  INTERRUPTED  COMPLETED  Total
Collection                                                                             
gat_pneuma           0        0        0       7       0            0         65     72
gcn_pneuma           0        0        2       1       0            0         45     48
hpo_pneuma           0        0        0      24      49            0         47    120
pneuma               0        0        1       0       0            0          9     10
pneuma_spatial       0        0        0       2       1            0         13     16
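
For reference, a table in this style can be rendered with pandas roughly as follows. This is only a sketch of the approach discussed here, assuming the per-collection counts are already available as nested dicts; render_state_table and STATES are hypothetical names, not taken from the PR.

```python
import pandas as pd

# State columns in the order shown in the table above.
STATES = ["STAGED", "PENDING", "RUNNING", "FAILED",
          "KILLED", "INTERRUPTED", "COMPLETED"]


def render_state_table(rows):
    """Render per-collection state counts as plain text.

    `rows` maps a collection name to a dict of state -> count; states
    missing from a dict are shown as 0.
    """
    records = [
        {state: counts.get(state, 0) for state in STATES}
        for counts in rows.values()
    ]
    df = pd.DataFrame(records, index=list(rows), columns=STATES)
    df["Total"] = df.sum(axis=1)      # row-wise total over all states
    df.index.name = "Collection"      # printed on its own header line
    return df.to_string()
```

The named index is what puts "Collection" on a separate line below the column headers, which is the layout difference compared to the prettytable version.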