List database and print fail trace

dfuchsgruber commented 1 year ago

What does this implement/fix?

This PR adds two convenience functions to SEML:

Listing

seml list [pattern] lists all collections in the database that match an (optional) regex pattern. It also prints the state counts for all experiments in each collection. Note that, in contrast to seml {collection} status, it does not verify that experiments in state RUNNING are actually still running and have not died or been interrupted. Skipping this check keeps the command fast.

Example output:

seml list .*_pneuma
WARNING: Status of RUNNING experiments may not reflect if they have died or been canceled. Use `seml ... status` instead.
+------------+--------+---------+---------+--------+--------+-------------+-----------+-------+
| Collection | STAGED | PENDING | RUNNING | FAILED | KILLED | INTERRUPTED | COMPLETED | Total |
+------------+--------+---------+---------+--------+--------+-------------+-----------+-------+
| gat_pneuma |      0 |       0 |       0 |      7 |      0 |           0 |        65 |    72 |
| gcn_pneuma |      0 |       0 |       2 |      1 |      0 |           0 |        45 |    48 |
| hpo_pneuma |      0 |       0 |       0 |     24 |     49 |           0 |        47 |   120 |
+------------+--------+---------+---------+--------+--------+-------------+-----------+-------+
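The listing logic described above can be sketched without a database connection. This is a minimal illustration, not the PR's implementation: `filter_collections` and `count_states` are hypothetical helper names, the state columns are copied from the example table, and the use of `re.search` (rather than a full match) is an assumption about how the pattern is applied.

```python
import re
from collections import Counter

# Column order copied from the example table above; the real list of
# states in SEML may differ (assumption).
STATES = ["STAGED", "PENDING", "RUNNING", "FAILED",
          "KILLED", "INTERRUPTED", "COMPLETED"]


def filter_collections(names, pattern=r".*"):
    """Return the collection names that match the regex, sorted by name.

    Uses re.search, so the pattern may match anywhere in the name
    (assumption: the actual command may anchor the pattern differently).
    """
    regex = re.compile(pattern)
    return sorted(name for name in names if regex.search(name))


def count_states(experiments):
    """Count experiments per state from their stored 'status' field.

    No liveness check is done for RUNNING experiments, which mirrors the
    warning printed by the command. Documents without a known status are
    ignored and do not contribute to the total.
    """
    counts = Counter(exp.get("status") for exp in experiments)
    row = {state: counts.get(state, 0) for state in STATES}
    row["Total"] = sum(row.values())
    return row
```

With a real database, the names would come from something like PyMongo's `Database.list_collection_names()` and the experiments from iterating a collection.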

Fail Trace Printing

seml {collection} print-fail-trace prints the fail trace of all experiments in a "failed state" (FAILED, KILLED, INTERRUPTED). This way, you don't have to use a MongoDB client every time you want to inspect error logs.

Example output (shortened, to only show the output format):

seml gat_pneuma print-fail-trace
***** Experiment ID 1, status: FAILED, slurm array-id, task-id: 7921342-0 *****
        Traceback (most recent call last):
          File "/nfs/staff-ssd/fuchsgru/miniconda3/envs/trajectory_gnn/lib/python3.10/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 92, in _call_target
    ...
TypeError("new() received an invalid combination of arguments - got (str, int), but expected one of:\n * (*, torch.device device)\n      didn't match because some of the arguments have invalid types: (\x1b[31;1mstr\x1b[0m, \x1b[31;1mint\x1b[0m)\n * (torch.Storage storage)\n * (Tensor other)\n * (tuple of ints size, *, torch.device device)\n * (object data, *, torch.device device)\n")
full_key: model.layers0

***** Experiment ID 35, status: FAILED, slurm array-id, task-id: 7930611-2 *****
        Traceback (most recent call last):
          File "/nfs/homedirs/fuchsgru/seml/seml/hydra.py", line 98, in decorator
    result = func(cfg)
          ...
        torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.75 GiB (GPU 0; 39.59 GiB total capacity; 37.05 GiB already allocated; 1.21 GiB free; 37.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

...

n-gao commented 1 year ago

Thanks a lot for the great contribution! I have not yet had the time to review it fully, but here are a few things I noticed. For seml list:

seml list fails if a collection is not created by seml or contains non-seml items.
For consistency with the rest of the code base and for better configurability, we use logging.info instead of print.
Should we use a MongoDB aggregation (via PyMongo) to compute the sums instead of summing in Python? That would reduce the network load and offload more work to the database, which is typically the most efficient approach.
Should we just use pandas for pretty printing since we already have it as a requirement?
For backward and future compatibility, we should iterate over all aliases for the same state rather than taking only the first one.
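Two of the points above (counting in the database and honouring all state aliases) can be sketched together. This is only an illustration under stated assumptions: the alias table `STATE_ALIASES` is hypothetical, as are the helper names `status_count_pipeline` and `fold_aliases`. The `$group` stage with `$sum` is standard MongoDB aggregation.

```python
from collections import Counter

# Hypothetical alias table: each canonical state maps to every spelling
# that may be stored in the database (assumption, for illustration only).
STATE_ALIASES = {
    "STAGED": ["STAGED", "QUEUED"],
    "PENDING": ["PENDING"],
    "RUNNING": ["RUNNING"],
    "FAILED": ["FAILED"],
    "KILLED": ["KILLED"],
    "INTERRUPTED": ["INTERRUPTED"],
    "COMPLETED": ["COMPLETED"],
}


def status_count_pipeline():
    """Aggregation pipeline that lets MongoDB count documents per status."""
    return [{"$group": {"_id": "$status", "count": {"$sum": 1}}}]


def fold_aliases(aggregated, aliases=STATE_ALIASES):
    """Merge per-status counts so that all aliases of a state are summed.

    Rows whose status is missing or unknown (for example documents that
    were not created by seml) are skipped instead of raising an error.
    """
    lookup = {alias: state for state, names in aliases.items() for alias in names}
    counts = Counter({state: 0 for state in aliases})
    for row in aggregated:
        state = lookup.get(row["_id"])
        if state is not None:
            counts[state] += row["count"]
    return dict(counts)
```

Given a PyMongo collection object, the rows would come from `collection.aggregate(status_count_pipeline())`, so only one small document per distinct status crosses the network.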

For print-fail-trace:

logging.info instead of print
Probably add a final message saying how many experiments have been crawled and how many have been printed. This would catch typos in the collection name, since at the moment the program just exits without any message.
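A rough sketch of both suggestions for print-fail-trace, using logging and a closing summary. The field names (`status`, `fail_trace`, `slurm.array_id`, `slurm.task_id`) and the function name `log_fail_traces` are assumptions made for illustration. The header format is copied from the example output in the PR description.

```python
import logging

# States treated as "failed", as listed in the PR description.
FAILED_STATES = ("FAILED", "KILLED", "INTERRUPTED")


def log_fail_traces(experiments, logger=None):
    """Log the fail trace of every failed experiment and a final summary.

    `experiments` is any iterable of experiment documents (dicts).
    Returns (crawled, printed) so callers can check the counts.
    """
    logger = logger or logging.getLogger(__name__)
    crawled = printed = 0
    for exp in experiments:
        crawled += 1
        if exp.get("status") not in FAILED_STATES:
            continue
        slurm = exp.get("slurm", {})  # assumed layout of the Slurm info
        logger.info(
            "***** Experiment ID %s, status: %s, slurm array-id, task-id: %s-%s *****",
            exp.get("_id"), exp["status"],
            slurm.get("array_id"), slurm.get("task_id"),
        )
        # Assumed: the trace is stored as a list of lines (a plain string
        # passes through unchanged).
        logger.info("".join(exp.get("fail_trace", [])))
        printed += 1
    # The summary makes a mistyped (and therefore empty) collection visible.
    logger.info("Crawled %d experiments, printed %d fail traces.", crawled, printed)
    return crawled, printed
```

With a mistyped collection name the iterable is empty, so the summary reports zero crawled experiments instead of the program exiting silently.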

dfuchsgruber commented 1 year ago

Thanks for the feedback, I hope I addressed your points. Remaining remarks:

Should we just use pandas for pretty printing since we already have it as a requirement?

I implemented it this way now and removed the prettytable dependency, but I find the result visually less appealing. It is a matter of taste in the end. It now looks like this:

seml list .*pneuma
WARNING: Status of RUNNING experiments may not reflect if they have died or been canceled. Use `seml ... status` instead.
                STAGED  PENDING  RUNNING  FAILED  KILLED  INTERRUPTED  COMPLETED  Total
Collection                                                                             
gat_pneuma           0        0        0       7       0            0         65     72
gcn_pneuma           0        0        2       1       0            0         45     48
hpo_pneuma           0        0        0      24      49            0         47    120
pneuma               0        0        1       0       0            0          9     10
pneuma_spatial       0        0        0       2       1            0         13     16

TUM-DAML / seml