TUM-DAML / seml

SEML: Slurm Experiment Management Library

Descriptions #118

Closed dfuchsgruber closed 1 year ago

dfuchsgruber commented 1 year ago

Adds per-experiment descriptions to SEML

What does this implement/fix?

This allows the user to add descriptions to MongoDB on a per-experiment basis. The preferred way to do so is to add a description key to the seml section of the experiment YAML file, as in:

...
seml:
   description: My first experiment.
...

This description is available when running the following commands:

Both seml list and seml {collection} status support the --full-descriptions flag, which prints descriptions in full instead of truncating them to fit on a single line, e.g.:

  Collection Staged Pending Running Failed Killed Interrupted Completed Total Description(s)                        
  ────────────────────────────────────────────────────────────────────────────────────────────────────────────────  
  test            0      67       1      0      0           3         1    72 Test Description                      
  test3         160       0       0      0      0           0         0   160 "Single Experiment Description",      
                                                                              "Other Test Description"              
  ────────────────────────────────────────────────────────────────────────────────────────────────────────────────  
  Total         160      67       1      0      0           3         1   232 
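
The difference between truncated and full descriptions can be sketched roughly as follows. This is only an illustration of the behaviour described above, not SEML's actual implementation: the helper name `format_descriptions` and the `max_width` parameter are hypothetical, and the quoting of multiple descriptions simply mirrors the table output shown here.

```python
def format_descriptions(descriptions, max_width=40, full=False):
    """Join the distinct descriptions of a collection into one display string.

    Hypothetical helper (not part of SEML): with full=False the result is
    cut to at most max_width characters and ends in "..."; with full=True
    the string is returned uncut.
    """
    # Drop empty entries and duplicates, keep a stable (sorted) order.
    unique = sorted({d for d in descriptions if d})
    if len(unique) == 1:
        # A single description is shown as-is, without quotes.
        text = unique[0]
    else:
        # Several descriptions are quoted and comma-separated.
        text = ", ".join(f'"{d}"' for d in unique)
    if full or len(text) <= max_width:
        return text
    return text[: max_width - 3] + "..."


descs = ["Single Experiment Description", "Other Test Description"]
print(format_descriptions(descs))             # truncated to 40 characters
print(format_descriptions(descs, full=True))  # full text
```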

Descriptions can also be set (i.e., updated) or deleted retroactively via the new commands:

Projections to fail-trace-printing

Somewhat orthogonal to experiment descriptions: the seml {collection} print-fail-trace command was extended to also display additional config fields requested by the user via the -p/--projection flag, e.g.:

seml gal_cora_35 print-fail-trace -p '["config.data.num_splits", "config.model.num_inits"]'
╭───────────────── Experiment ID 95, Batch ID 5, Status: "KILLED", Slurm Array-Task id: 8169698-5 ─────────────────╮
│ [18:23:56][__main__][INFO] Dataset split 4, Model initialization 4                                               │
│         [19:50:58][__main__][INFO] Dataset split 4, Model initialization 4                                       │
│         [20:12:40][__main__][INFO] Dataset split 4, Model initialization 1                                       │
│         slurmstepd: error: *** JOB 8169760 ON gpu12 CANCELLED AT 2023-05-07T20:17:00 DUE TO TIME LIMIT ***       │
╰─────── Description : GAL on CoraML with 35 nodes, config.data.num_splits : 5, config.model.num_inits : 5 ────────╯

Since I found myself digging through MongoDB to see which parameters might have triggered an experiment failure, this is IMHO another very convenient feature.
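
To illustrate how the dotted keys passed to -p can map onto a nested experiment document, here is a minimal sketch. It is not the code from this PR: the helper names `resolve_projection` and `projection_footer` are hypothetical, and the shape of the `experiment` dict is an assumption based on the keys in the example above.

```python
import json


def resolve_projection(document, dotted_key, default=None):
    """Walk a nested dict along a dotted path like 'config.data.num_splits'."""
    current = document
    for part in dotted_key.split("."):
        # Stop early if the path leaves the nested-dict structure.
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


def projection_footer(document, projection_json):
    """Build a 'key : value' footer from a JSON list of dotted keys."""
    keys = json.loads(projection_json)
    return ", ".join(f"{key} : {resolve_projection(document, key)}" for key in keys)


# Assumed document layout, for illustration only.
experiment = {"config": {"data": {"num_splits": 5}, "model": {"num_inits": 5}}}
print(projection_footer(experiment, '["config.data.num_splits", "config.model.num_inits"]'))
# prints: config.data.num_splits : 5, config.model.num_inits : 5
```

Keys that are missing from the document resolve to `None` in this sketch rather than raising, which seems like a reasonable choice when the goal is to inspect failed runs whose configs may differ.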

Additional information

n-gao commented 1 year ago

Thanks a lot! Great contribution. A few more comments:

n-gao commented 1 year ago

I additionally implemented seml xyz description list.

n-gao commented 1 year ago

LGTM