Matgenix / qtoolkit

Record job and DB ID in slurm job #40

Open · janosh opened this issue 2 weeks ago

janosh commented 2 weeks ago

sbatch has a handy --comment flag that allows adding metadata to jobs:

sbatch -h | grep comment
      --comment=name          arbitrary comment

i'd like to use this to connect my slurm jobs back to the corresponding jobs in the database, either via their jobflow UUID or their database ID. i saw there's no explicit support for comment in SlurmIO:

https://github.com/Matgenix/qtoolkit/blob/e95110ab51f747128bbd290a324ef7ba64db5d0b/src/qtoolkit/io/slurm.py#L148-L176

maybe qverbatim is meant as an escape hatch for situations like this? i didn't find any docs on it and am also not sure how to set it. this raises

QResources(
    processes=16, job_name="name", qverbatim="test"
).as_dict()
>>> TypeError: QResources.__init__() got an unexpected keyword argument 'qverbatim'

maybe like this? i didn't try it yet, but either way it would be good to document how to pass qverbatim:

QResources(
    processes=16, job_name="name", scheduler_kwargs={"qverbatim": "test"}
).as_dict()
davidwaroquiers commented 1 week ago

Hi @janosh

Thanks for the question. Regarding qverbatim, indeed, it should be used as in your second example:

qr = QResources(
    processes=16,
    job_name="name",
    scheduler_kwargs={"qverbatim": '#SBATCH --comment="this is my comment"'}
)

QResources is indeed meant to be used as a common object for specifying resources. The difficulty is that not all DRMs provide the same functionality. This is why there is scheduler_kwargs, which lets you pass anything you want yourself (through qverbatim in Slurm).
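
So for your original use case, continuing the example above, something along these lines should work (the db_id value is of course just a placeholder for the actual database ID):

db_id = 1234  # placeholder: replace with the actual database ID
qr = QResources(
    processes=16,
    job_name="name",
    scheduler_kwargs={"qverbatim": f'#SBATCH --comment="db_id={db_id}"'},
)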

Now, if comment were a common option across DRMs (currently qtoolkit only supports PBS and Slurm), we could add it to QResources. I'm not sure PBS has an equivalent though.

That said, maybe there is a different question to be raised as to why you want the db id in the comment in the first place, and maybe there is a different way to achieve that (perhaps in jobflow-remote)? Open to discussing if needed.

Pinging @gpetretto in case I said something wrong here :-)

janosh commented 1 week ago

thanks @davidwaroquiers, that's very helpful!

i wanted the DB ID in the comment to be able to map from the output of squeue, which only shows slurm IDs, back to the corresponding DB entries. i asked this question before realizing that scontrol show job <slurm-id> shows the run_dir, which in turn contains the job's UUID, which i can use to get the corresponding DB entry. so like you said, there is a solution that doesn't require SBATCH --comment.
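
for reference, here's a rough sketch of that lookup, assuming the UUID appears in the WorkDir path that scontrol reports and that the DB is something like a pymongo collection keyed on "uuid" (the names are placeholders):

import re
import subprocess

def db_entry_for_slurm_job(slurm_id, collection):
    # scontrol reports the job's working directory as WorkDir=...
    out = subprocess.run(
        ["scontrol", "show", "job", str(slurm_id)],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    workdir = re.search(r"WorkDir=(\S+)", out).group(1)
    # the run_dir path contains the job UUID in hyphenated hex form
    uuid = re.search(r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}", workdir).group(0)
    # fetch the corresponding DB entry by UUID
    return collection.find_one({"uuid": uuid})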

i'll still be using --comment to record things like formula, n_sites and volume as job metadata since it makes debugging jobs that take an inordinate amount of time easier. currently, mapping from the slurm ID to the run_dir or DB entry is a slow process that i have to repeat for dozens of calcs just to see whether it's a large system or whether the calculation uses bad settings and stalled for some reason.
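
for the record, roughly what i have in mind (the metadata values are just placeholders; the whole line is passed through verbatim into the #SBATCH header):

from qtoolkit.core.data_objects import QResources

meta = {"formula": "Fe2O3", "n_sites": 10, "volume": 120.5}  # placeholder metadata
comment = ",".join(f"{key}={val}" for key, val in meta.items())

qr = QResources(
    processes=16,
    job_name="name",
    scheduler_kwargs={"qverbatim": f'#SBATCH --comment="{comment}"'},
)

squeue can then print the comment right next to the slurm ID, e.g. squeue -u $USER -o "%.18i %k" (%k is the format specifier for the job comment).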