flux-framework / flux-coral2

Plugins and services for Flux on CORAL2 systems
GNU Lesser General Public License v3.0

Saving some workflow info to job KVS #180

Open jameshcorbett opened 1 month ago

jameshcorbett commented 1 month ago

The flux-coral2-dws service currently logs a lot of information about workflows to its journal. However, @bdevcich requested that some k8s Workflow info (which can be formatted as JSON) be saved to Flux for lookup. Perhaps to the KVS somewhere? @grondo, any thoughts? If we were to dump a workflow just once, I think it would be something like 4KB total.

jameshcorbett commented 1 month ago

@bdevcich what would be the highest-priority things for us to save to Flux? The state of the workflow resource immediately before Teardown?

grondo commented 1 month ago

> @grondo, any thoughts? If we were just to dump a workflow once, I think it might be something like 4KB total.

Seems like it could just go into the job KVS. Then the information should be available via `flux job info JOBID KEY`.
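A minimal sketch of that idea in Python, assuming the service already holds a dict-like handle to the job's KVS directory. The helper name `save_workflow`, the key `dws_workflow`, the example workflow contents, and the `FakeKVSDir` stand-in are all hypothetical; how flux-coral2-dws would obtain a real handle is not shown here.

```python
import json


class FakeKVSDir(dict):
    """Hypothetical in-memory stand-in for a job KVS directory handle."""

    def commit(self):
        # A real KVS handle would flush pending writes here.
        self.committed = True


def save_workflow(kvsdir, workflow, key="dws_workflow"):
    """Store the workflow as compact JSON text under `key`.

    Returns the size of the stored payload in bytes.
    The key name is an assumption, not an established convention.
    """
    payload = json.dumps(workflow, separators=(",", ":"), sort_keys=True)
    kvsdir[key] = payload
    kvsdir.commit()
    return len(payload.encode("utf-8"))


# Illustrative workflow contents only; a real Workflow resource is much larger.
kvs = FakeKVSDir()
size = save_workflow(
    kvs, {"metadata": {"name": "example"}, "status": {"state": "PostRun"}}
)
```

If the value were stored under a job KVS key this way, it would presumably be read back with `flux job info JOBID dws_workflow`, where the key is the hypothetical one used above.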

bdevcich commented 1 month ago

> @bdevcich what would be the highest-priority things for us to save to Flux? The state of the workflow resource immediately before Teardown?

Yes. Saving the workflow is the highest priority. There might be some other resources that would also make sense to provide auxiliary information.

And then there are also the NnfDataMovement resources, which could be a lot of information.

jameshcorbett commented 1 month ago

> > @grondo, any thoughts? If we were just to dump a workflow once, I think it might be something like 4KB total.
>
> Seems like it could just go into the job KVS. Then the information should be available via `flux job info JOBID KEY`.

Sounds great. What's the guidance on sizes of objects allowed in the KVS?

@bdevcich I merged a PR just now that dumps the nnfdatamovement resources to the journal, so we'll at least have that.

garlick commented 1 month ago

> Sounds great. What's the guidance on sizes of objects allowed in the KVS?

Go light if you can in the system instance. I would have no concerns about the 4K proposed above. If it starts to look like 100K, maybe take a step back and ask if you need all that. This basically just piles up in the sqlite db on the management node in /var/lib/flux.
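As a rough illustration of that guidance, the sketch below measures a workflow dump as compact JSON and sorts it into three bands. The function name, the band labels, and the exact cutoffs (4 KiB and 100 KiB) are assumptions drawn from the numbers mentioned in this thread, not limits enforced by Flux.

```python
import json

# Assumed cutoffs taken from the figures discussed above, not from Flux itself.
COMFORTABLE_BYTES = 4 * 1024    # roughly the "4K" single-workflow dump
RECONSIDER_BYTES = 100 * 1024   # "starts to look like 100K": step back


def classify_payload(obj):
    """Return (size_in_bytes, verdict) for obj serialized as compact JSON."""
    size = len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))
    if size <= COMFORTABLE_BYTES:
        verdict = "ok"
    elif size < RECONSIDER_BYTES:
        verdict = "watch"
    else:
        verdict = "reconsider"
    return size, verdict
```

A check like this could run just before the dump is written, so that an unexpectedly large resource is logged to the journal instead of being stored.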