job-info: WAITCREATE option on eventlogs

flux-framework / flux-core

core services for the Flux resource management framework

GNU Lesser General Public License v3.0


job-info: WAITCREATE option on eventlogs #2346

Open chu11 opened 5 years ago

chu11 commented 5 years ago

As discussed in #2332, it can be annoying if users have to check if an eventlog exists before using the job info guest eventlog watcher on it. Supporting WAITCREATE, or a similar mechanism, could be convenient.

chu11 commented 5 years ago

As I begin working on getting wreck parity in #2241, I realize this is going to be a requirement. The job-info module will smartly watch an eventlog in the guest KVS namespace, but there is no guarantee that an eventlog, such as "guest.output", has been created by the time the watch occurs.

grondo commented 5 years ago

Dumb question: Is there a case where you wouldn't want to watch an eventlog with the equivalent of the WAITCREATE flag?

chu11 commented 5 years ago

Hmmm. Well, when you want to watch the main eventlog, you probably don't want WAITCREATE. If the main eventlog doesn't exist, that should be an immediate error. So I'm mostly thinking there are scenarios where the user knows that an eventlog should exist by some point, and it's an error if it doesn't?

This came up specifically b/c of the "guest.output" path. When the "start" event in the main eventlog occurs, we know the guest KVS namespace has been created. But there is a small racy part where we don't know exactly when the shell will create the "output" path.

chu11 commented 5 years ago

What is perhaps interesting is that we probably do not want a generic WAITCREATE flag, but perhaps a WAITCREATE under certain circumstances. For example, WAITCREATE makes sense for an eventlog in the guest namespace, but not in the main eventlog (i.e. the job has ended and the namespace has been moved into the main namespace).
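A minimal sketch of what "WAITCREATE under certain circumstances" could look like as a policy check. Everything here is hypothetical: `use_waitcreate`, its parameters, and the `guest.` prefix test are illustrative assumptions, not job-info code.

```python
def use_waitcreate(path, guest_namespace_active):
    """Decide whether a watch on `path` should wait for creation.

    Hypothetical policy helper (not part of flux-core):
    - main eventlog paths never wait; a missing log is an immediate error
    - guest paths wait only while the guest namespace still exists,
      since after the job ends the data lives in the main namespace
    """
    is_guest_path = path.startswith("guest.")
    return is_guest_path and guest_namespace_active


# The main eventlog never waits; guest.output waits only for a live guest namespace.
decisions = {
    "main": use_waitcreate("eventlog", True),
    "guest, running": use_waitcreate("guest.output", True),
    "guest, job ended": use_waitcreate("guest.output", False),
}
```

The point of making the decision a function of both the path and the job's state is that a single generic flag cannot express "wait here, but fail fast there".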

chu11 commented 5 years ago

Oh crap, there's a huge race condition here that I did not think about (took me all afternoon to hunt this down in failed tests).

WAITCREATE works on missing entries AND missing namespaces. Up to this point, we've worked with WAITCREATE and namespaces whose destruction we (generally speaking) controlled.

But that's not the case with the job-info module. It's watching a guest namespace whose creation and destruction another entity controls. What if the guest namespace has already been destroyed before the WAITCREATE is issued? The ENOTSUP from "not yet created" will lead to WAITCREATE waiting for a namespace that will never get created.
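To make the race concrete, here is a toy model in Python. It is not the real KVS: `ToyKVS` and `waitcreate_decision` are hypothetical stand-ins, and the only behavior taken from the discussion above is that a missing namespace yields ENOTSUP whether it was never created or already destroyed.

```python
import errno


class ToyKVS:
    """Hypothetical in-memory stand-in for a KVS with namespaces."""

    def __init__(self):
        self.namespaces = {}  # namespace name -> {key: value}

    def create_namespace(self, name):
        self.namespaces[name] = {}

    def remove_namespace(self, name):
        del self.namespaces[name]

    def lookup(self, ns, key):
        if ns not in self.namespaces:
            # Same error for "not yet created" and "already destroyed".
            raise OSError(errno.ENOTSUP, "namespace not found")
        if key not in self.namespaces[ns]:
            raise OSError(errno.ENOENT, "key not found")
        return self.namespaces[ns][key]


def waitcreate_decision(kvs, ns, key):
    """Return 'ready' if the key exists, else 'wait' (WAITCREATE-style)."""
    try:
        kvs.lookup(ns, key)
        return "ready"
    except OSError as e:
        if e.errno in (errno.ENOTSUP, errno.ENOENT):
            return "wait"
        raise


kvs = ToyKVS()
before = waitcreate_decision(kvs, "job-1", "output")  # namespace not created yet
kvs.create_namespace("job-1")
kvs.remove_namespace("job-1")
after = waitcreate_decision(kvs, "job-1", "output")   # namespace already destroyed
```

In this model `before` and `after` are both `"wait"`. The first wait will eventually be satisfied; the second never will, and the watcher has no way to tell the two apart from the error alone.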

This is related to #2355/#2356 that I created earlier today.

I'm going to have to think about this. Perhaps the solution is not "WAITCREATE", but something else. Or perhaps there should be a WAITCREATE for namespaces vs entries. I dunno, it's going to take time to think this through.

chu11 commented 5 years ago

Brainstorming with @garlick, several alternate solutions were thought up, such as "in parallel" monitoring of the main eventlog or job state transitions. But the most promising idea was to actually output an event to the exec.eventlog indicating that the guest.output directory is available.

The side effect is that flux job attach will have to watch two eventlogs now.
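A rough sketch of the two-eventlog flow, as a toy model rather than the actual flux job attach code. The event name `output-ready`, the `attach` function, and the `open_output` callback are all hypothetical; the issue does not say what the announcing event is called.

```python
def attach(exec_events, open_output, ready_event="output-ready"):
    """Watch the exec eventlog first, then read the output eventlog.

    exec_events: iterable of event names from the exec eventlog
    open_output: callable returning the output eventlog entries
                 (only called once the ready event has been seen)
    ready_event: hypothetical name of the event announcing that
                 guest.output exists
    Returns the list of output entries, or None if the exec eventlog
    ended without ever announcing the output.
    """
    for event in exec_events:
        if event == ready_event:
            # Safe to open now: the producer has told us the log exists.
            return list(open_output())
    # Exec eventlog ended without the announcement: nothing to wait for.
    return None


got = attach(["init", "output-ready", "done"], lambda: ["hello", "world"])
missing = attach(["init", "done"], lambda: ["never read"])
```

Because the output log is opened only after the announcement, no WAITCREATE is needed on it, and a job that ends without producing output gives a definite answer (`None` here) instead of a watcher that waits forever.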

chu11 commented 5 years ago

For the purposes of reading from guest.output safely, this is no longer needed in flux job attach. However, I will leave the issue open b/c it could be a "nice to have" in the future.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had activity for 365 days. It will be closed if no further activity occurs within 14 days. Thank you for your contributions.