Open chu11 opened 5 years ago
As I begin working on getting wreck parity in #2241, I realize this is going to be a requirement. The job-info module will smartly watch an eventlog in the guest KVS namespace, but there is no guarantee that an eventlog, such as "guest.output", has been created by the time the watch occurs.
Dumb question: Is there a case where you wouldn't want to watch an eventlog with the equivalent of the WAITCREATE flag?
Hmmm. Well, when you want to watch the main eventlog, you probably don't want WAITCREATE. If the main eventlog doesn't exist, that should be an immediate error. So I'm mostly thinking there are scenarios where the user knows that an eventlog should exist by some point, and it's an error if it doesn't.
This came up specifically b/c of the "guest.output" path. When the "start" event in the main eventlog occurs, we know the guest KVS namespace has been created. But there is a small racy part where we don't know exactly when the shell will create the "output" path.
What is perhaps interesting is that we probably do not want a generic WAITCREATE flag, but perhaps a WAITCREATE under certain circumstances. For example, WAITCREATE makes sense for an eventlog in the guest namespace, but not in the main eventlog (i.e. the job has ended and the namespace has been moved into the main namespace).
Oh crap, there's a huge race condition here that I did not think about (took me all afternoon to hunt this down in failed tests).
WAITCREATE works on missing entries AND missing namespaces. Up to this point, we've used WAITCREATE with namespaces whose destruction we (generally speaking) controlled.
But that's not the case with the job-info module. It's watching a guest namespace that another entity creates and destroys. What if the guest namespace has already been destroyed before the WAITCREATE is issued? The ENOTSUP returned for the missing namespace will be treated as "not yet created", so WAITCREATE will wait for a namespace that will never get created.
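To make the race concrete, here is a minimal sketch using a toy in-memory store. It is not the Flux KVS API: `ToyKVS`, `lookup_waitcreate`, and the namespace and key names are all hypothetical, and a timeout stands in for what would otherwise be an indefinite wait. The point it illustrates is that a waiter cannot tell "namespace not yet created" apart from "namespace already destroyed".

```python
import threading


class ToyKVS:
    """Toy in-memory model of namespaces. Hypothetical, not the Flux KVS API."""

    def __init__(self):
        self._cond = threading.Condition()
        self._namespaces = {}  # namespace name -> {key: value}

    def create_namespace(self, ns):
        with self._cond:
            self._namespaces[ns] = {}
            self._cond.notify_all()

    def remove_namespace(self, ns):
        with self._cond:
            self._namespaces.pop(ns, None)
            self._cond.notify_all()

    def put(self, ns, key, value):
        with self._cond:
            self._namespaces[ns][key] = value
            self._cond.notify_all()

    def lookup_waitcreate(self, ns, key, timeout):
        """Block until ns/key exists, treating a missing namespace as
        "not yet created". Raises TimeoutError after `timeout` seconds."""

        def present():
            return key in self._namespaces.get(ns, {})

        with self._cond:
            if not self._cond.wait_for(present, timeout=timeout):
                raise TimeoutError(f"{ns}/{key} never appeared")
            return self._namespaces[ns][key]


kvs = ToyKVS()

# Case 1: the key shows up shortly after the watch starts, so waiting works.
kvs.create_namespace("job-1-guest")
threading.Timer(0.05, kvs.put, args=("job-1-guest", "output", "hello")).start()
early = kvs.lookup_waitcreate("job-1-guest", "output", timeout=2.0)

# Case 2: the namespace was created and destroyed before the watch started.
# The waiter sees only "missing" and waits for something that never comes.
kvs.create_namespace("job-2-guest")
kvs.remove_namespace("job-2-guest")
try:
    kvs.lookup_waitcreate("job-2-guest", "output", timeout=0.2)
    late = "found"
except TimeoutError:
    late = "hung until timeout"
```

In the second case nothing in the store records that the namespace ever existed, which is why the waiter has no basis for giving up early.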
This is related to #2355/#2356 that I created earlier today.
I'm going to have to think about this. Perhaps the solution is not "WAITCREATE", but something else. Or perhaps there should be a WAITCREATE for namespaces vs entries. I dunno, it's going to take time to think this through.
Brainstorming with @garlick, we came up with several alternate solutions, such as monitoring the main eventlog or job state transitions "in parallel". But the most promising idea was to actually output an event to the exec.eventlog indicating that the guest.output directory is available.
The side effect is that flux job attach will have to watch two eventlogs now.
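A rough sketch of the two-eventlog idea, using plain Python lists in place of real eventlog watches. The event names `output-ready` and `done`, and the `attach` and `open_output` helpers, are hypothetical placeholders for illustration and are not taken from flux-core. The output eventlog is opened only after the first eventlog announces that it exists, so nothing ever waits on a path that may never be created.

```python
READY_EVENT = "output-ready"  # hypothetical name for the readiness event
END_EVENT = "done"            # hypothetical name for the final exec event


def attach(exec_eventlog, open_output):
    """Follow the exec eventlog until output is announced, then read it.

    exec_eventlog: iterable of event dicts, each with a "name" key.
    open_output:   callable returning an iterable of output entries; it is
                   called only after the readiness event has been seen.
    Returns the list of output entries, or None if the exec eventlog ended
    without ever announcing output.
    """
    for event in exec_eventlog:
        if event["name"] == READY_EVENT:
            return list(open_output())
        if event["name"] == END_EVENT:
            break
    return None


opened = []


def open_output():
    # Stand-in for watching the second (output) eventlog.
    opened.append("opened")
    return [{"name": "data", "context": {"data": "hello\n"}}]


# Normal job: readiness is announced, so the output eventlog is read.
normal = attach(
    [{"name": "init"}, {"name": READY_EVENT}, {"name": END_EVENT}],
    open_output,
)

# Job that ends before output is ever created: return instead of hanging.
ended = attach([{"name": "init"}, {"name": END_EVENT}], open_output)
```

The tradeoff is the one noted above: the watcher has to follow two eventlogs, but a missing output path becomes a clean "no output" result rather than an indefinite wait.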
For the purposes of reading from guest.output safely, this is no longer needed in flux job attach. However, I will leave the issue open b/c it could be a "nice to have" in the future.
As discussed in #2332, it can be annoying if users have to check if an eventlog exists before using the job info guest eventlog watcher on it. Supporting WAITCREATE, or a similar mechanism, could be convenient.