bdevcich opened 1 week ago
We determined that this is most likely a flux issue and there is no race condition here.
What is happening is that flux is picking computes that are not attached to the same rabbit that hosts the persistent filesystem.
So if a persistent gfs2 filesystem is created on rabbit-0 and a compute attached to rabbit-1 then tries to use it, the filesystem cannot be mounted on that compute.
This does not appear to be an issue with lustre, since it can be mounted from anywhere.
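To make the locality constraint concrete, here is a minimal sketch. The compute-to-rabbit mapping and the helper are illustrative only (not the real flux or NNF API); the point is that gfs2 is local to the rabbit that created it, while lustre is reachable from any compute.

```python
# Illustrative compute -> rabbit attachment map (hypothetical names).
COMPUTE_TO_RABBIT = {
    "compute-01": "rabbit-0",
    "compute-02": "rabbit-0",
    "compute-03": "rabbit-1",
}

def can_mount(fs_type: str, owning_rabbit: str, compute: str) -> bool:
    """gfs2 is local to the rabbit that owns it; lustre is network-wide."""
    if fs_type == "lustre":
        return True  # lustre can be mounted from anywhere
    return COMPUTE_TO_RABBIT.get(compute) == owning_rabbit

# A compute attached to rabbit-1 cannot mount a gfs2 filesystem on rabbit-0.
assert can_mount("gfs2", "rabbit-0", "compute-01")
assert not can_mount("gfs2", "rabbit-0", "compute-03")
assert can_mount("lustre", "rabbit-0", "compute-03")
```

This is why flux picking computes from a different rabbit breaks the persistent gfs2 case but not the lustre case.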
With system-test, where workflows run in parallel (`J>1`), it can hit a case where the persistent usage tests race with the destroy case. What happens is that the workflow using the persistent storage can't finish PreRun because the destroy workflow beats it. Then both workflows are stuck until the usage workflow is removed. So we end up with:
The `NnfAccess` for the usage workflow says:

There's no ClientMount yet.
The destroy workflow says:
Could the destroy workflow check the directivebreakdowns for any use of the directive name before it can leave Proposal? That way, as long as any usage workflow is out of Proposal, there should be a directive breakdown that contains the persistent name:
Then the destroy workflow can't get out of Proposal until there are no directivebreakdowns left that contain that persistent name.
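A rough sketch of what that check might look like, assuming a DirectiveBreakdown whose spec carries the original `#DW` directive string (the resource shape and directive text here are illustrative, not the exact API):

```python
# Hypothetical sketch: hold the destroy workflow in Proposal while any
# DirectiveBreakdown still references the persistent storage name.

def persistent_name_in_use(name: str, directive_breakdowns: list) -> bool:
    """True if any DirectiveBreakdown's directive string mentions the name."""
    return any(
        f"name={name}" in bd.get("spec", {}).get("directive", "")
        for bd in directive_breakdowns
    )

def destroy_can_leave_proposal(name: str, directive_breakdowns: list) -> bool:
    # Destroy stays in Proposal until no breakdowns reference the name.
    return not persistent_name_in_use(name, directive_breakdowns)

# While a usage workflow's breakdown still names the storage, destroy waits.
breakdowns = [{"spec": {"directive": "#DW persistentdw name=my-gfs2"}}]
assert not destroy_can_leave_proposal("my-gfs2", breakdowns)
# Once the usage workflow (and its breakdown) is gone, destroy may proceed.
assert destroy_can_leave_proposal("my-gfs2", [])
```

The key property is that the breakdown exists for the whole life of the usage workflow once it leaves Proposal, so the destroy can never slip in between.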