Closed: vlimant closed this issue 1 month ago
there are cases where the output could have 0 events with no issue in production
Could you provide an example? I find it strange that a dataset with 0 events can be considered a successful MC production.
https://cms-pdmv-prod.web.cern.ch/mcm/requests?prepid=BPH-Run3Summer22EENanoAODv12-00122 is the example; it would then toggle to done. For whatever reason the nano has failed, but as long as the dataset exists and is VALID, the request can toggle forward.
that does not prevent one from resetting and resubmitting post facto
this bypass could be added under the restapi/requests/inspect/<prepid>/force
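A minimal sketch of what such a bypass could look like, assuming a boolean `force` flag reaching the inspection logic; the function signature and the `output_datasets`/`events`/`status` fields are illustrative, not McM's actual data model:

```python
# Hypothetical sketch: gate the zero-event bypass behind the 'force'
# variant of the inspect endpoint. Field names are invented for
# illustration and do not reflect McM's real schema.

def inspect_submitted(request_doc, force=False):
    """Return the next status for a submitted request."""
    empty_outputs = [d for d in request_doc['output_datasets']
                     if d['events'] == 0]
    if empty_outputs and not force:
        # Normal inspection: an empty output dataset blocks the transition.
        return 'submitted'
    # Either all outputs have events, or the operator explicitly
    # forced the request forward despite an empty output.
    return 'done'

# The BPH nano case: one VALID output dataset with 0 events.
req = {'output_datasets': [{'events': 0, 'status': 'VALID'}]}
print(inspect_submitted(req))              # stays 'submitted'
print(inspect_submitted(req, force=True))  # toggles to 'done'
```

Requiring the explicit `/force` call keeps the zero-event case visible to a human instead of silently toggling during routine inspection.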
BTW, I do not think a dataset with 0 events can be considered done. If we send them to submit-done, we will forget about them until someone notices and pings us. This is a hack, not a proper solution.
This should have been flagged as an error by CompOps in the first place (why was it not?). It would also be a good case for an error state in our state machine, but as I already wrote here, I am reluctant to touch this since the McM code is generally brittle.
Isn't the proper solution for this request to invalidate, reset, and resubmit? We could loosen the requirements for this action.
Ops does not release anything with no stats without notification. In this case there must have been communication about the issues, and the decision was to "close like this": the other datasets have stats, but the nano has 0.
Putting it under /force might be a way to keep this from happening too often without being noticed.
Ok... So we can have partially completed wfs with the same status as fully completed ones. That's new to me. Yikes.
Empty datasets are a terrible message to the rest of the collaboration. They should really be INVALID. Similarly, the failure status of the requests should somehow be recorded (and not just in their history).
Therefore I think we need to do something more than overriding the request status, at least invalidate the dataset so people do not see it on DAS. Since we don't have a request failure state, we could hack it in by adding some label.
Once we agree on the required actions, we can implement something with these semantics. To avoid duplicated work, it would be best to have a single endpoint acting for both Ops and McM. (Since there seems to be one on the Ops side, we should probably detect the condition and update our status accordingly.)
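As a rough sketch of the invalidate-plus-label idea discussed above (all field names are hypothetical, since McM has no failure state to reuse):

```python
# Hypothetical sketch of recording the failure without a dedicated state:
# invalidate the empty output dataset and tag the request so it stays
# findable. The 'tags' field and 'ZERO_EVENTS' label are invented for
# illustration, not part of McM's actual schema.

def record_empty_output(request_doc, dataset_name):
    """Mark an empty output dataset INVALID and label the request."""
    for ds in request_doc['output_datasets']:
        if ds['name'] == dataset_name and ds['events'] == 0:
            ds['status'] = 'INVALID'   # so people do not see it on DAS
    tags = request_doc.setdefault('tags', [])
    if 'ZERO_EVENTS' not in tags:
        tags.append('ZERO_EVENTS')     # searchable failure marker
    return request_doc
```

This keeps the "request failed" information queryable without adding a new state to the (brittle) state machine.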
For now, we toggle to done. Further handling of datasets with no events is another story; it will likely go with https://gitlab.cern.ch/cms-ppd/dataset-management/dm-coordination/-/issues/20 and other ways the information comes back from Ops automatically without human handshaking.
What is the rationale for keeping the datasets VALID and not keeping track of affected requests?
If there is a plan for relevant information to come automatically in the short term, we can just patch the database once this is implemented and there is no need for a new endpoint.
McM is not there to keep track of affected requests per se (hence no need for an additional endpoint); this feedback from Ops on unrunnable configurations should end up, and does end up, as a CMSSW GitHub issue and fix. McM request managers can add a note in the "notes" field about the reason for lower stats than expected (this could be part of https://gitlab.cern.ch/cms-ppd/dataset-management/dm-coordination/-/issues/20). For now, these "dead" requests simply need to move forward to "done".
Within inspect_submitted (https://github.com/cms-PdmV/cmsPdmV/blob/018739d475bc33509f0b64be877445f4fb103d14/mcm/json_layer/request.py#L2287) there are cases where the output could have 0 events with no issue in production. The check should be modified to:
if zero stats AND not VALID => flag that something is wrong.
https://cms-pdmv-prod.web.cern.ch/mcm/requests?prepid=BPH-Run3Summer22EENanoAODv12-00122 would then toggle to done.
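The proposed condition can be sketched as follows; the dataset fields (`events`, `status`) are illustrative stand-ins for whatever the real stats lookup returns:

```python
# Hypothetical sketch of the proposed check: zero events alone is not
# an error; zero events on a dataset that is not VALID is.

def zero_stats_is_error(dataset):
    """True if this output dataset should block the toggle to 'done'."""
    return dataset['events'] == 0 and dataset['status'] != 'VALID'

print(zero_stats_is_error({'events': 0, 'status': 'VALID'}))    # False: OK to toggle
print(zero_stats_is_error({'events': 0, 'status': 'INVALID'}))  # True: something is wrong
print(zero_stats_is_error({'events': 100, 'status': 'VALID'}))  # False
```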