cms-PdmV / cmsPdmV

CERN CMS McM repository
4 stars 10 forks source link

revise part of inspect_submitted #1147

Closed vlimant closed 1 month ago

vlimant commented 1 month ago

within inspect_submitted

https://github.com/cms-PdmV/cmsPdmV/blob/018739d475bc33509f0b64be877445f4fb103d14/mcm/json_layer/request.py#L2287

                # in case it keeps any output and produced 0 events
                # means something is wrong in production
                if (self.get_attribute('completed_events') <= 0
                        and self.get_attribute("keep_output").count(True) > 0):

                    not_good.update({
                        'message': '%s completed but with no statistics. stats DB lag. saving the request anyway.' % (
                                wma_r['content']['pdmv_dataset_name'])})

there are cases where the output could have 0 events, and no issue in production. The check should be modified to

if zero stats AND not valid => says something is wrong.

https://cms-pdmv-prod.web.cern.ch/mcm/requests?prepid=BPH-Run3Summer22EENanoAODv12-00122 would then toggle done

lmoureaux commented 1 month ago

there are cases where the output could have 0 events, and no issue in production

Could you provide an example? I find it strange that a dataset with 0 events can be considered a successful MC production.

vlimant commented 1 month ago

https://cms-pdmv-prod.web.cern.ch/mcm/requests?prepid=BPH-Run3Summer22EENanoAODv12-00122 would then toggle done

is the example. For whichever reason the nano has failed, as long as it exists, valid, the request can toggle forward.

that does not prevent one to reset and resubmit post-facto

vlimant commented 1 month ago

this bypass could be added under the restapi/requests/inspect/<prepid>/force BTW

lmoureaux commented 1 month ago

I do not think a dataset with 0 events can be considered done. If we send them to submit-done, we will forget about them until someone notices and pings us. This is a hack, not a proper solution.

This should have been flagged as an error by CompOps in the first place (why is it not?). It would also be a good case for an error state in our state machine, but as I already wrote here I am reluctant to touching this as the McM code is generally brittle.

Isn't the proper solution for this request to invalidate, reset, and resubmit? We could loosen the requirements for this action.

vlimant commented 1 month ago

ops does not release stuff with no stats without notifications. in this case, there must have been a communication about issues, and the decision was to "close like this" ; this being other datasets with stats, but the nano with 0.

putting it under the /force might be a way to avoid this to go by too often without being noticed.

lmoureaux commented 1 month ago

Ok... So we can have partially completed wfs with the same status a fully completed ones. That's new to me. Yikes.

Empty datasets are a terrible message to the rest of the collaboration. They should really be INVALID. Similarly, the failure status of the requests should somehow be recorded (and not just in their history).

Therefore I think we need to do something more than overriding the request status, at least invalidate the dataset so people do not see it on DAS. Since we don't have a request failure state, we could hack it in by adding some label.

Once we agree on the required actions, we can implement something with these semantics. To avoid duplicated work, it would be best to have a single endpoint acting for both Ops and McM. (Since there seems to be one on the Ops side, we should probably detect the condition and update our status accordingly.)

vlimant commented 1 month ago

for now, we toggle to done. further handling of datasets with no event is another story and will go with https://gitlab.cern.ch/cms-ppd/dataset-management/dm-coordination/-/issues/20 likely and other ways the information comes back from ops automatically without human handshaking.

lmoureaux commented 1 month ago

What is the rationale for keeping the datasets VALID and not keeping track of affected requests?

If there is a plan for relevant information to come automatically in the short term, we can just patch the database once this is implemented and there is no need for a new endpoint.

vlimant commented 1 month ago

mcm is not there to keep track of affected requests per say (hence no need to additional "end point" ; this feed back from ops on unrenable configuration should end up and does end up as a CMSSW GH issue and fix. mcm requests managers can add a note in the "notes" field about the reason for lower stats than expected (could be part of https://gitlab.cern.ch/cms-ppd/dataset-management/dm-coordination/-/issues/20). for now, these "dead" requests simply need to move forward to "done"