dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

The new DQM GUI file management #10287

Open andrius-k opened 3 years ago

andrius-k commented 3 years ago

Impact of the new feature
This request affects all systems that are responsible for uploading harvested DQM data to the DQM GUIs. This includes T0-processed DQM data and RelVal/reprocessing DQM data.

Is your feature request related to a problem? Please describe.
We're deploying a new, upgraded version of the DQM GUI tool. The procedure that notifies the DQM GUI about new DQM files is different in the new version. We would like this new procedure to be used alongside the old, visDQMUpload-based DQM file upload.

Describe the solution you'd like
Currently, the DQM data is uploaded to the DQM GUIs using a tool called visDQMUpload. The new procedure requires this process to be split into two stages:

If required, we (DQM) could provide a facade that has exactly the same interface as visDQMUpload. In that case, we would only ask you to call the facade script (visDQMUpload_new) alongside the old one.

Describe alternatives you've considered
No viable, future-proof alternatives were found.

Additional context
Below is a diagram that represents the current Offline DQM file movement:

current-dqm-diagram

Below is a diagram that represents the desired Offline DQM file movement, after the changes mentioned in this request:

new-dqm-diagram2

khurtado commented 2 years ago

@rvenditti If you could get the mapping mentioned by @amaltaro (also pasted below) then no additional discussion is necessary. We can modify the Pull Request accordingly to take this into account and avoid creating a new workflow/spec parameter.

{"https://cmsweb.cern.ch/dqm/offline": "https://cmsweb.cern.ch/dqm/offline-new/api/v1/register",
 "https://cmsweb.cern.ch/dqm/relval": "https://cmsweb.cern.ch/dqm/relval-new/api/v1/register",
 "https://cmsweb.cern.ch/dqm/dev": "https://cmsweb.cern.ch/dqm/dev-new/api/v1/register"}
khurtado commented 2 years ago

@rvenditti @jfernan2 Looking at the JIRA ticket, we are close to getting the permanent EOS storage. Is there any news regarding the cmsweb mapping?

rvenditti commented 2 years ago

Hi, just to summarize the situation (for future reference): we have already created the following URLs in the cmsweb production clusters for use with the NEW DQM GUIs:

https://cmsweb.cern.ch/dqm/offline-new/ (offline)
https://cmsweb.cern.ch/dqm/relval-new/ (relval)

These links do not actually point to any machine yet.

If the "new" part of this mapping is just a placeholder that will not be used at all for the time being (until we are ready), we can assume that the “new” URLs are the ones above.

So in summary, the mapping is as you proposed:

{"https://cmsweb.cern.ch/dqm/offline": "https://cmsweb.cern.ch/dqm/offline-new/api/v1/register",
 "https://cmsweb.cern.ch/dqm/relval": "https://cmsweb.cern.ch/dqm/relval-new/api/v1/register",
 "https://cmsweb.cern.ch/dqm/dev": "https://cmsweb.cern.ch/dqm/dev-new/api/v1/register"}
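
For illustration only, the mapping above could be applied as a simple lookup from the existing upload URL to the new registration URL. This is a minimal sketch, not the code in the WMCore pull request: the names `DQM_REGISTER_MAP` and `register_url_for` are hypothetical, and tolerating a trailing slash is an assumption about how upload URLs may be configured.

```python
from typing import Optional

# Mapping quoted in this thread: old DQM GUI upload URL -> new GUI registration URL.
# The constant name is hypothetical.
DQM_REGISTER_MAP = {
    "https://cmsweb.cern.ch/dqm/offline": "https://cmsweb.cern.ch/dqm/offline-new/api/v1/register",
    "https://cmsweb.cern.ch/dqm/relval": "https://cmsweb.cern.ch/dqm/relval-new/api/v1/register",
    "https://cmsweb.cern.ch/dqm/dev": "https://cmsweb.cern.ch/dqm/dev-new/api/v1/register",
}


def register_url_for(upload_url: str) -> Optional[str]:
    """Hypothetical helper: return the registration URL for an upload URL, or None."""
    # Assumption: a configured upload URL may carry a trailing slash, so strip it first.
    return DQM_REGISTER_MAP.get(upload_url.rstrip("/"))
```

Returning `None` for an unknown upload URL lets the caller skip the new registration step instead of guessing an endpoint.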

khurtado commented 2 years ago

@rvenditti Awesome! Thank you.

khurtado commented 2 years ago

@rvenditti I tested the new mapping, but I got an error while trying to register the files.

This link in particular: https://cmsweb.cern.ch/dqm/offline-new/api/v1/register

shows expired SSL certs

Secure Connection Failed

An error occurred during a connection to cmsweb.cern.ch. SSL peer rejected your certificate as expired.

Error code: SSL_ERROR_EXPIRED_CERT_ALERT

Could that be the issue? I thought it could be my certs, but if I use the browser for example, I can see: https://cmsweb-testbed.cern.ch/dqm/offline-test-new/ but not: https://cmsweb.cern.ch/dqm/offline-new/

@amaltaro: FYI

micsucmed commented 2 years ago

Hi @khurtado, the problem with the mapping you're trying to use is that there is no host behind this link; it's just an empty link reserved for when the new GUI moves to production.

> Could that be the issue? I thought it could be my certs, but if I use the browser for example, I can see: https://cmsweb-testbed.cern.ch/dqm/offline-test-new/ but not: https://cmsweb.cern.ch/dqm/offline-new/

Right now the new GUI is being migrated from the cmsweb testbed VM to the cmsweb testbed Kubernetes cluster, but the deployment is very recent and some bugs are still present. Once the deployment in the cmsweb testbed Kubernetes cluster is successful and stable, deployment in the cmsweb production Kubernetes cluster will start. Then "https://cmsweb.cern.ch/dqm/offline-new/" will be available, but for now only "https://cmsweb-testbed.cern.ch/dqm/offline-test-new/" is available.

amaltaro commented 2 years ago

Hi everyone, I am trying to understand where we stand with these developments and if I understand the messages above correctly, we do not have any service running on some of those new urls. Is that correct? Is there an ETA to have such services up & running?

I am afraid we cannot proceed with these developments until we have all the dependency machinery in place. Otherwise WMAgents will try to reach those backends and will fail the whole job, failing both the old and the new DQM mechanisms. It could take from a few days to a few weeks to have it deployed in production, but to be on the safe side, we cannot merge it unless it has been fully tested and there are DQM services listening on the new URLs.

Please let us know if there is anything missing here; and/or if there is anything that we can help you with to move this forward. Thanks

khurtado commented 2 years ago

@micsucmed @rvenditti @jfernan2: Just pinging about what Alan asked last week. Is there any news or a time estimate on when the new URL mappings with services will be fully available/operational? We can't move forward with this until then.

micsucmed commented 2 years ago

Hi @amaltaro @khurtado, giving an ETA for the new GUI to be running at the new URLs is difficult: at the moment the deployment in the Kubernetes testbed cluster is still ongoing, and we are waiting on the Cloud Infrastructure group to solve a problem with EOS mounting for the application within the cluster (Ticket: https://cern.service-now.com/service-portal?id=ticket&table=u_request_fulfillment&n=RQF2037818). Nonetheless, I'll keep you updated on the process and let you know when the problem has been solved, so that a more accurate ETA for the services running at the new URLs can be given.

khurtado commented 2 years ago

@micsucmed Understood. Thank you for the update on this!

khurtado commented 2 years ago

Hi @micsucmed , we had a Workflow Management meeting today and we were wondering if there was any progress on this, e.g.: on the cloud infrastructure + EOS issue (CERN ticket just shows as empty to me)

micsucmed commented 2 years ago

Hi @khurtado, we haven't got any response from the assignee to the EOS ticket in a while (it might appear empty as it is a private ticket but I can add you to the watchers list if you like), so it's very difficult to say when this problem will be solved. It may be best if we continue with this using the testbed endpoint for the offline GUI ( https://cmsweb-testbed.cern.ch/dqm/offline-test-new/ ) and once the EOS problem is solved we change to the production endpoints.

khurtado commented 2 years ago

Hi @micsucmed. Just pinging to see if there is any change in the overall status of this.

khurtado commented 2 years ago

@micsucmed @jfernan2 Since I can't read this ticket https://cern.service-now.com/service-portal?id=ticket&table=u_request_fulfillment&n=RQF2037818

Could you please let me know who is in charge of solving this ticket on the CERN side (if it hasn't been solved already)? Are there any other major issues besides the EOS issue preventing this from moving forward?

rvenditti commented 2 years ago

AFAWK, the main problem is solved and we are now finalizing some access issues, but @micsucmed can correct me and add a timescale for this.

What we really need before the data taking start-up is the automation of the rootfiles transfer on EOS (that was a part of the request), given that it is presently done by hand and we have other services that read from there.

So I would put this as top priority. Of course, we can still do the upload by hand, but given the amount of incoming files, this could become a nightmare for us. For the upload of the rootfiles to the new GUI, instead, we can survive like this for a couple of months (we have the old GUI, which is working). Now I see here: https://github.com/dmwm/WMCore/pull/11015 that the transfer to EOS is still failing due to the mapping. Is it possible to decouple the two parts of the problem (i.e. the transfer to EOS and the rootfile upload to the GUI)?
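
To make the decoupling question concrete, here is a minimal sketch of running the two parts as independent steps, so that a failure in one is recorded without preventing the other from running. It is only an illustration under stated assumptions: `run_independent_steps`, `eos_transfer` and `gui_registration` are hypothetical names, the step bodies are stand-ins, and this is not how WMCore or the pull request above implements the job.

```python
from typing import Callable, Dict, Optional


def run_independent_steps(steps: Dict[str, Callable[[], None]]) -> Dict[str, Optional[str]]:
    """Run every step; map step name -> None on success, or the error message on failure."""
    results: Dict[str, Optional[str]] = {}
    for name, step in steps.items():
        try:
            step()
            results[name] = None
        except Exception as exc:  # a failing step must not stop the remaining ones
            results[name] = str(exc)
    return results


# Stand-in steps (hypothetical): the EOS transfer succeeds, the GUI registration fails.
def eos_transfer() -> None:
    pass


def gui_registration() -> None:
    raise RuntimeError("registration endpoint unavailable")


results = run_independent_steps({
    "eos_transfer": eos_transfer,
    "gui_registration": gui_registration,
})
```

Whether a job should count as successful when only one of the steps fails remains a policy decision for the WMCore and DQM teams; the sketch only shows that the outcomes can be tracked separately.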

khurtado commented 2 years ago

@rvenditti Thank you for the update!
Regarding the EOS transfer in #11015, this is working.

Here is a status summary from the WMCore side of things:

So, basically we are just waiting for the registration mapping to work. And yes please, a time estimate of when this would be done (the host services in the new cmsweb urls) would be great.

micsucmed commented 2 years ago

Hi @khurtado I would expect to finish the testbed deployment sometime this week, so I would say that at the end of next week the production endpoints will be available.

khurtado commented 2 years ago

Hi @micsucmed. Is there an update on this? I still see, for example, this service link as unavailable:

https://cmsweb.cern.ch/dqm/offline-new/api/v1/register

khurtado commented 2 years ago

@micsucmed Just pinging about this issue.

micsucmed commented 2 years ago

@khurtado sorry for the delay. The EOS issue was fixed yesterday. It had already been fixed a couple of weeks ago, but an update to the Kubernetes cluster reintroduced it. I will deploy the testbed again and prepare for the production deployment. I expect to have the production deployment ready for you to continue at the end of this week or early next week. Again, sorry for the delays and my late response.

khurtado commented 2 years ago

Hi @micsucmed. Thank you! Sorry I took 2 weeks to reply back, but I hope the deployment plans are going well. Please let me know once the services have been deployed so we can re-test this.

khurtado commented 2 years ago

Hi @micsucmed. Any news on this?

khurtado commented 1 year ago

@micsucmed @rvenditti Pinging about this again.

rvenditti commented 1 year ago

Hi @khurtado, I understood from @micsucmed that the update of the offline GUI to K8s is done, but there are some problems on the cmsweb side. I don't have the details, but I think that problem can be solved on a time scale of some days. @micsucmed can you confirm?

micsucmed commented 1 year ago

Hi @khurtado, as @rvenditti says, there is an issue with the frontend rules provided by cmsweb that are needed for the production URL ( https://cmsweb.cern.ch/dqm/offline-new/ ) to reach the deployed pod in K8s. The issue is being worked on by the cmsweb team. I would like to think it's a simple issue that should be solved soon; nonetheless, as I am not the one with the access needed to solve it, I am not sure if this will be the case.

khurtado commented 1 year ago

@rvenditti @micsucmed Thank you for the update. Is there a ticket or GH issue to track this from the cmsweb team side?

rvenditti commented 1 year ago

Here it is: https://its.cern.ch/jira/browse/CMSKUBERNETES-145

anpicci commented 9 months ago

@rvenditti how should we (WMCore) proceed on this? Are we supposed to continue working on this, considering that the new DQM GUI is already in place? @khurtado are there any other tests to be done before merging and deploying?