dmwm / WMCore

Core workflow management components for CMS.

The new DQM GUI file management #10287

Open andrius-k opened 3 years ago

andrius-k commented 3 years ago

Impact of the new feature: This request affects all systems responsible for uploading harvested DQM data to the DQM GUIs. This includes T0-processed DQM data and RelVal/reprocessing DQM data.

Is your feature request related to a problem? Please describe. We're deploying a new, upgraded version of the DQM GUI tool. The procedure that notifies the DQM GUI about new DQM files is different in the new version. We would like this new procedure to be used alongside the old, visDQMUpload-based DQM file upload.

Describe the solution you'd like: Currently, the DQM data is uploaded to the DQM GUIs using a tool called visDQMUpload. The new procedure requires this process to be split into two stages:

If required, a facade could be provided by us (DQM) that would have exactly the same interface as visDQMUpload. In such case, we would only like you to call the facade script (visDQMUpload_new) alongside the old one.

Describe alternatives you've considered: No viable, future-proof alternatives were found.

Additional context: Below is a diagram that represents the current Offline DQM file movement:

current-dqm-diagram

Below is a diagram that represents the desired Offline DQM file movement, after the changes mentioned in this request:

new-dqm-diagram2

jfernan2 commented 3 years ago

For the record: link to the visDQMUpload tool: https://github.com/cms-sw/cmssw/blob/ba6e8604a35283e39e89bc031766843d0afc3240/DQMServices/FileIO/scripts/visDQMUpload.py

jfernan2 commented 3 years ago

For the record (2): the POST should use the following API:

https://cmsweb-testbed.cern.ch/dqm/offline-test-new/api/v1/register

HTTP request body:

[{"dataset": "/a/b/c", "run": "123456", "lumi": "0", "file": "/eos/cms/store/group/comm_dqm/DQMGUI_data/location/file.root", "fileformat": 1}]

For more information about this API endpoint (and others), please refer to:

https://github.com/cms-DQM/dqmgui#new-file-registering-endpoint
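The request described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names, the `Content-Type` header and the use of a grid proxy PEM file as client certificate are hypothetical, and CA trust for the server is assumed to be configured on the host. Only the endpoint URL and the body fields come from this thread.

```python
import json
import ssl
import urllib.request

# Endpoint quoted in this thread (testbed instance).
REGISTER_URL = "https://cmsweb-testbed.cern.ch/dqm/offline-test-new/api/v1/register"


def build_register_body(dataset, run, lumi, path, fileformat=1):
    """Return the JSON body: a list with one entry per file to register."""
    entry = {
        "dataset": dataset,
        "run": str(run),
        "lumi": str(lumi),
        "file": path,
        "fileformat": fileformat,
    }
    return json.dumps([entry]).encode("utf-8")


def register_file(body, proxy_path, url=REGISTER_URL):
    """POST the body, authenticating with an X.509 proxy.

    Assumption: proxy_path is a PEM file holding both certificate and key.
    """
    context = ssl.create_default_context()
    context.load_cert_chain(proxy_path)
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )
    with urllib.request.urlopen(request, context=context) as response:
        return response.status
```

Calling `build_register_body("/a/b/c", 123456, 0, "/eos/cms/store/group/comm_dqm/DQMGUI_data/location/file.root")` produces a body with the same fields as the example above; `register_file` is left uncalled here because it needs a valid proxy and network access.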

amaltaro commented 3 years ago

Hi @andrius-k , I'm very sorry for missing this GH issue.

Has this new DQM Gui server been deployed already?

Your proposal looks feasible to me, and it will make the DQMHarvesting process easier and more robust in the long run. We are going to discuss this issue in the coming weeks and come back to you. Thanks

jfernan2 commented 3 years ago

Hi @amaltaro, unfortunately Andrius left CMS, so we are taking the baton. The new GUI reading the new backend in eos has been working since January at https://cmsweb-testbed.cern.ch/dqm/offline-test-new/ and reads the following (temporary?) eos folder: /eos/cms/store/group/comm_dqm/DQMGUI_data Thanks

khurtado commented 2 years ago

Hi @jfernan2 and @andrius-k , I'm working on this and have a question. Basically, the new replacement for visDQMUpload now requires more input parameters, is that right? Before, it needed only the file location. Now, it needs:

Legacy DQM TDirectory based ROOT files (1)
DQMIO TTree based ROOT files (2)
Protobuf based format used in Online live mode (3)

Is there an easy way to tell which one is the right one for a specific root file? Do you have an example of how this information is gotten/used? Also, from what I understand we want to call both the new method and the old visDQMUpload method at this moment, correct?

jfernan2 commented 2 years ago

Hi @khurtado, thanks for looking into this. For the moment you can take lumi=0, since this reproduces the current per-run ROOT files. In the future we might have to upload single ROOT files per LS, in which case the lumi will be declared somewhere. As for the file format, you can assume 1, since those are the files uploaded to the GUI. Thank you

jfernan2 commented 2 years ago

All DQM ROOT files produced by Harvest processing are type 1 (plain ROOT); type 2 files are DQMIO datasets in DAS, which are not uploaded to the GUI. The name of the method could now be changed to visDQMRegister, since it no longer uploads files to any server but instead copies them to eos and registers them in the DB. Thanks https://github.com/cms-DQM/dqmgui
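The format codes and defaults discussed so far could be captured in a few constants. This is a sketch only: the enum and constant names are hypothetical, while the numeric codes and the lumi=0 convention are the ones given in this thread.

```python
from enum import IntEnum


class DQMFileFormat(IntEnum):
    """File format codes as listed in this thread."""
    LEGACY_TDIRECTORY = 1  # legacy DQM TDirectory-based ROOT files
    DQMIO_TTREE = 2        # DQMIO TTree-based ROOT files
    PROTOBUF_ONLINE = 3    # Protobuf-based format used in Online live mode


# Harvesting output is the legacy (type 1) format, per the comment above.
HARVESTING_FILE_FORMAT = DQMFileFormat.LEGACY_TDIRECTORY

# Per-run files are registered with lumi 0 for now.
PER_RUN_LUMI = 0
```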

khurtado commented 2 years ago

@jfernan2 Thank you! That helps a lot. One more question from the diagrams. Right now we have:

  1. visDQMUpload: Which uploads ROOT files to vocms0738,39,31

and we want:

  2. Upload DQM ROOT files to EOS
  3. Make this new HTTP POST, which does not upload files but only registers new files

Do we want to do 1, 2 and 3 and eventually get rid of 1, but not right now? Is 3 dependent on 2? (E.g., if 1 worked but the upload of DQM ROOT files to EOS failed for some reason, do we abort step 3?) I'm asking mainly because we are trying to split this work into 2 pieces, since there are 2 stages.

jfernan2 commented 2 years ago

Since we want to keep the old (legacy) DQM GUI (visDQMUpload) working in parallel until we finish commissioning the new DQM GUI (visDQMRegister), I would vote for decoupling both processes as much as possible.

When we are sure that the new DQM GUI, and the FW/WMCore workflow chain that makes it work, is fully accepted by the Collaboration, we could start decommissioning the old visDQMUpload machinery.
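One way to picture the decoupling is a driver that runs the legacy chain and the new chain independently, so that a failure in one never aborts the other. This is a hypothetical sketch: the three callables stand in for whatever performs the legacy upload, the EOS copy and the registration, and the choice to skip registration when the EOS copy fails is an assumption of this sketch, not something settled in this thread.

```python
import logging

logger = logging.getLogger("dqm_upload_sketch")


def run_isolated(name, step, *args):
    """Run one step; log and swallow its failure so other chains continue."""
    try:
        step(*args)
        return True
    except Exception:
        logger.exception("%s failed", name)
        return False


def process_dqm_file(path, legacy_upload, copy_to_eos, register):
    """Run the legacy and the new chain independently for one ROOT file."""
    results = {"legacy": run_isolated("legacy upload", legacy_upload, path)}
    # Assumption: registering only makes sense once the file is on EOS.
    copied = run_isolated("EOS copy", copy_to_eos, path)
    results["eos_copy"] = copied
    results["register"] = copied and run_isolated("register", register, path)
    return results
```

The returned dictionary records the outcome of each step, which makes it easy to report the two chains separately while both are being commissioned.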

On the visDQMRegister side:

Thanks

khurtado commented 2 years ago

@jfernan2 : I'm testing some changes with the following workflow test: https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=wmagent_DQMHarvesting_LumiMask_khurtado_dqmup_v2_220308_174119_8053

For EOS: It's basically using the WMAgent certificate and trying to write to:

/eos/cms/store/group/comm_dqm/DQMGUI_data/wmagent_DQMHarvesting_LumiMask_khurtado_dqmup_v2_220308_174119_8053/output/DQM_V0001_R000278175__NoBPTX__Run2016F-23Sep2016-v1__DQMIO_0002.root

But it cannot create the parent directory. Would the above be the expected path to write to, though? Just double checking. I'm not sure if there is any VOMS DN mapping that needs to be done in order to write to /eos/cms/store/group/comm_dqm/DQMGUI_data. If so, please let me know. The cert used was:

 openssl x509 -in /data/certs/myproxy.pem -noout  -subject
subject= /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=amaltaro/CN=718748/CN=Alan Malta Rodrigues/CN=2073988181/CN=172652124/CN=1063249136/CN=379961315

Log:

2022-03-09 06:38:14,677:INFO:DQMUpload:Writing DQM root files to CERN EOS with retries: 3 and retry pause: 300
2022-03-09 06:38:14,677:INFO:StageOutMgr:==>Working on file: /wmagent_DQMHarvesting_LumiMask_khurtado_dqmup_v2_220308_174119_8053/output/DQM_V0001_R000278175__NoBPTX__Run2016F-23Sep2016-v1__DQMIO_0002.root
2022-03-09 06:38:14,677:INFO:StageOutMgr:===> Attempting 1 Fallback Stage Outs
2022-03-09 06:38:14,677:INFO:StageOutImpl:Creating output directory...
2022-03-09 06:38:14,681:INFO:StageOutImpl:Running the stage out...
2022-03-09 06:38:15,818:INFO:StageOutImpl:Command exited with status: 151
Output message: stdout: Local File Size is: 49421926
Remote File Size is:
ERROR: Size Mismatch between local and SE

stderr: Run: [ERROR] Server responded with an error: [3010] Unable to create parent directory /eos/cms/store/group/comm_dqm/DQMGUI_data/wmagent_DQMHarvesting_LumiMask_khurtado_dqmup_v2_220308_174119_8053/; Operation not permitted^@ (destination)

[ERROR] Server responded with an error: [3011] Unable to stat /eos/cms/store/group/comm_dqm/DQMGUI_data/wmagent_DQMHarvesting_LumiMask_khurtado_dqmup_v2_220308_174119_8053/output/DQM_V0001_R000278175__NoBPTX__Run2016F-23Sep2016-v1__DQMIO_0002.root; No such file or directory^@

[ERROR] Server responded with an error: [3011] Unable to remove /eos/cms/store/group/comm_dqm/DQMGUI_data/wmagent_DQMHarvesting_LumiMask_khurtado_dqmup_v2_220308_174119_8053/output/DQM_V0001_R000278175__NoBPTX__Run2016F-23Sep2016-v1__DQMIO_0002.root; No such file or directory^@
jfernan2 commented 2 years ago

Thanks @khurtado

This is one of the key points of this request: that eos space was granted to the DQM group in the past as a temporary space for the project[1], but it has a quota (at present) of 66TB, of which 43.56TB are being used. So this space may not be a definitive solution for the future, once you set this workflow running in view of Run3.

Having said that, for your testing purposes, I believe you should add yourself (or Alan Malta, if you are using his grid certificate) to the following e-group, which controls the write access[2], according to [3]. Or I can add you if you prefer; I am just not sure whether I should add you or Alan.

However, for the long term, WMCore or the computing team will need to provide another eos space to host all the DQM GUI ROOT files, or a larger quota for this same space.

Thank you very much

[1] https://twiki.cern.ch/twiki/bin/viewauth/CMS/T2CHCERNEosTeams
[2] https://e-groups.cern.ch/e-groups/Egroup.do?egroupName=cms-eos-PPD-DQM&tab=3
[3] eos root://eosproject.cern.ch attr ls /eos/cms/store/group/comm_dqm/
sys.accounting.vos.0="cms"
sys.acl="u:22014:rw,u:31275:rw,u:5410:rw,g:1399:!d,egroup:cms-eos-ppd-dqm:rw!d,egroup:cms-eos-ppd-dqm-cleaners:rw+d"
sys.forced.blockchecksum="crc32c"
sys.forced.blocksize="4k"
sys.forced.checksum="adler"
sys.forced.layout="replica"
sys.forced.nstripes="2"
sys.forced.space="default"
sys.recycle="/eos/cms/proc/recycle/"
user.acl=""
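To see at a glance which e-groups appear in the sys.acl value from [3], the string can be split into its rules. This is a small sketch with hypothetical helper names; matching a flag by substring is a simplification and does not implement the full EOS ACL semantics.

```python
def parse_eos_acl(acl):
    """Split an EOS sys.acl string into (kind, name, flags) tuples."""
    entries = []
    for rule in acl.split(","):
        kind, name, flags = rule.strip().split(":", 2)
        entries.append((kind, name, flags))
    return entries


def egroups_with_flag(acl, flag):
    """Return e-group names whose flag string contains the given flag."""
    return [name for kind, name, flags in parse_eos_acl(acl)
            if kind == "egroup" and flag in flags]


# ACL value copied from the attr ls output in [3].
SYS_ACL = ("u:22014:rw,u:31275:rw,u:5410:rw,g:1399:!d,"
           "egroup:cms-eos-ppd-dqm:rw!d,egroup:cms-eos-ppd-dqm-cleaners:rw+d")

print(egroups_with_flag(SYS_ACL, "w"))
```

With the value above, both e-groups carry a `w` flag, so membership in cms-eos-ppd-dqm is what the ACL ties write access to for e-group members.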

khurtado commented 2 years ago

@jfernan2 Thank you! It would be @amaltaro, to keep consistency with the certs used in the test agent. He will be requesting access to it.

khurtado commented 2 years ago

Hi @jfernan2 ,

So, Alan is now part of the e-group, but I still can't copy to the area. Is there anything else missing? I tried this interactively:

[cmst1@vocms0192 ~]$ xrdcp test.txt root://eoscms.cern.ch//eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt
[0B/0B][100%][==================================================][0B/s]
Run: [ERROR] Server responded with an error: [3010] Unable to open file /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt; Operation not permitted (destination)

[cmst1@vocms0192 ~]$ voms-proxy-info -all
subject   : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=amaltaro/CN=718748/CN=Alan Malta Rodrigues/CN=2073988181/CN=172652124/CN=2101403252/CN=334491060
issuer    : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=amaltaro/CN=718748/CN=Alan Malta Rodrigues/CN=2073988181/CN=172652124/CN=2101403252
identity  : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=amaltaro/CN=718748/CN=Alan Malta Rodrigues
type      : RFC3820 compliant impersonation proxy
strength  : 2048
path      : /data/certs/myproxy.pem
timeleft  : 163:56:40
key usage : Digital Signature, Key Encipherment
=== VO cms extension information ===
VO        : cms
subject   : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=amaltaro/CN=718748/CN=Alan Malta Rodrigues
issuer    : /DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch
attribute : /cms/Role=production/Capability=NULL
attribute : /cms/Role=NULL/Capability=NULL
attribute : /cms/uscms/Role=NULL/Capability=NULL
timeleft  : 163:56:41
uri       : voms2.cern.ch:15002
jfernan2 commented 2 years ago

Hi @khurtado, that is very strange, and honestly it escapes my knowledge of the system... Could you please try interactively xrdcp -v -d 9 test.txt root://eoscms.cern.ch//eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt ? I'd like to see the debug messages, to look for a possible mismatch between Alan's username and the one associated with his grid certificate. For us (DQM) the xrdcp command works fine, and so does a direct cp to eos, which should also work for Alan's account (the one linked to his email registered in the e-group).

Could you try a direct copy from lxplus using Alan's account or your account after registering to the e-group?

Are you running xrdcp from lxplus or somewhere else? In my case, from lxplus and with my account linked to the grid certificate, xrdcp -d 3 gives:

[2022-03-15 17:37:22.053623 +0100][Debug ][XRootDTransport ] [eoscms.cern.ch:1094.0] Sending out kXR_login request, username: jfernan, cgi: ?xrd.cc=ch&xrd.tz=1&xrd.appname=xrdcp&xrd.info=&xrd.hostname=lxplus789.cern.ch&xrd.rn=v5.4.1, dual-stack: true, private IPv4: false, private IPv6: false
[2022-03-15 17:37:22.057027 +0100][Dump ][XRootD ] [eoscms.cern.ch:1094] Got a kXR_ok response to request kXR_stat (path: /eos/cms/store/group/comm_dqm/DQMGUI_data/text.txt

Thanks

khurtado commented 2 years ago

@jfernan2 This is from lxplus (I had to change 9 to 3, since xrdcp complains that's the max debug level): It first tries krb5, then moves to gsi:

env -i X509_USER_PROXY=$PWD/myproxy.pem xrdcp -v -d 3 test.txt root://eoscms.cern.ch//eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt
[2022-03-15 18:01:53.440219 +0100][Debug  ][XRootDTransport   ] [eoscms.cern.ch:1094.0] Logged in, session: 8889fd03deae0200a7420000ca4c5304
[2022-03-15 18:01:53.440224 +0100][Debug  ][XRootDTransport   ] [eoscms.cern.ch:1094.0] Authentication is required: &P=krb5,xrootd/eoscms.cern.ch@CERN.CH&P=gsi,v:10400,c:ssl,ca:5168735f.0|4339b4bc.0&P=sss,0.13:/etc/eos.keytab&P=unix
[2022-03-15 18:01:53.440238 +0100][Debug  ][XRootDTransport   ] [eoscms.cern.ch:1094.0] Sending authentication data
[2022-03-15 18:01:53.442406 +0100][Debug  ][XRootDTransport   ] [eoscms.cern.ch:1094.0] Trying to authenticate using krb5
[2022-03-15 18:01:53.442878 +0100][Debug  ][XRootDTransport   ] [eoscms.cern.ch:1094.0] Cannot get credentials for protocol krb5: Seckrb5: No or invalid credentials; No credentials cache found (p=xrootd/eoscms.cern.ch@CERN.CH).
[2022-03-15 18:01:53.445015 +0100][Debug  ][XRootDTransport   ] [eoscms.cern.ch:1094.0] Trying to authenticate using gsi
[2022-03-15 18:01:53.869336 +0100][Dump   ][AsyncSock         ] [eoscms.cern.ch:1094.0] Wrote a message:  (0x840084d0), 136 bytes
[2022-03-15 18:01:53.928655 +0100][Dump   ][XRootDTransport   ] [msg: 0x84007c80] Expecting 4385 bytes of message body
[2022-03-15 18:01:53.928701 +0100][Dump   ][AsyncSock         ] [eoscms.cern.ch:1094.0] Received message header, size: 8
[2022-03-15 18:01:53.928715 +0100][Dump   ][AsyncSock         ] [eoscms.cern.ch:1094.0] Received a message of 4393 bytes
[2022-03-15 18:01:53.928725 +0100][Debug  ][XRootDTransport   ] [eoscms.cern.ch:1094.0] Sending more authentication data for gsi
[2022-03-15 18:01:53.933586 +0100][Dump   ][AsyncSock         ] [eoscms.cern.ch:1094.0] Wrote a message:  (0x840b6780), 15180 bytes
[2022-03-15 18:01:53.943879 +0100][Dump   ][XRootDTransport   ] [msg: 0x84002190] Expecting 0 bytes of message body
[2022-03-15 18:01:53.943929 +0100][Dump   ][AsyncSock         ] [eoscms.cern.ch:1094.0] Received message header, size: 8
[2022-03-15 18:01:53.943936 +0100][Dump   ][AsyncSock         ] [eoscms.cern.ch:1094.0] Received a message of 8 bytes
[2022-03-15 18:01:53.944078 +0100][Debug  ][XRootDTransport   ] [eoscms.cern.ch:1094.0] Authenticated with gsi.
[2022-03-15 18:01:53.944105 +0100][Debug  ][PostMaster        ] [eoscms.cern.ch:1094] Stream 0 connected.
[2022-03-15 18:01:53.944120 +0100][Debug  ][Utility           ] Monitor library name not set. No monitoring
[2022-03-15 18:01:53.944164 +0100][Dump   ][AsyncSock         ] [eoscms.cern.ch:1094.0] Wrote a message: kXR_stat (path: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt, flags: none) (0x2225650), 74 bytes
[2022-03-15 18:01:53.944206 +0100][Dump   ][AsyncSock         ] [eoscms.cern.ch:1094.0] Successfully sent message: kXR_stat (path: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt, flags: none) (0x2225650).
[2022-03-15 18:01:53.944222 +0100][Dump   ][XRootD            ] [eoscms.cern.ch:1094] Message kXR_stat (path: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt, flags: none) has been successfully sent.
[2022-03-15 18:01:53.944228 +0100][Debug  ][ExDbgMsg          ] [eoscms.cern.ch:1094] Moving MsgHandler: 0x222dcb0 (message: kXR_stat (path: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt, flags: none) ) from out-queu to in-queue.
[2022-03-15 18:01:53.944241 +0100][Dump   ][PostMaster        ] [eoscms.cern.ch:1094.0] All messages consumed, disable uplink

However, I get the permission denied:

[2022-03-15 17:59:35.423151 +0100][Dump   ][Utility           ] Path:      /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt
[2022-03-15 17:59:35.423191 +0100][Debug  ][File              ] [0x1737bc0@root://eoscms.cern.ch:1094//eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt?oss.asize=10&xrdcl.requuid=da4053e1-383e-45e6-8f2c-19731aa88a1f] Sending an open command
[2022-03-15 17:59:35.423213 +0100][Dump   ][XRootD            ] [eoscms.cern.ch:1094] Sending message kXR_open (file: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt?oss.asize=10, mode: 0644, flags: kXR_new kXR_open_updt kXR_async kXR_retstat )
[2022-03-15 17:59:35.423231 +0100][Debug  ][ExDbgMsg          ] [eoscms.cern.ch:1094] MsgHandler created: 0x1738c10 (message: kXR_open (file: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt?oss.asize=10, mode: 0644, flags: kXR_new kXR_open_updt kXR_async kXR_retstat ) ).
[2022-03-15 17:59:35.423243 +0100][Dump   ][PostMaster        ] [eoscms.cern.ch:1094] Sending message kXR_open (file: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt?oss.asize=10, mode: 0644, flags: kXR_new kXR_open_updt kXR_async kXR_retstat ) (0x1738570) through substream 0 expecting answer at 0
[2022-03-15 17:59:35.423283 +0100][Dump   ][AsyncSock         ] [eoscms.cern.ch:1094.0] Wrote a message: kXR_open (file: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt?oss.asize=10, mode: 0644, flags: kXR_new kXR_open_updt kXR_async kXR_retstat ) (0x1738570), 87 bytes
[2022-03-15 17:59:35.423308 +0100][Dump   ][AsyncSock         ] [eoscms.cern.ch:1094.0] Successfully sent message: kXR_open (file: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt?oss.asize=10, mode: 0644, flags: kXR_new kXR_open_updt kXR_async kXR_retstat ) (0x1738570).
[2022-03-15 17:59:35.423316 +0100][Dump   ][XRootD            ] [eoscms.cern.ch:1094] Message kXR_open (file: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt?oss.asize=10, mode: 0644, flags: kXR_new kXR_open_updt kXR_async kXR_retstat ) has been successfully sent.
[2022-03-15 17:59:35.423322 +0100][Debug  ][ExDbgMsg          ] [eoscms.cern.ch:1094] Moving MsgHandler: 0x1738c10 (message: kXR_open (file: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt?oss.asize=10, mode: 0644, flags: kXR_new kXR_open_updt kXR_async kXR_retstat ) ) from out-queu to in-queue.
[2022-03-15 17:59:35.423328 +0100][Dump   ][PostMaster        ] [eoscms.cern.ch:1094.0] All messages consumed, disable uplink
[2022-03-15 17:59:35.424149 +0100][Dump   ][XRootDTransport   ] [msg: 0xdc000ac8] Expecting 100 bytes of message body
[2022-03-15 17:59:35.424171 +0100][Dump   ][AsyncSock         ] [eoscms.cern.ch:1094.0] Received message header for 0xdc000ac8 size: 8
[2022-03-15 17:59:35.424185 +0100][Dump   ][AsyncSock         ] [eoscms.cern.ch:1094.0] Received message 0xdc000ac8 of 108 bytes
[2022-03-15 17:59:35.424191 +0100][Dump   ][PostMaster        ] [eoscms.cern.ch:1094] Handling received message: 0xdc000ac8.
[2022-03-15 17:59:35.424254 +0100][Dump   ][XRootD            ] [eoscms.cern.ch:1094] Got a kXR_error response to request kXR_open (file: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt?oss.asize=10, mode: 0644, flags: kXR_new kXR_open_updt kXR_async kXR_retstat ) [3010] Unable to open file /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt; Operation not permitted
[2022-03-15 17:59:35.424284 +0100][Debug  ][XRootD            ] [eoscms.cern.ch:1094] Handling error while processing kXR_open (file: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt?oss.asize=10, mode: 0644, flags: kXR_new kXR_open_updt kXR_async kXR_retstat ): [ERROR] Error response: permission denied.
[2022-03-15 17:59:35.424293 +0100][Debug  ][ExDbgMsg          ] [eoscms.cern.ch:1094] Calling MsgHandler: 0x1738c10 (message: kXR_open (file: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt?oss.asize=10, mode: 0644, flags: kXR_new kXR_open_updt kXR_async kXR_retstat ) ) with status: [ERROR] Error response: permission denied.
jfernan2 commented 2 years ago

Hi @khurtado In my case I get authenticated with krb5 successfully. In your log I see krb5 fails but then gsi makes it: [2022-03-15 18:01:53.944078 +0100][Debug ][XRootDTransport ] [eoscms.cern.ch:1094.0] Authenticated with gsi.
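When comparing debug logs like these, it can help to extract only the authentication steps. The following sketch scans xrdcp debug output for the two message patterns that appear in the log above; the function name is hypothetical, and other xrootd versions may word these messages differently.

```python
import re

# Patterns taken from the debug lines quoted in this thread.
TRY_RE = re.compile(r"Trying to authenticate using (\w+)")
OK_RE = re.compile(r"Authenticated with (\w+)")


def summarize_auth(log_text):
    """Return (protocols tried in order, protocol that succeeded or None)."""
    tried = TRY_RE.findall(log_text)
    match = OK_RE.search(log_text)
    return tried, (match.group(1) if match else None)
```

Applied to the log in the previous comment, this would report that krb5 and gsi were tried, in that order, and that gsi succeeded, which is the same observation made above.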

I have tried to do the same with my secondary lxplus account, which is not linked to my grid certificate, to decouple: I authenticate successfully with krb5 but then copy fails with the same message as you.

I suspect the issue is linked to the fact that you are using Alan's grid certificate from your lxplus account. Can you do a klist command?

From my (primary) successful account I get:

Valid starting       Expires              Service principal
03/15/2022 18:33:49  03/16/2022 19:33:49  krbtgt/CERN.CH@CERN.CH
        renew until 03/20/2022 18:33:49
03/15/2022 18:33:49  03/16/2022 19:33:49  afs/cern.ch@CERN.CH
        renew until 03/20/2022 18:33:49
03/15/2022 18:33:51  03/16/2022 19:33:49  xrootd/eoscms.cern.ch@CERN.CH
        renew until 03/20/2022 18:33:49
03/15/2022 18:33:56  03/16/2022 19:33:49  xrootd/eoshome.cern.ch@CERN.CH
        renew until 03/20/2022 18:33:49
03/15/2022 18:33:56  03/16/2022 19:33:49  xrootd/eosproject-i02.cern.ch@CERN.CH
        renew until 03/20/2022 18:33:49

From my (secondary) non-successful account I get:

Valid starting       Expires              Service principal
03/15/2022 18:28:48  03/16/2022 19:28:48  krbtgt/CERN.CH@CERN.CH
        renew until 03/20/2022 18:28:48
03/15/2022 18:28:48  03/16/2022 19:28:48  afs/cern.ch@CERN.CH
        renew until 03/20/2022 18:28:48
03/15/2022 18:30:15  03/16/2022 19:28:48  xrootd/eoscms.cern.ch@CERN.CH
        renew until 03/20/2022 18:28:48

I understand gsi authentication is ignored and we need krb5 against eoshome and eosproject too

khurtado commented 2 years ago

@jfernan2 Ah, I see. So the authentication is working only with kerberos. Which means it will probably work from Alan's account itself with his kerberos. I did kdestroy so that it wouldn't try to use my kerberos credentials and only use the GSI/proxy credential. eoshome and eosproject do need krb5, yes, but I thought eoscms could work with gsi.

I suspect the agents, which run jobs using the cmst1 account, won't have Alan's kerberos credentials (or any kerberos at all) in the condor jobs, only the proxy, so I was expecting xrdcp to authenticate using the grid certificate alone.

@amaltaro Do you know how we write to e.g.: root://eoscms.cern.ch//eos/cms/store/logs/prod/recent/TESTBED ? Is it done only with GSI authentication?

jfernan2 commented 2 years ago

That's even stranger, since I have now done a kdestroy from my secondary account and was able to xrdcp after gsi authentication ONLY :-S So gsi is indeed also working. Could it be that Alan's grid certificate is somehow not mapped to his CERN account, so that the e-group does not consider it?

khurtado commented 2 years ago

The certificate is giving read access only, for some reason. That means the mapping to the user is working, but the e-group membership is not being recognized to grant write permissions. It's hard to tell what is going on without knowing what the EOS configuration is. Is this something the Service Desk at CERN is supposed to help with?

@amaltaro: Is it okay if I create a directory, e.g.:

/eos/cms/store/logs/prod/recent/TESTBED/DQMGUI

For the tests, since we do have write access to that location? If so, I think we can just ignore the issues with the other path, since it's for temporary tests only.

jfernan2 commented 2 years ago

Hi @khurtado, I am not sure about your statement, since I was able to write using my secondary account through gsi authentication only, but it is true that CERN IT should be able to solve this. Thanks

khurtado commented 2 years ago

@jfernan2 Thank you! I will just create a new directory for the tests, following Alan's suggestion: /eos/cms/store/unmerged/DQMGUI

khurtado commented 2 years ago

@jfernan2 I'm getting closer here. I have one question regarding the API. In this format:

[{"dataset": "/a/b/c", "run": "123456", "lumi": "0", "file": "/eos/cms/store/group/comm_dqm/DQMGUI_data/location/file.root", "fileformat": 1}]

How do I pass a range of runs or lumis in the request? This is what I see from WMCore:

'runAndLumis': {278175: [[70, 90]]}}

So, if we have a DQMHarvest workflow set to multiRun harvesting, it can have multiple runs and lumis associated with each run.

For example: This template: https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_DQMHarvesting_MultiRun_HG2202_Val_220203_213001_614 Had a job with the following run and lumis:

'runAndLumis': {277981: [[1, 82], [84, 158]], 278017: [[1, 589]], 277932: [[1, 12], [14, 15], [17, 127]], 278193: [[1, 239]]}}

EDIT:

Ah, wait, would it be many dictionaries, 1 for each run and single lumi and the same filename per dictionary (and fileformat and datasetname)? E.g.:

[{run:"X", lumi:"1", file:"filename1"},{run:"X", lumi:"2", file:"filename1"},{run:"Y", lumi:"1",file:"filename1"}]
jfernan2 commented 2 years ago

@khurtado The current DQM GUI is only able to show DQM ROOT files per RUN. In the future it is expected to be able to handle per-LS files too.

This implies that a single ROOT file uploaded (now copied to eos and registered in the DB) can only contain a single run or a single LS (of a run). Hence, per-RUN ROOT files should be registered with lumi=0, since this reproduces the current per-run ROOT files. Once we are able to show per-LS data in the GUI, runs will be registered with a single LS, not several, since plots per LS must be displayed.

Please note that the same RUN (and LS) may be associated with several datasets, but with a single file each, as in: https://github.com/cms-DQM/dqmgui#api-documentation

Multi-run harvesting ROOT files are a special case of DQM ROOT files. In this case, since all stats are harvested into a single ROOT file and the GUI is not able to display the runs it contains (they are embedded in the dataset name or in the config that produced it), for the GUI the run number is always forced to 999999 for data (and to 1 for MC, as for any MC). See: https://github.com/dmwm/WMCore/pull/9746 and https://github.com/dmwm/WMCore/issues/9690

Bottom line: in principle you should not copy and register the same file for more than one RUN or LS.
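The rules above can be summarised in a short sketch: one registration entry per file, lumi 0 for per-run files, and a forced run number for multi-run harvesting and MC. The function and constant names are hypothetical; the values 999999, 1, lumi 0 and fileformat 1 are the ones stated in this thread.

```python
MULTIRUN_DATA_RUN = 999999  # run number forced for multi-run harvesting on data
MC_RUN = 1                  # run number used for MC


def gui_run_number(run, multi_run=False, is_mc=False):
    """Return the run number under which a file is registered in the GUI."""
    if is_mc:
        return MC_RUN
    if multi_run:
        return MULTIRUN_DATA_RUN
    return run


def register_entry(dataset, path, run, multi_run=False, is_mc=False):
    """One entry per file: a single run, lumi 0, legacy ROOT format (1)."""
    return {
        "dataset": dataset,
        "run": gui_run_number(run, multi_run, is_mc),
        "lumi": 0,
        "file": path,
        "fileformat": 1,
    }
```

A job covering several runs in single-run mode would therefore produce one entry per ROOT file, each with its own run number, rather than one file listed under several runs or lumis.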

@ahmad3213 @emanueleusai @rvenditti please consider to correct me at any point since you are the official DQM conveners, I am not DQM convener since 31st Dec 2021. Thanks

khurtado commented 2 years ago

@jfernan2 Thank you!

Here is one workflow test with the current changes: https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=wmagent_DQMHarvesting_LumiMask_khurtado_dqm_v10_220320_210531_8322

ROOT files were uploaded here (I'm using the unmerged area as a temporary path):

/eos/cms/store/unmerged/DQMGUI/wmagent_DQMHarvesting_LumiMask_khurtado_dqm_v10_220320_210531_8322

And the POST call to the register site looks like this:

2022-03-21 04:03:18,643:INFO:DQMUpload:HTTP Upload is about to start:
 => URL: https://cmsweb-testbed.cern.ch/dqm/offline-test-new/api/v1/register
 => Filename: /eos/cms/store/unmerged/DQMGUI/wmagent_DQMHarvesting_LumiMask_khurtado_dqm_v10_220320_210531_8322/output/DQM_V0001_R000277991__NoBPTX__Run2016F-23Sep2016-v1__DQMIO_0001.root

2022-03-21 04:03:18,643:INFO:DQMUpload:Using proxy file: /srv/myproxy.pem
2022-03-21 04:03:18,643:INFO:DQMUpload:Using CA certificate path: None
2022-03-21 04:03:18,643:INFO:DQMUpload:HTTP Register POST arguments: [{'file': '/eos/cms/store/unmerged/DQMGUI/wmagent_DQMHarvesting_LumiMask_khurtado_dqm_v10_220320_210531_8322/output/DQM_V0001_R000277991__NoBPTX__Run2016F-23Sep2016-v1__DQMIO_0001.root', 'dataset': '/NoBPTX/Run2016F-23Sep2016-v1/DQMIO', 'run': 277991, 'lumi': 0, 'fileformat': 1}]

2022-03-21 04:03:18,657:INFO:HTTPSAuthHandler:Found 149 default trusted CA certificates.
2022-03-21 04:03:18,657:INFO:HTTPSAuthHandler:SSL context manager created with the following settings:
2022-03-21 04:03:18,657:INFO:HTTPSAuthHandler:  check_hostname : True
2022-03-21 04:03:18,657:INFO:HTTPSAuthHandler:  options : Options.OP_ALL|OP_NO_SSLv3|OP_NO_SSLv2|OP_CIPHER_SERVER_PREFERENCE|OP_SINGLE_DH_USE|OP_SINGLE_ECDH_USE|OP_NO_COMPRESSION
2022-03-21 04:03:18,657:INFO:HTTPSAuthHandler:  protocol : _SSLMethod.PROTOCOL_TLS
2022-03-21 04:03:18,657:INFO:HTTPSAuthHandler:  verify_flags : VerifyFlags.VERIFY_X509_TRUSTED_FIRST
2022-03-21 04:03:18,657:INFO:HTTPSAuthHandler:  verify_mode : VerifyMode.CERT_REQUIRED
2022-03-21 04:03:18,886:INFO:DQMUpload:HTTP POST to register url finished succesfully with response:
  Status code: 201

I do see some DQM histograms associated with that Run and Dataset here: https://cmsweb-testbed.cern.ch/dqm/offline-test-new/?folder_path=DQM%2FTimerService&dataset_name=%2FNoBPTX%2FRun2016F-23Sep2016-v1%2FDQMIO&run_number=277991&workspaces=Everything&overlay=overlay&normalize=true&lumi=0

but it's unclear to me how to tell if they came from what I registered in this test or if they were there before. Anyway, does this look well to you?
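For cross-checking a registration against the file it refers to, the run number and dataset can be read back from the file name used in the log above. This is a sketch that assumes the naming pattern seen in this thread (DQM_V<version>_R<run>__<primary>__<processed>__<tier>, optionally followed by a numeric counter); the function name is hypothetical and this is not necessarily how the job itself obtains these values.

```python
import posixpath
import re

NAME_RE = re.compile(r"^DQM_V(?P<version>\d+)_R(?P<run>\d+)$")


def parse_dqm_filename(path):
    """Extract (run number, dataset) from a harvested DQM file name."""
    name = posixpath.basename(path)
    if not name.endswith(".root"):
        raise ValueError("not a ROOT file: %s" % path)
    # Expect exactly four "__"-separated parts; anything else raises ValueError.
    head, primary, processed, tier = name[:-len(".root")].split("__")
    match = NAME_RE.match(head)
    if match is None:
        raise ValueError("unexpected name prefix: %s" % head)
    # Drop a trailing numeric counter such as "_0001" if present.
    tier = re.sub(r"_\d+$", "", tier)
    return int(match.group("run")), "/%s/%s/%s" % (primary, processed, tier)
```

For the file registered above this gives run 277991 and dataset /NoBPTX/Run2016F-23Sep2016-v1/DQMIO, matching the POST arguments. For multi-run files the processed part of the file name carries a run-range suffix, so the dataset parsed this way would differ from the one registered.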

khurtado commented 2 years ago

@jfernan2 And this is for multiRun: https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=wmagent_DQMHarvesting_MultiRun_khurtado_dqm_v11_220321_111702_196

The run number will be 999999 for current multiRun workflows but should start showing 999999 or 1 accordingly for new workflows once #9746 is applied.

2022-03-21 11:18:57,248:INFO:DQMUpload:HTTP Upload is about to start:
 => URL: https://cmsweb-testbed.cern.ch/dqm/offline-test-new/api/v1/register
 => Filename: /eos/cms/store/unmerged/DQMGUI/wmagent_DQMHarvesting_MultiRun_khurtado_dqm_v11_220321_111702_196/output/DQM_V0001_R000999999__NoBPTX__Run2016F-23Sep2016-v1-277932-278193__DQMIO_0001.root

2022-03-21 11:18:57,248:INFO:DQMUpload:Using proxy file: /srv/myproxy.pem
2022-03-21 11:18:57,248:INFO:DQMUpload:Using CA certificate path: None
2022-03-21 11:18:57,248:INFO:DQMUpload:HTTP Register POST arguments: [{'file': '/eos/cms/store/unmerged/DQMGUI/wmagent_DQMHarvesting_MultiRun_khurtado_dqm_v11_220321_111702_196/output/DQM_V0001_R000999999__NoBPTX__Run2016F-23Sep2016-v1-277932-278193__DQMIO_0001.root', 'dataset': '/NoBPTX/Run2016F-23Sep2016-v1/DQMIO', 'run': 999999, 'lumi': 0, 'fileformat': 1}]

2022-03-21 11:18:57,273:INFO:HTTPSAuthHandler:Found 149 default trusted CA certificates.
2022-03-21 11:18:57,273:INFO:HTTPSAuthHandler:SSL context manager created with the following settings:
2022-03-21 11:18:57,273:INFO:HTTPSAuthHandler:  check_hostname : True
2022-03-21 11:18:57,274:INFO:HTTPSAuthHandler:  options : Options.OP_ALL|OP_NO_SSLv3|OP_NO_SSLv2|OP_CIPHER_SERVER_PREFERENCE|OP_SINGLE_DH_USE|OP_SINGLE_ECDH_USE|OP_NO_COMPRESSION
2022-03-21 11:18:57,274:INFO:HTTPSAuthHandler:  protocol : _SSLMethod.PROTOCOL_TLS
2022-03-21 11:18:57,274:INFO:HTTPSAuthHandler:  verify_flags : VerifyFlags.VERIFY_X509_TRUSTED_FIRST
2022-03-21 11:18:57,274:INFO:HTTPSAuthHandler:  verify_mode : VerifyMode.CERT_REQUIRED
2022-03-21 11:18:57,392:INFO:DQMUpload:HTTP POST to register url finished succesfully with response:
  Status code: 201
jfernan2 commented 2 years ago

Sorry @khurtado but I am not following you, at least not completely:

Here is one workflow test with the current changes: https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=wmagent_DQMHarvesting_LumiMask_khurtado_dqm_v10_220320_210531_8322

What do you mean by current changes? Changes in the script you are creating to accomplish this task?

https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=wmagent_DQMHarvesting_LumiMask_khurtado_dqm_v10_220320_210531_8322

This workflow is weird, CMSSW_8_0_20 ? And it has several runs in it, I understand there is a DQM root file per run

And the POST call to the register site looks like this:

The post looks OK, in principle

but it's unclear to me how to tell if they came from what I registered in this test or if they were there before. Anyway, does this look well to you?

They look OK, but bear in mind that since you did several tests, last upload/register is the one which will be displayed

@jfernan2 And this is for multiRun: https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=wmagent_DQMHarvesting_MultiRun_khurtado_dqm_v11_220321_111702_196 The run number will be 999999 for current multiRun workflows but should start showing 999999 or 1 accordingly for new workflows once #9746 is applied.

I am confused here: PR #9746 is almost two years old, not yet merged (plans?) and it keeps saying that run number = 1 will only be assigned to MC, but never to data. On the other hand, for data, run = 999999, the info about the harvested runs is lost. I would have expected to keep it in the dataset name somehow, otherwise another multirun harvesting (MRH) for the same dataset will overwrite this one.

For me MRH is a nasty task, in the sense that it may be subdetector dependent since the list of runs for a given dataset may vary from one DPG to another. It has been encouraged in the past that every DPG makes its own MRH for that reason. On the other hand, current DQM GUI will have problems displaying it since it was designed for single run uploads.

If you want to register brand new data which is not in the GUI yet, so that you can ensure that the plots you see come from your last registering, perhaps you could use current dataset: /Cosmics/Commissioning2022-PromptReco-v1/DQMIO

Thanks a lot

khurtado commented 2 years ago

@jfernan2

Sorry @khurtado but I am not following you, at least not completely:

Here is one workflow test with the current changes: https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=wmagent_DQMHarvesting_LumiMask_khurtado_dqm_v10_220320_210531_8322

What do you mean by current changes? Changes in the script you are creating to accomplish this task?

Yes, in this PR: https://github.com/dmwm/WMCore/pull/11015

https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=wmagent_DQMHarvesting_LumiMask_khurtado_dqm_v10_220320_210531_8322

This workflow is weird, CMSSW_8_0_20 ? And it has several runs in it, I understand there is a DQM root file per run

Yes, it has 1 DQM root file per run. This workflow is one of the WMCore ReqMgr templates we have for testing: https://github.com/dmwm/WMCore/blob/master/test/data/ReqMgr/requests/Integration/DQMHarvesting_MultiRun.json

And the POST call to the register site looks like this:

The post looks OK, in principle

but it's unclear to me how to tell if they came from what I registered in this test or if they were there before. Anyway, does this look well to you?

They look OK, but bear in mind that since you did several tests, last upload/register is the one which will be displayed

That sounds good, thanks!

@jfernan2 And this is for multiRun: https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=wmagent_DQMHarvesting_MultiRun_khurtado_dqm_v11_220321_111702_196 The run number will be 999999 for current multiRun workflows but should start showing 999999 or 1 accordingly for new workflows once #9746 is applied.

I am confused here: PR #9746 is almost two years old, not yet merged (plans?) and it keeps saying that run number = 1 will only be assigned to MC, but never to data. On the other hand, for data, run = 999999, the info about the harvested runs is lost. I would have expected to keep it in the dataset name somehow, otherwise another multirun harvesting (MRH) for the same dataset will overwrite this one.

For me MRH is a nasty task, in the sense that it may be subdetector dependent since the list of runs for a given dataset may vary from one DPG to another. It has been encouraged in the past that every DPG makes its own MRH for that reason. On the other hand, current DQM GUI will have problems displaying it since it was designed for single run uploads.

If you want to register brand new data which is not in the GUI yet, so that you can ensure that the plots you see come from your last registering, perhaps you could use current dataset: /Cosmics/Commissioning2022-PromptReco-v1/DQMIO

Thanks a lot

Ah, okay. So, right now this is using the 999999 thing which apparently would be useless. I can alternatively just NOT call the new DQM GUI register or copy the file to EOS at all if the harvesting job is multiRun, since it is not supported anyway (so, we would just keep the old visDQMGUI call). How does that sound?

jfernan2 commented 2 years ago

I can alternatively just NOT call the new DQM GUI register or copy the file to EOS at all if the harvesting job is multiRun, since it is not supported anyway (so, we would just keep the old visDQMGUI call). How does that sound?

How is current visDQMGUI treating MRH files?

khurtado commented 2 years ago

I can alternatively just NOT call the new DQM GUI register or copy the file to EOS at all if the harvesting job is multiRun, since it is not supported anyway (so, we would just keep the old visDQMGUI call). How does that sound?

How is current visDQMGUI treating MRH files?

From the WMCore side, the HTTP post for visDQMGUI only asks for the full path filename in the worker node. No info on Run numbers or lumis, so there is no distinction with ByRun mode files in that sense.

jfernan2 commented 2 years ago

Then it is basing its DB registration on the run number from the DQM root file name, just like the new GUI. No info about the run numbers contained in the dataset or file name
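As an illustration of what "run number from the DQM root file name" could mean, here is a minimal sketch. It assumes the usual `DQM_V<version>_R<9-digit run>__<A>__<B>__<C>.root` naming; the function name and the regular expression are hypothetical and not taken from either GUI's code.

```python
import re
from typing import Optional

# Assumed naming convention: DQM_V0001_R000123456__A__B__C.root
# (this pattern is an assumption, not copied from the GUI sources)
_DQM_NAME = re.compile(r"^DQM_V\d+_R(\d{9})__")


def run_from_dqm_filename(path: str) -> Optional[int]:
    """Return the run number encoded in a DQM root file name, or None."""
    name = path.rsplit("/", 1)[-1]  # drop any directory part
    match = _DQM_NAME.match(name)
    return int(match.group(1)) if match else None
```

With this convention a multirun file named with `R000999999` would be registered under run 999999, whatever runs it actually contains.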

khurtado commented 2 years ago

@jfernan2 Then, it seems MRH is not supported overall. Would you prefer to skip the upload to EOS and the registration for MRH, or to keep it in the 999999 (and 1 when using agents with #9746 merged) reporting mode? The first option makes more sense to me since the information is not useful for the reasons you explained; the second option sort of replicates the current behavior of visDQMUpload, which has been there for a long time.

I think any improvements to support MRH should be a different github issue.

jfernan2 commented 2 years ago

@khurtado I would keep the behaviour we have in the old GUI to be coherent, otherwise MRH files would not be accessible in the new GUI

On the other hand, I don't see why you keep claiming that MRH data will have runNumber=1 after #9746

RunNumber=1 is reserved for MC, 999999 for MRH data

See: https://github.com/dmwm/WMCore/pull/9746/files#diff-3c13cdc9485083bb43b4e4d3d37f7310b878d36bc137ce2a7cf8f08de4e9daf0R176-R181

Do you agree?

khurtado commented 2 years ago

@jfernan2 Ah, yes. What I meant is, without the PR, RunNumber is always 999999 in any case. The PR makes it either 999999 (Data harvesting) or 1 (MC). Is that correct?
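To make the two cases explicit, a small sketch of the rule as described in this comment. The function and constant names are hypothetical and this is not the code from #9746.

```python
MULTIRUN_DATA_RUN = 999999  # multirun harvesting of data
MULTIRUN_MC_RUN = 1         # reserved for MC once #9746 is applied


def multirun_run_number(is_mc: bool, pr_9746_applied: bool) -> int:
    """Run number reported for a multirun harvesting job (illustrative only)."""
    if pr_9746_applied and is_mc:
        return MULTIRUN_MC_RUN
    # Without the PR, every multirun job reports 999999, data or MC.
    return MULTIRUN_DATA_RUN
```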

jfernan2 commented 2 years ago

Just to be clear, please correct me if I am wrong: right now, before #9746 is merged,

I am not sure how many of these have been uploaded to the DQM GUI, I can only find one of those in the development GUI, none in the Offline GUI. This one: https://tinyurl.com/ycj7luc9

which has RunNumber forced to 999999 in the DQM search box, although there is a mismatch between this and the runNumber displayed in the menu of the DQM GUI (278017, the longest one in the range?), but the dataset name keeps the run range used in the harvesting: /NoBPTX/Run2016F-23Sep2016-v1-277932-278193/DQMIO

This would be the desired behaviour for MRH in the DQM GUI, so that a DQM user can trace back directly from the dataset name which runs (a range) it contains, even though the search is performed with run = 999999 in the DQM search.

I see several ALCAPROMPT datasets uploaded in this way into the Offline DQM GUI too, all of them with runNumber forced to 999999, but with a different dataset name and a different run displayed in the header of the GUI. E.g. /StreamExpress/Run2018A-PromptCalibProdSiStripGainsAAG-Express-v1-316702-316766/ALCAPROMPT https://tinyurl.com/yaz6vfyt So they can be distinguished by dataset name (run range) and even by the displayed run number (in the header of the GUI), even though all of them have 999999
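For reference, a rough sketch of how the run range could be read back from such dataset names. It assumes the range is always appended to the processed dataset part as `-<first>-<last>`, as in the two examples above; the helper name is hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumption: the run range is the trailing "-<first>-<last>" of the
# processed dataset name (the middle part of /primary/processed/tier).
_RUN_RANGE = re.compile(r"-(\d+)-(\d+)$")


def run_range_from_dataset(dataset: str) -> Optional[Tuple[int, int]]:
    """Return (first_run, last_run) if the dataset name carries a run range."""
    parts = dataset.strip("/").split("/")
    if len(parts) != 3:
        return None
    match = _RUN_RANGE.search(parts[1])
    return (int(match.group(1)), int(match.group(2))) if match else None
```

A regular single-run dataset such as /Cosmics/Commissioning2022-PromptReco-v1/DQMIO has no such suffix, so the helper returns None for it.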

@ahmad3213 @emanueleusai @rvenditti please speak up whether you agree or disagree

Thanks

[1] https://github.com/dmwm/WMCore/pull/9746/files#diff-3c13cdc9485083bb43b4e4d3d37f7310b878d36bc137ce2a7cf8f08de4e9daf0L181-R184

khurtado commented 2 years ago

@jfernan2 That is a good point and I think you should make that comment on #9746 itself. But looking at the comments, it seems that PR is not finished and won't be merged right now, since run-dependent MC really has run number > 1, according to the comments.

For #11015, which is the PR that would solve this particular issue, we won't lose the dataset name tweak, so whatever is in there for changing the dataset name will remain in place after merging this PR (note that #11015 still needs review, and the EOS path still needs to be changed to a permanent one, so it won't be merged yet)

rvenditti commented 2 years ago

Hi @jfernan2 thanks for following this. We (DQMDC conveners) totally agree with you. @khurtado could you please let us know what is missing to complete the automation process? Moreover, let us know when you plan to do the next test, so that we can choose a suitable file (one that is not present in the GUI). FYI, we just finished uploading the root files of the latest runs by hand.

khurtado commented 2 years ago

@rvenditti Yes, I think the basic functionality is there. The missing pieces are:

khurtado commented 2 years ago

@rvenditti @jfernan2 On the EOS storage topic. How much space do you think would be needed for this?

khurtado commented 2 years ago

@rvenditti @jfernan2 Do we need to support multi urls for the registration site? Or can we safely assume it will only be 1 single register url in the future? Current changes have it so that it supports multi url, but for the sake of simplicity, we could remove that if it's not going to be needed.

jfernan2 commented 2 years ago

@khurtado On the EOS space, the current quota we have in /eos/cms/store/group/comm_dqm is 66TB out of which we have used 67% including this service (new DQM Offline GUI for Run2 and LS2 data) and others.

Run2, together with all the data taking in between (especially in LS2), accounts so far for ~7TB. That is data taking only; apart from that we have RelVals, which so far, up to CMSSW_12_3_X, account for 3.1TB

Since this is a long term project where all DQM root files for Run3 and beyond may be stored, it is difficult to estimate a final need at this point; however, a rough estimate could be 100TB at the moment. So, at present we could be using /eos/cms/store/group/comm_dqm once you solve the authentication problems, at least as a temporary solution

About urls, right now the new OFFLINE GUI lives on: https://cmsweb-testbed.cern.ch/dqm/offline-test-new/ This url may change a bit once commissioning ends with this addition you are working on. Another url/GUI should be created for the RelVal and Development (used in Jenkins tests for PR comparisons) GUI instances.

khurtado commented 2 years ago

@jfernan2 Thank you for the details! Regarding the URL, what I meant more is if we would ever be required to report to more than 1 register URL at a time.

For example, for a given ROOT file, report to URL1 and URL2 in the form

url="URL1;URL2"

We have a loop to support this below: https://github.com/dmwm/WMCore/pull/11015/files/ee8c9fe76301997917e938b998393760893de7cf#diff-28545267cb6f40f2c39bdc92f389e86d92ddd36355037cc3df162b199606a16eR454

, but @amaltaro made the suggestion of removing this feature for simplicity if we only expect to work with 1 single URL (even if that single url changes over time)
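For context, the multi-URL form boils down to something like the sketch below. This is only an illustration of the `;`-separated convention, not the loop from #11015; the helper name is made up.

```python
from typing import List


def split_register_urls(url: str) -> List[str]:
    """Split a ';'-separated register URL string into individual URLs."""
    return [part.strip() for part in url.split(";") if part.strip()]


# A single URL goes through the same path and yields a one-element list,
# so dropping multi-URL support would only remove the split, not the POST.
for register_url in split_register_urls("URL1;URL2"):
    pass  # one HTTP POST per register URL would happen here
```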

jfernan2 commented 2 years ago

ah ok, just a single url is fine then :-) Thanks

khurtado commented 2 years ago

ah ok, just a single url is fine then :-) Thanks

Sounds good, thanks!

khurtado commented 2 years ago

@jfernan2 One question on this. Is this a single URL that is not workflow dependent? Anything special in the case of T0 workflows (like, do they use this new DQM mechanism the same way as non-T0 DQM stuff)? See the question below from Alan:

While reviewing it, it just occurred to me that T0 also uses this code. So we need to verify whether T0 would need to use this same mechanism; or whether we need to work on workflow configuration that could enable/disable this new DQM feature. Can you please follow this up with the DQM team?

jfernan2 commented 2 years ago

Hi @khurtado let me put it this way in order to clarify:

amaltaro commented 2 years ago

Hi @jfernan2, thank you for this clarification.

Changes proposed by Kenyi are almost ready to be merged, but there are two crucial details to be sorted out before it goes in:

1) is T0 supposed to adopt the same "transient" DQM workflow? Meaning: i) upload the root file to the DQM Gui; ii) stage the root file to the EOS area; iii) register the dataset/run/lumi against the DQM register system?

2) we would like to avoid adding another configurable register URL to every single workflow in the system. So, I wonder if we could come up with endpoint names that we could derive from the current/old DQM Gui? For instance:

Do you think it would be feasible? If not, then there is still a substantial development to be done to support/validate the new argument at the workflow level and propagate it all the way down to the job runtime code.

jfernan2 commented 2 years ago

Hi @amaltaro

  1. Yes, at least until we can decommission the old GUI, when step i) will be dropped
  2. If using https://cmsweb.cern.ch/dqm/offline for i) and https://cmsweb-testbed.cern.ch/dqm/offline-test-new/api/v1/register for iii) causes you trouble, we can try to convince cmsweb to move https://cmsweb-testbed.cern.ch/dqm/offline-test-new/api/v1/register to https://cmsweb.cern.ch/dqm/offline-new/api/v1/register or similar. I do not fully understand why two different urls are a problem, since the tasks are coupled but independent.

Bear in mind that the two GUIs (new and old) live on different machines; the cmsweb crew offered this kind of provisional url for the new one as a temporary place. So, i) does the job for the old GUI (after receiving the file, the machine registers it in its own internal database, which is why there is no API), while ii) + iii) do the same for the new GUI, hence the different urls
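For step iii), the request body follows the format posted earlier in this thread. A minimal sketch of building it could look like this; the helper name is hypothetical, and the meaning of `fileformat` = 1 is simply copied from that example rather than known.

```python
import json


def build_register_payload(dataset, run, lumi, eos_path, fileformat=1):
    """Build the JSON body for the register API (a list with one entry).

    Field names follow the example request body quoted earlier in this
    thread; run and lumi are sent as strings there, so we do the same.
    """
    entry = {
        "dataset": dataset,
        "run": str(run),
        "lumi": str(lumi),
        "file": eos_path,  # path of the file already staged to EOS in step ii)
        "fileformat": fileformat,
    }
    return json.dumps([entry])


body = build_register_payload(
    "/a/b/c", 123456, 0,
    "/eos/cms/store/group/comm_dqm/DQMGUI_data/location/file.root",
)
```

The resulting `body` would then be sent with an authenticated HTTP POST to the register url, after the copy to EOS has succeeded.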

For the RelVal GUI, since it has not been settled yet, we can for sure make it more legacy compliant, since the old RelVal GUI is on cmsweb (not on cmsweb-testbed). Thanks

amaltaro commented 2 years ago

Thanks for these answers.

Okay, so once these changes take place, the T0 machinery will start adopting the new DQM Gui procedure as well. FYI @germanfgv @jhonatanamado

Regarding the urls, the best way to deal with these multiple urls would be creating a new workflow/spec parameter, but that would require many more changes in WMCore, as well as in McM and PyReleaseValidation when creating workflows.

What we have right now is 3 instances of the DQM Gui: a) offline b) relval c) dev

and EACH of those instances is available both in the cmsweb-testbed and the cmsweb production clusters.

The question we need to answer is: for any given workflow, which DQM register URL does it have to use? Given that it's going to be something transient, I was considering having a map from the OLD DQM url to the NEW DQM url. Can we agree on such a convention/map? Maybe something like (where the key is the old url, and the value is the new one):

{"https://cmsweb.cern.ch/dqm/offline": "https://cmsweb.cern.ch/dqm/offline-new/api/v1/register",
 "https://cmsweb.cern.ch/dqm/relval": "https://cmsweb.cern.ch/dqm/relval-new/api/v1/register",
 "https://cmsweb.cern.ch/dqm/dev": "https://cmsweb.cern.ch/dqm/dev-new/api/v1/register"}

? We might have to loop in the CMSWEB team to see whether this would be possible as well.
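On the job runtime side, such a map could be used roughly as sketched below. The `-new` endpoints are only the proposal above and do not exist yet; the function name is hypothetical.

```python
from typing import Optional

# Proposed (not yet existing) convention: old DQM GUI url -> new register url
OLD_TO_NEW_REGISTER_URL = {
    "https://cmsweb.cern.ch/dqm/offline": "https://cmsweb.cern.ch/dqm/offline-new/api/v1/register",
    "https://cmsweb.cern.ch/dqm/relval": "https://cmsweb.cern.ch/dqm/relval-new/api/v1/register",
    "https://cmsweb.cern.ch/dqm/dev": "https://cmsweb.cern.ch/dqm/dev-new/api/v1/register",
}


def register_url_for(old_gui_url: str) -> Optional[str]:
    """Return the new register url for a workflow's old DQM GUI url, if any."""
    # Tolerate a trailing slash in the workflow's configured url.
    return OLD_TO_NEW_REGISTER_URL.get(old_gui_url.rstrip("/"))
```

Returning None for an unknown url would let the job skip the EOS copy and registration while still doing the old upload.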

In addition to that (and if we come up with a logic for the question above, this comment becomes moot), I would not recommend relying on cmsweb-testbed in production workflows. If anything happens to that cluster, we would be failing production workflows.

If you prefer, we could try to organize a zoom call between you/me/Kenyi tomorrow to discuss these details as well.

jfernan2 commented 2 years ago

Thanks @amaltaro From the DQM side there is no problem moving from cmsweb-testbed to non-testbed. I guess it is just a matter of interacting with the cmsweb crew. I will let the DQM conveners comment here

rvenditti commented 2 years ago

Hi, yes I confirm that for the DQM-DC team there is no problem with migrating the offline GUI from testbed to non-testbed. Is this migration already in progress, or do you still need to discuss the details with us?