cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Bug in frontier client leading to failed CMS jobs at a KIT subsite #40701

Open ArturAkh opened 1 year ago

ArturAkh commented 1 year ago

Dear CMSSW experts,

At one of our subsites at KIT, we have encountered a number of failed jobs that show the following error:

Setting up Frontier log level
Beginning CMSSW wrapper script
 slc7_amd64_gcc700 scramv1 CMSSW
Performing SCRAM setup...
Completed SCRAM setup
Retrieving SCRAM project...
Completed SCRAM project
Executing CMSSW
cmsRun  -j FrameworkJobReport.xml PSet.py
%MSG-i ThreadStreamSetup:  (NoModuleName) 04-Feb-2023 08:56:27 UTC pre-events
setting # threads 4
setting # streams 4
%MSG
error [fn-urlparse.c:59]: config error: bad url 10.3.0.123:3128
error [fn-urlparse.c:59]: config error: bad url 10.3.0.123:3128
error [fn-urlparse.c:59]: config error: bad url 10.3.0.123:3128
error [fn-urlparse.c:59]: config error: bad url 10.3.0.123:3128
error [fn-urlparse.c:59]: config error: bad url 10.3.0.123:3128
error [fn-urlparse.c:59]: config error: bad url 10.3.0.123:3128
error [fn-urlparse.c:59]: config error: bad url 10.3.0.123:3128
----- Begin Fatal Exception 04-Feb-2023 08:57:39 UTC-----------------------
An exception of category 'StdException' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
Exception Message:
A std::exception was thrown.
Connection on "frontier://(loadbalance=proxies)(proxyconfigurl=file:///etc/wpad.dat)(backupproxyurl=http://cmsbpfrontier.cern.ch:3128)(backupproxyurl=http://cmsbproxy.fnal.gov:3128)(serverurl=http://cmsfrontier.cern.ch:8000/FrontierProd)(serverurl=http://cmsfrontier1.cern.ch:8000/FrontierProd)(serverurl=http://cmsfrontier2.cern.ch:8000/FrontierProd)(serverurl=http://cmsfrontier3.cern.ch:8000/FrontierProd)/CMS_CONDITIONS" cannot be established ( CORAL : "ConnectionPool::getSessionFromNewConnection" from "CORAL/Services/ConnectionService" )
----- End Fatal Exception -------------------------------------------------
Complete
process id is 9962 status is 66

This happened for jobs running with the CMSSW_10_2_16_UL release.
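To see which proxy settings the job was actually given, the connect string from the exception above can be split into its (key=value) options. The following is only a minimal sketch: the function name `parse_frontier_connect` is hypothetical, the regular expression assumes that option values never contain parentheses, and the string is shortened to a few of the options from the log. It is not how the frontier client itself parses the string.

```python
import re
from typing import Dict, List


def parse_frontier_connect(connect: str) -> Dict[str, List[str]]:
    """Collect the (key=value) options of a frontier:// connect string.

    Keys may repeat (e.g. several serverurl entries), so each key maps
    to a list of values in the order they appear.
    """
    options: Dict[str, List[str]] = {}
    # Assumption: values contain no parentheses, as in the log above.
    for key, value in re.findall(r"\(([^=()]+)=([^()]*)\)", connect):
        options.setdefault(key, []).append(value)
    return options


# Shortened copy of the connect string from the exception message.
connect = (
    "frontier://(loadbalance=proxies)(proxyconfigurl=file:///etc/wpad.dat)"
    "(backupproxyurl=http://cmsbpfrontier.cern.ch:3128)"
    "(backupproxyurl=http://cmsbproxy.fnal.gov:3128)"
    "(serverurl=http://cmsfrontier.cern.ch:8000/FrontierProd)"
    "/CMS_CONDITIONS"
)
options = parse_frontier_connect(connect)
print(options["proxyconfigurl"])
```

In this string no proxy is listed directly; the only local proxy source is the file given by proxyconfigurl, which suggests that the address in the "bad url" lines came from that file.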

According to our investigation, it seems to be a bug in the frontier client at tag cms/2.8.20:

https://github.com/cms-externals/frontier_client/blob/e96f07fe14a188580470cbbd27ad3fc9b458b5ca/http/fn-urlparse.c#L57-L62

The client expects an http scheme prefix in front of the IP address or hostname, which is contrary to what is written in the PAC specification. This is fixed in tag cms/2.9.1.
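To illustrate the mismatch: a PAC function returns entries of the form "PROXY host:port", without any scheme, so a consumer that wants full URLs has to add the scheme itself. The sketch below shows one tolerant way to do that in Python. It is not the code of the cms/2.9.1 fix, the function name `proxies_from_pac_result` is hypothetical, and treating every PROXY entry as an http proxy is an assumption.

```python
from typing import List


def proxies_from_pac_result(pac_result: str) -> List[str]:
    """Turn a FindProxyForURL() return string into proxy URLs.

    Entries such as "PROXY 10.3.0.123:3128" carry no scheme, so
    "http://" is prepended when it is missing (an assumption of this
    sketch). DIRECT, SOCKS and malformed entries are skipped.
    """
    urls: List[str] = []
    for entry in pac_result.split(";"):
        parts = entry.split()
        if len(parts) != 2 or parts[0].upper() != "PROXY":
            continue
        hostport = parts[1]
        if "://" not in hostport:
            hostport = "http://" + hostport
        urls.append(hostport)
    return urls


print(proxies_from_pac_result("PROXY 10.3.0.123:3128; DIRECT"))
# prints ['http://10.3.0.123:3128']
```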

We therefore assume that CMS would need a new patch release that picks up the new tag.

Thank you very much in advance for having a look into this.

Artur Gottmann

cmsbuild commented 1 year ago

A new Issue was created by @ArturAkh Artur Gottmann.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.


Dr15Jones commented 1 year ago

Assign core

cmsbuild commented 1 year ago

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 1 year ago

Thanks @ArturAkh for reporting and diagnosing the problem.

@smuzaffar I suppose updating frontier in 10_6_X could be fairly straightforward (fingers crossed). That would imply a new release (would be 10_6_33 as of today).

My understanding is that updating the CMSSW release in already submitted workflows would be tedious (but I'll include @cms-sw/pdmv-l2 to confirm), and therefore, to leading order, the fix could be included only in new workflow submissions.

@ArturAkh Do you know if it would be easy for KIT to not accept workflows that use specific CMSSW versions? (just wondering possible stopgap measures)

ArturAkh commented 1 year ago

Hi @makortel,

Thanks a lot for taking care of this!

In principle, we could reject workflows based on CMSSW version.

However, since we have seen the problem on a minor subsite at KIT, and in particular not on the Tier 1 resources, it wouldn't be a major issue to wait for the switch to new workflows.

Currently, we don't observe the problem - it appears on the subsite in question from time to time.

Cheers,

Artur

smuzaffar commented 1 year ago

> @smuzaffar I suppose updating frontier in 10_6_X could be fairly straightforward (fingers crossed). That would imply a new release (would be 10_6_33 as of today).

@makortel, 10.6.X is already using frontier client 2.9.1. It looks like we need to update it for the 10.2.X release cycle, but that should be doable (it depends only on expat, openssl, pacparser, python and zlib). Let's hope version 2.9.1 works for slc6 :-)

makortel commented 1 year ago

> 10.6.X is already using frontierclient 2.9.1 .

Oh nice.

> Looks like we need to update it for 10.2.X release cycle but it should be doable

Ah right, I somehow missed the description mentioning 10_2_16_UL. I believe we should update 8_0_X and 9_4_X as well, because those are used in the HLT step for 2016 and 2017 MC (as far as I can tell; I didn't quickly find a definitive source).

smuzaffar commented 1 year ago

Ah, it looks like 10.2.X already has frontier client 2.9.1 (https://github.com/cms-sw/cmsdist/pull/5707), but maybe we never built a release out of it?

makortel commented 1 year ago

I see 10_2_29 has 2.9.1, maybe we'd need the "UL" variant of that? (was that just about using slc7 as the production architecture instead of slc6?)

makortel commented 1 year ago

In 9_4_X I see the last "UL" release CMSSW_9_4_16_UL has 2.8.20, whereas the latest release CMSSW_9_4_21 has 2.9.1.

In 8_0_X I see the latest release CMSSW_8_0_36_UL has 2.9.1.

So it seems to me the only possible action would be to build "UL" releases on the HEADs of 10_2_X and 9_4_X (or to rebuild their latest releases). @cms-sw/orp-l2

smuzaffar commented 1 year ago

We already have CMSSW_10_2_29 with frontier client 2.9.1. Can we move to that release? There is also CMSSW_10_2_16_UL2, but with the old frontier client. If we have to stick to CMSSW_10_2_16_UL, then we can build CMSSW_10_2_16_UL3, which should use the CMSSW_10_2_16_UL2 tag of cmssw and REL/CMSSW_10_2_16_UL2/slc*_amd64_gcc700 + the new frontier client.
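Until such a release is in use, the stopgap raised earlier in this thread (a site rejecting workflows by CMSSW version) could be sketched as a simple release check. This is only an illustration: the set below lists the releases reported in this thread as carrying the old frontier client, the names `AFFECTED_RELEASES` and `accepts_release` are hypothetical, and how a site would hook such a check into its job matching is not covered here.

```python
# Releases reported in this thread as shipping the old frontier client
# (2.8.20); the list is an assumption and may not be complete.
AFFECTED_RELEASES = frozenset(
    {
        "CMSSW_10_2_16_UL",
        "CMSSW_10_2_16_UL2",
        "CMSSW_9_4_16_UL",
    }
)


def accepts_release(release: str, affected=AFFECTED_RELEASES) -> bool:
    """Return True if a job using this CMSSW release should be accepted."""
    # Exact match on the release name, so CMSSW_10_2_16_UL3 is not
    # rejected just because it shares a prefix with an affected release.
    return release.strip() not in affected


for release in ("CMSSW_10_2_16_UL", "CMSSW_10_2_29", "CMSSW_10_2_16_UL3"):
    print(release, accepts_release(release))
```

Matching on the exact release name rather than on a prefix keeps a future CMSSW_10_2_16_UL3 acceptable without changing the check.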

ArturAkh commented 1 year ago

Dear @smuzaffar and @makortel,

Do you have any news on this issue?

We still see a few CMS jobs running with the old _UL CMSSW releases and failing for the same reason as reported above.

Cheers,

Artur

smuzaffar commented 1 year ago

@ArturAkh, as I mentioned in https://github.com/cms-sw/cmssw/issues/40701#issuecomment-1421566098, we need a UL3 release. @perrotta @rappoccio, if there are no objections, I can prepare the cmsdist branch/tag (which will be REL/CMSSW_10_2_16_UL2/slc7_amd64_gcc700 + the new frontier client) for this release.

perrotta commented 1 year ago

Thank you @smuzaffar for taking care of it. So the idea is to stick to 10_2_16 for UL. That seems correct to me, as the newer 10_2_X releases mostly add simulation and generator changes and probably don't deserve a UL version. That's fine with me.

smuzaffar commented 1 year ago

see https://github.com/cms-sw/cmssw/issues/41316 , feel free to start the build process

ArturAkh commented 1 year ago

Dear all,

Are there any plans to cover the remaining release outlined here?

https://github.com/cms-sw/cmssw/issues/40701#issuecomment-1421541801

As far as I understood, CMSSW_9_4_16_UL would require something similar, right?

Cheers,

Artur

makortel commented 1 year ago

Thanks @ArturAkh for the ping.

@cms-sw/orp-l2 Should we (or, you) build e.g. 9_4_21_patch1_UL? Or 9_4_22 and 9_4_22_UL? (there are some PRs in the 9_4_X branch that are not yet part of any release) Or 9_4_16_UL2?

As a reminder, the new release would be used only by new Run 2 UL workflows, and only if @cms-sw/pdmv-l2 submits the new workflows using the new release. So one could first ask whether a new 9_4_X UL release would make sense from the @cms-sw/pdmv-l2 point of view.

perrotta commented 1 year ago

@makortel if there is a need to build a new release, we will. Right now is probably a not-so-crowded period release-wise, so we can do so.

The exact release to be built depends on the exact needs. As far as I can see, all updates added on top of 9_4_16 either add new features, or improve the procedures without affecting their physics content. As such, if I have to build a new release, I would rather opt for making a 9_4_22 with the top of the HEAD, and then a UL version of it.

In any case, I would do so if and only if @cms-sw/pdmv-l2 really plans to submit new workflows with it.