Open ArturAkh opened 1 year ago
A new Issue was created by @ArturAkh Artur Gottmann.
@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
Assign core
New categories assigned: core
@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks
Thanks @ArturAkh for reporting and diagnosing the problem.
@smuzaffar I suppose updating frontier in 10_6_X could be fairly straightforward (fingers crossed). That would imply a new release (would be 10_6_33 as of today).
My understanding is that updating the CMSSW release in submitted workflows would be tedious (but I'll include @cms-sw/pdmv-l2 to confirm), and therefore, in the leading order, the fix could be included only in new workflow submission.
@ArturAkh Do you know if it would be easy for KIT to not accept workflows that use specific CMSSW versions? (just wondering possible stopgap measures)
Hi @makortel,
Thanks a lot for taking care of this!
In principle, we could reject workflows based on CMSSW version.
However, since we have seen the problem only on a minor subsite at KIT, and in particular not on the Tier 1 resources, it wouldn't be a major issue to wait for the switch to new workflows.
Currently, we don't observe the problem - it appears on the subsite in question from time to time.
Cheers,
Artur
@smuzaffar I suppose updating frontier in 10_6_X could be fairly straightforward (fingers crossed). That would imply a new release (would be 10_6_33 as of today).
@makortel, 10.6.X is already using frontierclient 2.9.1. Looks like we need to update it for the 10.2.X release cycle, but it should be doable (it depends only on expat openssl pacparser python zlib). Let's hope version 2.9.1 works for slc6 :-)
10.6.X is already using frontierclient 2.9.1 .
Oh nice.
Looks like we need to update it for 10.2.X release cycle but it should be doable
Ah right, somehow missed the description mentioning 10_2_16_UL. I believe we should update the 8_0_X and 9_4_X as well, because those are used in the HLT step for 2016 and 2017 MC (as far as I can tell, didn't quickly find a definitive source).
Ah, looks like 10.2.X already has frontier client 2.9.1 (https://github.com/cms-sw/cmsdist/pull/5707 ), but maybe we never built a release out of it?
I see 10_2_29 has 2.9.1, maybe we'd need the "UL" variant of that? (was that just about using slc7 as the production architecture instead of slc6?)
In 9_4_X I see the last "UL" release CMSSW_9_4_16_UL has 2.8.20, whereas the latest release CMSSW_9_4_21 has 2.9.1.
In 8_0_X I see the latest release CMSSW_8_0_36_UL has 2.9.1.
So it seems to me the only possible action would be to build "UL" releases on the HEADs of 10_2_X and 9_4_X (or rebuilding their latest releases). @cms-sw/orp-l2
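To make the inventory above easier to check, here is a minimal sketch that lists which of the releases named in this thread still carry a frontier client older than 2.9.1. The version numbers are the ones quoted in this thread; the entry for CMSSW_10_2_16_UL is inferred from the issue description, which attributes the failures to tag cms/2.8.20. CMSSW_10_2_16_UL2 is left out because the thread only says it has the "old" client without giving a version. The names `FRONTIER_CLIENT`, `version_tuple` and `needs_update` are made up for this illustration and are not part of any CMS tool.

```python
# Frontier client version per release, as quoted in this thread.
# CMSSW_10_2_16_UL is inferred from the issue description (bug in cms/2.8.20).
FRONTIER_CLIENT = {
    "CMSSW_10_2_16_UL": "2.8.20",
    "CMSSW_10_2_29": "2.9.1",
    "CMSSW_9_4_16_UL": "2.8.20",
    "CMSSW_9_4_21": "2.9.1",
    "CMSSW_8_0_36_UL": "2.9.1",
}

# First frontier client tag reported in this thread to contain the fix.
FIXED_IN = (2, 9, 1)


def version_tuple(version):
    """Turn '2.8.20' into (2, 8, 20) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def needs_update(releases, fixed_in=FIXED_IN):
    """Return the sorted release names whose client predates the fix."""
    return sorted(
        name
        for name, version in releases.items()
        if version_tuple(version) < fixed_in
    )


if __name__ == "__main__":
    for name in needs_update(FRONTIER_CLIENT):
        print(name, FRONTIER_CLIENT[name])
```

Comparing integer tuples instead of raw strings avoids ordering mistakes once a component reaches two digits (for example a hypothetical 2.10.0 would sort before 2.9.1 as a string).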
We already have CMSSW_10_2_29 with frontier client 2.9.1. Can we move to that release? There is also CMSSW_10_2_16_UL2, but with the old frontier client. If we have to stick to CMSSW_10_2_16_UL, then we can build CMSSW_10_2_16_UL3, which should use the CMSSW_10_2_16_UL2 tag of cmssw and REL/CMSSW_10_2_16_UL2/slc*_amd64_gcc700 + new frontier client.
Dear @smuzaffar and @makortel,
Do you have any news on this issue?
We still see a few CMS jobs running with the old _UL CMSSW releases, failing for the same reason as reported above.
Cheers,
Artur
@ArturAkh, as I mentioned in https://github.com/cms-sw/cmssw/issues/40701#issuecomment-1421566098 , we need a UL3 release. @perrotta @rappoccio, if there are no objections then I can prepare the cmsdist branch/tag (which will be REL/CMSSW_10_2_16_UL2/slc7_amd64_gcc700 + new frontier client) for this release.
Thank you @smuzaffar for taking care of it. So, the idea is to stick to 10_2_16 for UL: that seems correct to me, as newer 10_2_X releases mostly add simulation and generator stuff, and they probably don't deserve a UL version. That's fine with me.
see https://github.com/cms-sw/cmssw/issues/41316 , feel free to start the build process
Dear all,
Are there any plans to cover the remaining release outlined here?
https://github.com/cms-sw/cmssw/issues/40701#issuecomment-1421541801
As far as I understood, CMSSW_9_4_16_UL would require something similar, right?
Cheers,
Artur
Thanks @ArturAkh for the ping.
@cms-sw/orp-l2 Should we (or, you) build e.g. 9_4_21_patch1_UL? Or 9_4_22 and 9_4_22_UL? (there are some PRs in the 9_4_X branch that are not yet part of any release) Or 9_4_16_UL2?
Just to remind, the new release would be used only by new Run 2 UL workflows, and only if @cms-sw/pdmv-l2 submits the new workflows using the new release. From that point of view one could ask first if a new 9_4_X UL release would make sense from @cms-sw/pdmv-l2 point of view?
@makortel if there is a need to build a new release, we will. Right now is probably a not-so-crowded period release-wise, so we can do it.
The exact release to be built depends on the exact needs. As far as I can see, all updates added on top of 9_4_16 either add new features, or improve the procedures without affecting their physics content. As such, if I have to build a new release, I would rather opt for making a 9_4_22 with the top of the HEAD, and then a UL version of it.
In any case, I would do so if and only if @cms-sw/pdmv-l2 really plans to submit new workflows with it.
Dear CMSSW experts,
At one of our subsites at KIT, we have encountered a number of failed jobs (example), which have the following error:
This has happened for jobs running with CMSSW_10_2_16_UL release.
According to our investigation, it seems to be a bug in the frontier client in tag cms/2.8.20: https://github.com/cms-externals/frontier_client/blob/e96f07fe14a188580470cbbd27ad3fc9b458b5ca/http/fn-urlparse.c#L57-L62
The client expects "http" in front of the IP or hostname, which is contrary to what is written in the PAC specification. It is fixed in tag cms/2.9.1.
So we assume that CMS would require a new patch release picking up the new tag.
Thank you very much in advance for having a look into this.
Artur Gottmann
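To illustrate the mismatch described above: a PAC `FindProxyForURL` result lists entries such as `PROXY host:port` or `DIRECT`, separated by semicolons, with no URL scheme in front of the host. The following is a minimal sketch of a parser that accepts the bare form and also tolerates an optional `http://` prefix. It is not the frontier client code and makes no claim about how cms/2.9.1 implements the fix; the function name `parse_pac_result` and the hostname in the example are assumptions for this illustration, and IPv6 literals are deliberately not handled.

```python
def parse_pac_result(result):
    """Split a PAC result string into (kind, host, port) tuples.

    Illustrative only: accepts 'PROXY host:port' as well as
    'PROXY http://host:port'. IPv6 literals are not handled.
    """
    entries = []
    for raw in result.split(";"):
        item = raw.strip()
        if not item:
            continue  # skip empty pieces, e.g. after a trailing ';'
        parts = item.split(None, 1)
        kind = parts[0].upper()
        if kind == "DIRECT":
            entries.append(("DIRECT", None, None))
            continue
        if len(parts) != 2:
            raise ValueError("missing host in PAC entry: %r" % item)
        target = parts[1].strip()
        # Drop an optional scheme so both spellings are accepted.
        if "://" in target:
            target = target.split("://", 1)[1]
        target = target.rstrip("/")
        host, sep, port = target.rpartition(":")
        if sep:
            entries.append((kind, host, int(port)))
        else:
            # No ':' found, so the whole target is the host.
            entries.append((kind, target, None))
    return entries


if __name__ == "__main__":
    # squid.example.org is a placeholder hostname.
    print(parse_pac_result("PROXY squid.example.org:3128; DIRECT"))
```

A parser that insists on the `http://` prefix would reject the first entry in this example even though it is a well-formed PAC result, which matches the failure mode reported here.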