cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Instabilities in 11634.911 (DD4Hep) workflow comparisons #35109

Open makortel opened 3 years ago

makortel commented 3 years ago

We've observed differences in the DD4Hep workflow 11634.911 comparisons in tests of a few PRs that should not affect results of the DD4Hep workflow. This issue is to collect pointers to those comparisons.

makortel commented 8 months ago

> To note here that #43439 is removing 11634.911 from the short matrix, after which we would not see these instabilities anymore in PR tests.
>
> Let me know if you think it is preferable to keep it just to have this "constant reminder" of the issue or if it is something that we can leave to IB tests.

Good question. PR tests (including the short matrix) should be about ensuring the PRs behave as expected, and therefore I think using PR tests to stress-test reproducibility is likely not the best way.

If there is no other use for 11634.911 in the short matrix (@cms-sw/geometry-l2 could you comment?), I'd be in favor of dropping 11634.911 from the short matrix. Unfortunately IBs themselves don't provide any facilities for inspecting workflow results. @smuzaffar Maybe we should think about something here, at least for select workflows? (not really optimal, but maybe better than (mis)using PR tests?)
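To make the idea of a dedicated reproducibility check concrete, here is a minimal sketch, assuming the same workflow is run twice on identical inputs and each run is reduced to a dictionary of summary quantities. This is not an existing CMSSW or IB facility; the function name `compare_summaries` and the quantity names and values are hypothetical, chosen only for illustration.

```python
from math import isclose


def compare_summaries(reference, candidate, rel_tol=0.0):
    """Return a sorted list of (name, ref_value, cand_value) entries that differ.

    A quantity missing from one of the two runs is reported with None
    in place of the missing value.
    """
    differences = []
    for name in sorted(set(reference) | set(candidate)):
        ref = reference.get(name)
        cand = candidate.get(name)
        if ref is None or cand is None:
            # Present in only one run: always a difference.
            differences.append((name, ref, cand))
        elif not isclose(ref, cand, rel_tol=rel_tol, abs_tol=0.0):
            # With rel_tol=0.0 this requires exact equality.
            differences.append((name, ref, cand))
    return differences


# Hypothetical summaries from two runs of the same workflow.
run_a = {"nTracks": 1523.0, "nVertices": 48.0, "sumEt": 10234.5}
run_b = {"nTracks": 1523.0, "nVertices": 47.0, "sumEt": 10234.5}

for name, ref, cand in compare_summaries(run_a, run_b):
    print(f"{name}: {ref} != {cand}")
```

The default `rel_tol=0.0` demands bitwise-equal values, which is the relevant criterion when the question is whether a workflow is reproducible at all; a non-zero tolerance would only make sense for quantities that are expected to fluctuate.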

makortel commented 5 months ago

Just to note that in the end https://github.com/cms-sw/cmssw/pull/43439 kept 11634.911.

srimanob commented 4 months ago

Hi @makortel I think this issue is solved, should we close it? Thx.

makortel commented 4 months ago

Do we know how the issue got resolved? Or is it just not occurring anymore?

srimanob commented 4 months ago

The workflow in question is Run-3, right? Since DD4hep is run by default in Run-3 workflows (.911 = .0 for Run-3), I think we don't see any instabilities anymore. Am I missing a reason why we should keep investigating the Run-3 DD4hep workflow?

makortel commented 4 months ago

From the history the frequency seems to have been one occurrence every 1-4 months (although I suspect not all L2s report those).

Earlier comments suggest that .911 and .0 are different, by .911 reading the geometry from XML and .0 from the DB.

srimanob commented 4 months ago

> From the history the frequency seems to have been one occurrence every 1-4 months (although I suspect not all L2s report those).
>
> Earlier comments suggest that .911 and .0 are different, by .911 reading the geometry from XML and .0 from the DB.

Ah, you are right. .911 is the XML version, and .912 (which is the .0 default now) is the DB one. Do we need to monitor XML when we use the DB? I mean, we don't do Run-1 or Run-2 XML (DDD) anymore, so we never know whether there is an issue there.