cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Instabilities in 11634.911 (DD4Hep) workflow comparisons #35109

Open makortel opened 2 years ago

makortel commented 2 years ago

We've observed differences in the DD4Hep workflow 11634.911 comparisons in tests of a few PRs that should not affect results of the DD4Hep workflow. This issue is to collect pointers to those comparisons.

cmsbuild commented 2 years ago

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 2 years ago

assign geometry

cmsbuild commented 2 years ago

New categories assigned: geometry

@Dr15Jones, @cvuosalo, @civanch, @ianna, @mdhildreth, @makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 2 years ago

Observed in https://github.com/cms-sw/cmssw/pull/35068#issuecomment-908500868 and https://github.com/cms-sw/cmssw/pull/34995#issuecomment-904980210

makortel commented 2 years ago

Here is another occurrence https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-cf3e63/18431/summary.html

civanch commented 2 years ago

@cvuosalo, is the problem back, or is it another one?

cvuosalo commented 2 years ago

The instability appears to be random and rare. It is strange that wf 11634.912 does not show it. The difference between the two workflows is that 11634.911 runs the algorithms and calculates the reco geometry, while 11634.912 reads the already calculated algorithm results and reco geometry out of the DB.

cvuosalo commented 2 years ago

I ran workflow 11634.911 thirty times in CMSSW_12_1_X_2021-09-20-1100 with identical results each time. It appears the instability has gone away.

cmsbuild commented 2 years ago

This issue is fully signed and ready to be closed.

makortel commented 2 years ago

On the other hand, the comparison differences have appeared rather rarely.

makortel commented 2 years ago

Here is another instance https://github.com/cms-sw/cmssw/pull/36222#issuecomment-977776359.

Could we re-open the issue (and keep it open for a longer time)?

civanch commented 2 years ago

@makortel, I cannot; maybe you can reopen?

makortel commented 2 years ago

I don't have the power. I'm not sure whether @qliphy / @perrotta do, or if we need @smuzaffar.

perrotta commented 2 years ago

Wow: I have the power!

cvuosalo commented 2 years ago

I ran DD4hep 11634.911 simulation 30 times in CMSSW_12_2_X_2021-12-01-1100 and got identical results each time.

Note that 11634.911, which is DD4hep XML, showed the instability reported in this issue, but 11634.0, which is DD4hep from DB, did not. That fact would seem to point to the Geometry Algorithm code, since the Algorithms are run for XML but not for DB, because the DB geometry contains the results of the Algorithms.

Without the ability to reproduce the instability, it is very hard to track it down and fix it.

makortel commented 2 years ago

I fully agree that the issue is hard to track down (it certainly appears to be much rarer than 1/30), but I think recording the times it happens would still be useful, to have some hope of eventually resolving this. (The cause could be similar to #34448, which is also still a mystery.)

makortel commented 2 years ago

Here is another instance https://github.com/cms-sw/cmssw/pull/36507#issuecomment-996225899

civanch commented 2 years ago

@makortel, @Dr15Jones, does the recent case mean that we need to add a dependence between modules that do not exchange data? If they should be executed in a specific order, then maybe there could be pseudo data: for example, the next module consumes a product of the previous module without actually using its data (see the sketch below)?
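
To illustrate the idea, here is a minimal sketch (module and product names are hypothetical, and the produced type is arbitrary) of how such a pseudo-data dependence could be expressed with the standard `consumes()` mechanism: the later module asks the framework for a product of the earlier module but never uses its contents, so the framework has to schedule it afterwards.

```cpp
// Hypothetical example: "LaterProducer" declares a consumes() dependence on a
// product of the module configured as "previous" purely to constrain the
// scheduling order; the retrieved handle is never dereferenced.
#include <memory>

#include "FWCore/Framework/interface/global/EDProducer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "FWCore/Utilities/interface/EDGetToken.h"
#include "FWCore/Utilities/interface/InputTag.h"

class LaterProducer : public edm::global::EDProducer<> {
public:
  explicit LaterProducer(edm::ParameterSet const& ps)
      : previousToken_(consumes<int>(ps.getParameter<edm::InputTag>("previous"))) {
    produces<int>();
  }

  void produce(edm::StreamID, edm::Event& event, edm::EventSetup const&) const override {
    // Retrieve the product only to establish the dependence; its value is not used.
    (void)event.getHandle(previousToken_);
    event.put(std::make_unique<int>(0));
  }

private:
  edm::EDGetTokenT<int> const previousToken_;
};

DEFINE_FWK_MODULE(LaterProducer);
```

Whether such an artificial dependence is desirable at all is discussed in the reply below.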

cvuosalo commented 2 years ago

Rather than being a random instability, could it be that DD4hep simulation from XML is just very sensitive to any perturbation in the execution process? On the surface, #36507 appears to potentially alter order of execution, which could create some side effect. The other example above was a change to cmsDriver. Perhaps the instability is sensitive to infrastructure changes. Just guessing here.

perrotta commented 2 years ago

Here both 11634.911 and 11634.914 show instabilities. I am now relaunching the tests to see if it reproduces in a different baseline IB.

perrotta commented 2 years ago

See also https://github.com/cms-sw/cmssw/pull/36325#issuecomment-996082941, where besides 11634.911 and 11634.914 also 11634.7 is affected.

missirol commented 2 years ago

Just for the record, https://github.com/cms-sw/cmssw/pull/36515#issuecomment-995701965 is likely another example of this.

makortel commented 2 years ago

Happened a second time in https://github.com/cms-sw/cmssw/pull/36507#issuecomment-996225899.

> does the recent case mean that we need to add a dependence between modules that do not exchange data? If they should be executed in a specific order, then maybe there could be pseudo data: for example, the next module consumes a product of the previous module without actually using its data?

> Rather than being a random instability, could it be that DD4hep simulation from XML is just very sensitive to any perturbation in the execution process? On the surface, #36507 appears to potentially alter order of execution, which could create some side effect. The other example above was a change to cmsDriver. Perhaps the instability is sensitive to infrastructure changes. Just guessing here.

I agree, this starts to look similar to https://github.com/cms-sw/cmssw/issues/34448, in that the execution order of something affects the results (simulation history?). Ideally all our code should be independent of the execution order (which is only constrained by the declared data dependencies of the modules), and the execution order of seemingly independent modules affecting results is an indication of a bug of some sort (e.g. a data race or use of an uninitialized variable).
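
As a generic illustration (plain C++ with made-up names, not CMSSW code), this is how an uninitialized variable can make results depend on the execution order of otherwise independent pieces of code:

```cpp
#include <cstdio>

// Both "modules" share this state; 'scale' is never initialized at construction.
struct SharedState {
  double scale;  // BUG: uninitialized
};

// Reads 'scale'. If it runs before moduleB(), it reads an indeterminate value.
double moduleA(SharedState& s) { return 2.0 * s.scale; }

// Sets 'scale' as a side effect, masking the missing initialization.
double moduleB(SharedState& s) {
  s.scale = 1.5;
  return s.scale;
}

int main(int argc, char**) {
  SharedState s;
  double a, b;
  if (argc > 1) {
    a = moduleA(s);  // order A then B: 'a' is built from an uninitialized value
    b = moduleB(s);
  } else {
    b = moduleB(s);  // order B then A: both values are well defined
    a = moduleA(s);
  }
  std::printf("a=%f b=%f\n", a, b);
  return 0;
}
```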

makortel commented 2 years ago

Occurred also in #36530 (11634.{7,911,014}).

makortel commented 2 years ago

> Occurred also in #36530 (11634.{7,911,014}).

And again https://github.com/cms-sw/cmssw/pull/36530#issuecomment-996920686, this time in 11634.7 and 12434.0. This occurring only in non-X.911/X.914 workflows makes it rather annoying to distinguish from "real" comparison failures.

cvuosalo commented 2 years ago

#36235 shows loss of regression in the comparison test for 11634.911, but no others.

It seems the instability has increased very recently. Could there be a recently merged PR that is responsible?

cvuosalo commented 2 years ago

#36411 was merged just before we started seeing more instabilities. It makes a change in simulation by adding PPS simulation and enabling LHCTransport. Could there be unstable code in the new PPS simulation?

cvuosalo commented 2 years ago

#36482 shows regression loss for 11634.7 and 11634.911.

cvuosalo commented 2 years ago

#36351 shows regression loss for 11634.0 and 12434.0.

civanch commented 2 years ago

@makortel, @perrotta, @cvuosalo, what if we roll back #36411 now? This may be achieved simply by adding one line to #36524: disable PPS for Run-3 while keeping the other modifications of that PR. In the modifier for Run-3 we use "False" instead of "True".

makortel commented 2 years ago

Good catch @cvuosalo. The static analyzer pointed to https://github.com/cms-sw/cmssw/blob/a9f44839b30cab42562af4434e1f01c8b1af8cc6/SimTransport/PPSProtonTransport/src/TotemTransport.cc#L82-L87, but it seems to me that this code is not being called in the matrix workflows (I'm specifically looking into 12434.0). The const_cast should nevertheless be removed, because ApplyBeamCorrection() ends up modifying an input data product. I'll open a separate issue.
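
For context, here is a hedged sketch (hypothetical types and function names, not the actual TotemTransport.cc code) of the kind of pattern being flagged, and one way to remove the const_cast by correcting a local copy instead of the product owned by the Event:

```cpp
#include <vector>

struct Particle {
  double px, py, pz;
};

// Mutates its argument, e.g. by smearing the momentum.
void applyBeamCorrection(Particle& p) { p.px *= 1.001; }

// Problematic pattern: the input collection is a (const) data product owned by
// the Event, but const_cast silently allows it to be modified in place,
// changing what any other module reading the same product will see.
void transportBad(std::vector<Particle> const& input) {
  for (Particle const& p : input) {
    applyBeamCorrection(const_cast<Particle&>(p));
  }
}

// Safer pattern: copy each particle, correct the copy, and work with that.
std::vector<Particle> transportGood(std::vector<Particle> const& input) {
  std::vector<Particle> corrected;
  corrected.reserve(input.size());
  for (Particle const& p : input) {
    Particle c = p;
    applyBeamCorrection(c);
    corrected.push_back(c);
  }
  return corrected;
}
```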

makortel commented 2 years ago

> What if we roll back #36411 now?

I think this would be useful. The massive spurious comparison differences disturb the testing of unrelated PRs, and we'd get another data point for testing the hypothesis of the culprit being in the PPS simulation code.

perrotta commented 2 years ago

> @makortel, @perrotta, @cvuosalo, what if we roll back #36411 now? This may be achieved simply by adding one line to #36524: disable PPS for Run-3 while keeping the other modifications of that PR. In the modifier for Run-3 we use "False" instead of "True".

@civanch I have disabled PPS for run3 in #36524, as you suggested: please check whether https://github.com/cms-sw/cmssw/pull/36524/commits/45ef8d9527726d35521c9bfb2874a085d0ea22ad corresponds exactly to what you meant.

civanch commented 2 years ago

Can we close this issue now?

makortel commented 2 years ago

Comparisons have been stable in this respect for a while, so I think we can close it now.

cvuosalo commented 2 years ago

+1

cmsbuild commented 2 years ago

This issue is fully signed and ready to be closed.

makortel commented 1 year ago

Let's record here that the tests in https://github.com/cms-sw/cmssw/pull/41273#issuecomment-1498963486 showed 5932 differences in the DQM comparisons of 11634.911 (and it was the only Run-{1,2,3}-phase workflow showing differences). Running the tests a second time did not show any differences. The differences seemed to be across the board (i.e. not localized to a few subsystems).

makortel commented 1 year ago

Let's record here that the tests in https://github.com/cms-sw/cmssw/pull/41522#issuecomment-1535217784 showed 4822 differences in the DQM comparisons of 23634.911 across the board.

missirol commented 1 year ago

For the record, something similar happened in #41533: 47459 differences in the DQM comparisons of wf 23634.911.

makortel commented 1 year ago

@cms-sw/geometry-l2 Should we open a new issue to record these instabilities or reopen this one?

missirol commented 1 year ago

> For the record, something similar happened in #41533: 47459 differences in the DQM comparisons of wf 23634.911.

Strange to me that https://github.com/cms-sw/cmssw/pull/41541#issuecomment-1535308120 reports exactly the same: 47459 differences in the DQM comparisons of wf 23634.911. I haven't seen this kind of difference often before, yet it happened twice today.

makortel commented 1 year ago

And another one in https://github.com/cms-sw/cmssw/pull/41532#issuecomment-1535500684, 4822 differences in workflow 23634.911.

perrotta commented 1 year ago

One more in https://github.com/cms-sw/cmssw/pull/41504#issuecomment-1535844334

makortel commented 1 year ago

Another one in https://github.com/cms-sw/cmssw/pull/41852#issuecomment-1574224215, 5582 differences in workflow 11634.911.

makortel commented 1 year ago

(reopening the issue)

makortel commented 9 months ago

Another one in https://github.com/cms-sw/cmssw/pull/43041#issuecomment-1767197856, 6123 differences in workflow 11634.911. The CPU model was the same (Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz) for both the reference and the PR test.

makortel commented 7 months ago

To note here that https://github.com/cms-sw/cmssw/pull/43439 is removing 11634.911 from the short matrix, after which we would not see these instabilities anymore in PR tests.

AdrianoDee commented 7 months ago

> To note here that #43439 is removing 11634.911 from the short matrix, after which we would not see these instabilities anymore in PR tests.

Let me know if you think it is preferable to keep it just to have this "constant reminder" of the issue or if it is something that we can leave to IB tests.

civanch commented 7 months ago

From my point of view, keeping this issue open does not help much, even if we likely do have a problem with 11634.911, which is now taken out of everyday testing.