Open · makortel opened this issue 3 years ago
A new Issue was created by @makortel Matti Kortelainen.
@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign geometry
New categories assigned: geometry
@Dr15Jones,@cvuosalo,@civanch,@ianna,@mdhildreth,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks
Here is another occurrence https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-cf3e63/18431/summary.html
@cvuosalo, is the problem back or is it another one?
The instability appears to be random and rare. It is strange that wf 11634.912 does not show it. The difference between the two workflows is that 11634.911 runs the algorithms and calculates the reco geometry, while 11634.912 reads the already calculated algorithm results and reco geometry out of the DB.
I ran workflow 11634.911 thirty times in CMSSW_12_1_X_2021-09-20-1100 with identical results each time. It appears the instability has gone away.
This issue is fully signed and ready to be closed.
On the other hand the comparison differences have appeared rather rarely.
Here is another instance https://github.com/cms-sw/cmssw/pull/36222#issuecomment-977776359.
Could we re-open the issue (and keep it open for a longer time)?
@makortel, I cannot; maybe you can reopen?
I don't have the power. I'm not sure @qliphy / @perrotta have, or if we need @smuzaffar.
Wow: I have the power!
I ran DD4hep 11634.911 simulation 30 times in CMSSW_12_2_X_2021-12-01-1100 and got identical results each time.
Note that 11634.911, which is DD4hep XML, showed the instability reported in this issue, but 11634.0, which is DD4hep from DB, did not. That fact would seem to point to the Geometry Algorithm code, since the Algorithms are run for XML but not for DB, because the DB geometry contains the results of the Algorithms.
Without the ability to reproduce the instability, it is very hard to track it down and fix it.
I fully agree with the issue being hard to track down (it certainly appears to be much more rare than 1/30), but I think recording the times it happens would still be useful to have some hope of eventually resolving this. (the cause could be similar to #34448, which is also still a mystery)
Here is another instance https://github.com/cms-sw/cmssw/pull/36507#issuecomment-996225899
@makortel, @Dr15Jones, does the recent case mean that we need to add a dependency between modules that do not exchange data? If they have to be executed in a specific order, could there perhaps be pseudo data, for example the next module consuming a (dummy) product of the previous module rather than its actual data?
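(For illustration of the "pseudo data" idea above, a rough sketch, not existing CMSSW code: the module name, the `mustRunAfter` parameter and the `bool` dummy product are all invented, and the exact framework interfaces would need checking.)

```cpp
// Hypothetical sketch: a module that consumes a dummy product solely so that
// the framework schedules it after the module producing that product, even
// though no physics data is exchanged between the two.
#include <memory>

#include "FWCore/Framework/interface/global/EDProducer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "FWCore/Utilities/interface/EDGetToken.h"
#include "FWCore/Utilities/interface/InputTag.h"

class OrderedAfterDummy : public edm::global::EDProducer<> {
public:
  explicit OrderedAfterDummy(edm::ParameterSet const& ps)
      // The consumes() call is what creates the scheduling dependency.
      : dummyToken_(consumes<bool>(ps.getParameter<edm::InputTag>("mustRunAfter"))) {
    produces<bool>();  // emit our own dummy so further modules can chain on us
  }

  void produce(edm::StreamID, edm::Event& event, edm::EventSetup const&) const override {
    event.getHandle(dummyToken_);  // upstream module is guaranteed to have run
    // ... the real work of this module would go here ...
    event.put(std::make_unique<bool>(true));
  }

private:
  edm::EDGetTokenT<bool> const dummyToken_;
};
```

The point of the sketch is only that the framework orders modules by their declared data dependencies, so a dummy product is one way to express an ordering constraint that is not visible as data.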
Rather than being a random instability, could it be that DD4hep simulation from XML is just very sensitive to any perturbation in the execution process? On the surface, #36507 appears to potentially alter order of execution, which could create some side effect.
The other example above was a change to `cmsDriver`. Perhaps the instability is sensitive to infrastructure changes. Just guessing here.
Here both 11634.911 and 11634.914 show instabilities. I am now relaunching the tests to see if it reproduces in a different baseline IB.
See also https://github.com/cms-sw/cmssw/pull/36325#issuecomment-996082941, where besides 11634.911 and 11634.914 also 11634.911 and 11634.7 are affected.
Just for the record, https://github.com/cms-sw/cmssw/pull/36515#issuecomment-995701965 is likely another example of this.
Happened a second time in https://github.com/cms-sw/cmssw/pull/36507#issuecomment-996225899.
> does the recent case mean that we need to add a dependency between modules that do not exchange data? If they have to be executed in a specific order, could there perhaps be pseudo data, for example the next module consuming a (dummy) product of the previous module rather than its actual data?

> Rather than being a random instability, could it be that DD4hep simulation from XML is just very sensitive to any perturbation in the execution process? On the surface, #36507 appears to potentially alter order of execution, which could create some side effect. The other example above was a change to `cmsDriver`. Perhaps the instability is sensitive to infrastructure changes. Just guessing here.
I agree, this starts to look similar to https://github.com/cms-sw/cmssw/issues/34448 that execution order of something affects the results (simulation history?). Ideally all our code should be independent of the execution order (that is only constrained by the declared data dependencies of the modules), and execution order of seemingly independent modules affecting results is an indication of a bug of some sort (e.g. data race, use of uninitialized variable).
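(To make that point concrete with a toy example outside CMSSW: hidden shared state is the classic way seemingly independent modules end up order dependent.)

```cpp
// Standalone toy example (not CMSSW code): a file-scope cache that one
// function writes and another reads. The result of use() depends on whether
// fill() happened to run first, i.e. on scheduling rather than on any
// declared input.
#include <iostream>

namespace {
  double g_lastCorrection = 1.0;  // hidden shared state

  void fill(double correction) { g_lastCorrection = correction; }

  double use(double rawEnergy) { return rawEnergy * g_lastCorrection; }
}

int main() {
  std::cout << use(10.0) << '\n';  // prints 10 if nothing ran before
  fill(0.98);                      // an unrelated "module" runs first...
  std::cout << use(10.0) << '\n';  // ...and the same call now prints 9.8
}
```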
Occurred also in #36530 (11634.{7,911,014}).
And again https://github.com/cms-sw/cmssw/pull/36530#issuecomment-996920686, this time in 11634.7, 12434.0. The fact that this occurs only in non-X.911/X.914 workflows makes it rather annoying to distinguish from "real" comparison failures.
It seems the instability has increased very recently. Could there be a recently merged PR that is responsible?
@makortel, @perrotta, @cvuosalo, what if we roll back #36411 now? This may be achieved simply by adding one line to #36524: disable PPS for Run-3 but keep the other modifications of that PR. In the modifier for Run-3 we would use "False" instead of "True".
Good catch @cvuosalo. Static analyzer pointed to
https://github.com/cms-sw/cmssw/blob/a9f44839b30cab42562af4434e1f01c8b1af8cc6/SimTransport/PPSProtonTransport/src/TotemTransport.cc#L82-L87
But it seems to me that this code is not being called in matrix workflows (I'm specifically looking into 12434.0). The `const_cast` should be removed nevertheless because `ApplyBeamCorrection()` ends up modifying an input data product. I'll open a separate issue.
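(For context, a minimal illustration of the pattern in question, not the actual TotemTransport code: the struct and function names here are invented.)

```cpp
// Toy example: casting away const on an event product lets a module mutate
// data that other modules may also read, so downstream results can depend on
// whether, and in which order, this code runs.
#include <vector>

struct Proton {
  double xi;
};

void applyBeamCorrection(Proton& p) { p.xi *= 1.01; }  // modifies in place

void transport(const std::vector<Proton>& input /* read-only data product */) {
  for (const Proton& p : input) {
    // The const_cast silently turns the read-only input into writable data.
    applyBeamCorrection(const_cast<Proton&>(p));
  }
}
```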
> What if we roll back #36411 now?
I think this would be useful. Massive spurious comparison differences disturb the testing of unrelated PRs, and we'd get another data point for testing the hypothesis that the culprit is in the PPS simulation code.
> @makortel, @perrotta, @cvuosalo, what if we roll back #36411 now? This may be achieved simply by adding one line to #36524: disable PPS for Run-3 but keep the other modifications of that PR. In the modifier for Run-3 we would use "False" instead of "True".
@civanch I have disabled PPS for run3 in #36524, as you are suggesting: please check if https://github.com/cms-sw/cmssw/pull/36524/commits/45ef8d9527726d35521c9bfb2874a085d0ea22ad corresponds exactly to what you meant
Can we close this issue now?
Comparisons have been stable in this respect for a while, so I think we can close it now.
+1
This issue is fully signed and ready to be closed.
Let's record here that the tests in https://github.com/cms-sw/cmssw/pull/41273#issuecomment-1498963486 showed 5932 differences in the DQM comparisons of 11634.911 (and that was the only Run-{1,2,3} workflow showing differences). Running the tests a second time did not show any differences. The differences seemed to be across the board (i.e. not localized to a few subsystems).
Let's record here that the tests in https://github.com/cms-sw/cmssw/pull/41522#issuecomment-1535217784 showed 4822 differences in the DQM comparisons of 23634.911 across the board.
For the record, something similar happened in #41533: 47459 differences in the DQM comparisons of wf 23634.911.
@cms-sw/geometry-l2 Should we open a new issue to record these instabilities or reopen this one?
> For the record, something similar happened in #41533: 47459 differences in the DQM comparisons of wf 23634.911.
Strange to me that https://github.com/cms-sw/cmssw/pull/41541#issuecomment-1535308120 reports exactly the same: 47459 differences in the DQM comparisons of wf 23634.911. I haven't seen this kind of difference often before, but now twice today.
And another one in https://github.com/cms-sw/cmssw/pull/41532#issuecomment-1535500684, 4822 differences in workflow 23634.911.
Another one in https://github.com/cms-sw/cmssw/pull/41852#issuecomment-1574224215, 5582 differences in workflow 11634.911
(reopening the issue)
Another one in https://github.com/cms-sw/cmssw/pull/43041#issuecomment-1767197856, 6123 differences in workflow 11634.911. The CPU model was the same (Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz) for both the reference and the PR test.
To note here that https://github.com/cms-sw/cmssw/pull/43439 is removing 11634.911 from the short matrix, after which we would not see these instabilities anymore in PR tests.
> To note here that #43439 is removing 11634.911 from the short matrix, after which we would not see these instabilities anymore in PR tests.
Let me know if you think it is preferable to keep it just to have this "constant reminder" of the issue or if it is something that we can leave to IB tests.
From my point of view, keeping this issue open does not help much, even if we likely have a problem with 11634.911, which has been taken out of everyday testing.
We've observed differences in the DD4Hep workflow 11634.911 comparisons in tests of a few PRs that should not affect results of the DD4Hep workflow. This issue is to collect pointers to those comparisons.