cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Spurious DQM differences in MVA-related quantities (`tensorflow` related ?) #42525

Open missirol opened 1 year ago

missirol commented 1 year ago

The PR tests in https://github.com/cms-sw/cmssw/pull/42497#issuecomment-1670609280 showed unexpected differences in DQM comparisons of physics quantities.

The same PR tests also reported the following message (unexpectedly, based on the PR itself).

You potentially added 200 lines to the logs

This seems to be mostly due to the following message in the post-PR logs (see here):

2023-08-09 03:27:47.746065: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
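For reference, the switch named in the log message can be set for a child process as sketched below. This is only an illustration: the helper name `env_without_onednn` is hypothetical, and whether turning oneDNN off removes the DQM differences has not been checked in this thread.

```python
import os


def env_without_onednn(base_env=None):
    """Return a copy of the environment with oneDNN custom ops turned off.

    TF_ENABLE_ONEDNN_OPTS=0 is the variable named in the TensorFlow log
    message; the helper itself is a hypothetical illustration.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["TF_ENABLE_ONEDNN_OPTS"] = "0"
    return env


# The variable is best set before TensorFlow is loaded, for example by
# passing the returned mapping to subprocess.run(..., env=env_without_onednn()).
```

The helper copies the environment rather than modifying `os.environ`, so the setting applies only to the job that is launched with it.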

@makortel noted in https://github.com/cms-sw/cmssw/pull/42497#issuecomment-1671329990 that

the baseline tests were run on Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell) whereas the PR tests were run on Intel(R) Xeon(R) Gold 5218 (Cascade Lake). I don't know if the log message above is related to the microarchitecture, or something orthogonal. [..] Tensorflow was updated last week (cms-sw/cmsdist#8565 [..]), which likely explains why this printout has not been seen before.

cms-sw/cmsdist#8565 was integrated in CMSSW_13_3_X_2023-08-01-2300, and it updated Tensorflow to 2.12.0 (it was 2.6.4 in CMSSW_13_3_X_2023-08-01-1100).
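The "different computation orders" effect mentioned in the log message can be reproduced in plain Python, independently of TensorFlow. The sketch below only shows that floating-point addition is not associative; it does not claim that this is the mechanism behind the DQM differences seen here.

```python
import math

# Same three numbers, summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# A bitwise comparison sees a difference ...
bitwise_equal = left_to_right == right_to_left

# ... while a comparison with a small relative tolerance does not.
close_enough = math.isclose(left_to_right, right_to_left, rel_tol=1e-12)

print(bitwise_equal, close_enough)  # prints: False True
```

A library that reorders reductions depending on the available instruction set can, in the same way, give results that differ in the last bits from one CPU to another.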

[1]

136.793
136.874
138.4
138.5
139.001
141.042
312.0
10024.0
10042.0
10224.0
10824.0
11634.0
11634.911
11634.914
11834.0
12434.0
12434.7
250202.181

[2]

23234.0
24834.0
24834.911
24896.0
24900.0
25034.999

cmsbuild commented 1 year ago

A new Issue was created by @missirol Marino Missiroli.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 1 year ago

assign reconstruction, dqm

makortel commented 1 year ago

  • DQM differences are reported for several wfs, listed in [1] and [2]. All the wfs in [1] reported differences only in the DQM folder named Tracking.

FYI @cms-sw/tracking-pog-l2

cmsbuild commented 1 year ago

New categories assigned: dqm,reconstruction

@tjavaid,@micsucmed,@nothingface0,@rvenditti,@emanueleusai,@syuvivida,@clacaputo,@mandrenguyen,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

slava77 commented 1 year ago

type tracking

makortel commented 1 year ago

Occurred in https://github.com/cms-sw/cmssw/pull/42517#issuecomment-1671956044 between

makortel commented 1 year ago

Occurred in https://github.com/cms-sw/cmssw/pull/42506#issuecomment-1673099506 between

missirol commented 1 year ago

Another example in https://github.com/cms-sw/cmssw/pull/42562#issuecomment-1681201183

missirol commented 1 year ago

Another example in https://github.com/cms-sw/cmssw/pull/42610#issuecomment-1685128180

mmusich commented 1 year ago

Another example in https://github.com/cms-sw/cmssw/pull/42622#issuecomment-1688082001

missirol commented 1 year ago

Another example in https://github.com/cms-sw/cmssw/pull/42707#issuecomment-1703882846

smuzaffar commented 1 year ago

This happens when the IB baseline and the PR relvals are generated on two different sets of build nodes (e.g. OpenStack-based VMs and HTCondor-based batch nodes). For now I have updated the Jenkins jobs to not use HTCondor nodes for baseline/PR relvals. I hope this will reduce the frequency of these DQM differences.
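One way to spot such a mismatch when looking at a comparison is to record the CPU model of the node for both the baseline and the PR job. The sketch below parses `/proc/cpuinfo`-style text; the function names are hypothetical and this is not what the Jenkins jobs actually do.

```python
def cpu_model(cpuinfo_text):
    """Return the first 'model name' value found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            return value.strip()
    return None


def same_cpu_model(baseline_cpuinfo, pr_cpuinfo):
    """True only if both texts report the same, non-missing CPU model."""
    baseline = cpu_model(baseline_cpuinfo)
    pr = cpu_model(pr_cpuinfo)
    return baseline is not None and baseline == pr


# Example with the two CPU models quoted earlier in this issue.
baseline_text = "processor\t: 0\nmodel name\t: Intel(R) Xeon(R) CPU E5-2683 v4\n"
pr_text = "processor\t: 0\nmodel name\t: Intel(R) Xeon(R) Gold 5218\n"
print(same_cpu_model(baseline_text, pr_text))  # prints: False
```

On a Linux node the input text could be read from `/proc/cpuinfo`; a `False` result would flag the comparison as one where small numerical differences are more likely.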