cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

ROOT 6 master: Histogram merge error #25463

Closed Dr15Jones closed 5 years ago

Dr15Jones commented 5 years ago

In the ROOT6 IB, we are periodically seeing workflows (e.g. 137.8) failing in the DQM Harvest step with a new ROOT error message:

----- Begin Fatal Exception 10-Dec-2018 08:32:43 CET-----------------------
    An exception of category 'FatalRootError' occurred while
       [0] Calling InputSource::readRun_
       Additional Info:
          [a] Fatal Root Error: @SUB=TH1Merger::CheckForDuplicateLabels
    Histogram eventsPerPath_all has duplicate labels in the x axis. Bin contents will be merged in a single bin

----- End Fatal Exception -------------------------------------------------

cmsbuild commented 5 years ago

A new Issue was created by @Dr15Jones Chris Jones.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

Dr15Jones commented 5 years ago

assign dqm

cmsbuild commented 5 years ago

New categories assigned: dqm

@jfernan2,@andrius-k,@schneiml,@kmaeshima you have been requested to review this Pull request/Issue and eventually sign? Thanks

andrius-k commented 5 years ago

Hi @Dr15Jones, which IB is crashing and could you please provide a link to the Jenkins page or any reference that we could look into?

Dr15Jones commented 5 years ago

The failure has appeared in several CMSSW_10_4_ROOT6_X builds, the latest of which was CMSSW_10_4_X_2018-12-10-2300, which can be seen from the IB dashboard:

https://cmssdt.cern.ch/SDT/html/cmssdt-ib/#/ib/CMSSW_10_4_X

The link to the most recent failed IB workflow is https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc700/CMSSW_10_4_ROOT6_X_2018-12-10-2300/pyRelValMatrixLogs/run/137.8_RunEGamma2018C+RunEGamma2018C+HLTDR2_2018+RECODR2_2018reHLT_skimEGamma_Prompt_L1TEgDQM+RunEGamma2018D+HLTDR2_2018+RECODR2_2018reHLT_skimEGamma_Prompt_L1TEgDQM+HARVEST2018_L1TEgDQM_MULTIRUN/step7_RunEGamma2018C+RunEGamma2018C+HLTDR2_2018+RECODR2_2018reHLT_skimEGamma_Prompt_L1TEgDQM+RunEGamma2018D+HLTDR2_2018+RECODR2_2018reHLT_skimEGamma_Prompt_L1TEgDQM+HARVEST2018_L1TEgDQM_MULTIRUN.log#/

schneiml commented 5 years ago

The 137.8 is the new multi-run harvesting workflow. Good to see that it caught something. My crystal ball guess is that this is something that was illegal or did not make sense even in the current production release and was only caught now, probably in some subsystem module that does not expect multi-run harvesting to happen.

However, I can't reproduce that currently; ROOT crashes on initialization on lxplus7 in the CMSSW_10_4_ROOT6_X_2018-12-10-2300 IB:

cmsRun: /build/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/slc7_amd64_gcc700/lcg/root/6.15.01/root-6.15.01/interpreter/llvm/src/tools/clang/include/clang/Serialization/Module.h:72: clang::serialization::InputFile::InputFile(const clang::FileEntry*, bool, bool): Assertion `!(isOverridden && isOutOfDate) && "an overridden cannot be out-of-date"' failed.

Any hints?

Dr15Jones commented 5 years ago

@pcanal any idea what this failure could be?

/build/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/slc7_amd64_gcc700/lcg/root/6.15.01/root-6.15.01/interpreter/llvm/src/tools/clang/include/clang/Serialization/Module.h:72: clang::serialization::InputFile::InputFile(const clang::FileEntry*, bool, bool): Assertion `!(isOverridden && isOutOfDate) && "an overridden cannot be out-of-date"' failed.

Dr15Jones commented 5 years ago

@smuzaffar what is the environment you use to run the ROOT6 IBs?

pcanal commented 5 years ago

@Dr15Jones sorta. This indicates that some of the header files that are part of the ROOT pch files have been updated since ROOT was built, i.e. likely some system headers.

smuzaffar commented 5 years ago

@Dr15Jones, we use docker to build and run ROOT6 IBs. All of the 10.4.X IBs now run under docker (as nearly all of them are slc7 based).

Dr15Jones commented 5 years ago

@pcanal @smuzaffar This seems extremely bad. It implies that ROOT6 master can only run on the machine on which it was compiled. We need to determine what differences between the docker container and lxplus7 are causing the problem and make ROOT not care about them (since such differences are bound to happen on grid sites as well).

davidlange6 commented 5 years ago

adding @yamaguchi1024 as we talked of similar issues a week ago in the context of root modules..

On Dec 14, 2018, at 8:29 AM, Chris Jones notifications@github.com wrote:

@pcanal @smuzaffar This seems extremely bad. This is implying that ROOT6 master can only run on a machine on which it was compiled. We need to determine what differences between the docker container and lxplus7 are causing the problem and make ROOT not care about them (since such differences are bound to happen on grid sites as well).

smuzaffar commented 5 years ago

We have opened a JIRA ticket, https://sft.its.cern.ch/jira/browse/ROOT-9843, about this. On lxplus7 it is due to a glibc version update, but in our docker container we still have the old glibc:

[cmsbuild@050d8f6b1a80 build]$  rpm -qa | grep glibc
glibc-devel-2.17-222.el7.x86_64
glibc-common-2.17-222.el7.x86_64
glibc-2.17-222.el7.x86_64
glibc-headers-2.17-222.el7.x86_64

vgvassilev commented 5 years ago

Shahzad, could you bisect?

Is it possible that we build ROOT on a system with glibc version 2.17-260.el7 and then deploy it on 2.17-222.el7, or vice versa?

If that’s the case I’d expect that the error is unclear but correct. ROOT would store a zip of the header files of glibc and then find out that some of them have changed.
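As a rough illustration of the kind of check described here, the following Python sketch records a fingerprint for each header at build time and later reports which ones no longer match. It assumes a simple size-and-modification-time comparison; the function names `snapshot` and `changed_since` are hypothetical, and this is not ROOT's or Clang's actual code.

```python
import os


def snapshot(paths):
    """Record (size, mtime in ns) for each header, as one might at build time."""
    record = {}
    for path in paths:
        st = os.stat(path)
        record[path] = (st.st_size, st.st_mtime_ns)
    return record


def changed_since(recorded):
    """Return the paths that are missing or whose size/mtime differ from the record."""
    stale = []
    for path, fingerprint in recorded.items():
        try:
            st = os.stat(path)
        except FileNotFoundError:
            # A header that disappeared is treated as changed.
            stale.append(path)
            continue
        if (st.st_size, st.st_mtime_ns) != fingerprint:
            stale.append(path)
    return stale
```

With a check like this, a system update that replaces headers after the build would be flagged at run time, even though the headers on the build machine itself never changed.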

smuzaffar commented 5 years ago

@vgvassilev, since we run under docker, it is not possible that we pick up a different glibc version. Anyway, yesterday we updated ROOT master to bcd447b (commits from 18 Dec) and we also use -DLLVM_BUILD_TYPE=Release, but this workflow/test still fails with the same error. Both the ROOT and the CMSSW builds were done (under docker) on the same machine where the test was run, so there is no chance that the glibc version could have changed.

yamaguchi1024 commented 5 years ago

Hi all,

I asked Axel about this histogram merge issue with a link to this test failure, and got the following reply:

Yes, but that's fairly old on ROOT's side, this was changed months ago. It's when having histograms with text labels, i.e. {"cat1": 12, "cat2": 13}. We can merge two of these histograms by simply creating the super-set of the bin labels and then adding the values for each label. But in ROOT it's allowed to have {"cat1": 12, "cat1": 13}, i.e. repeated bin labels. And merging that will be weird; we will be combining these labels, and likely that's not what the user expected, because they created two bins with the same label. So they need to make a conscious decision. I.e. this is not a bug in ROOT; this is likely a design issue on their side, with whoever fills that histogram.
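To make the label-merge behaviour above concrete, here is a minimal Python sketch that does not use ROOT at all. Histograms are modelled as lists of (label, content) pairs so that repeated labels can be represented. The names `find_duplicate_labels` and `merge_labelled` are hypothetical, and raising an exception on duplicates is a choice of this sketch, not what ROOT itself does (ROOT's message says the contents are merged into a single bin).

```python
from collections import Counter


def find_duplicate_labels(bins):
    """Return the labels that appear on more than one bin, in first-seen order."""
    counts = Counter(label for label, _ in bins)
    return [label for label, n in counts.items() if n > 1]


def merge_labelled(hist_a, hist_b):
    """Merge two label-binned histograms given as lists of (label, content).

    The result has one bin per distinct label (the super-set of both inputs),
    with the contents for each label added together. Inputs that repeat a
    label are rejected, because summing would silently collapse those bins.
    """
    for name, hist in (("first", hist_a), ("second", hist_b)):
        dupes = find_duplicate_labels(hist)
        if dupes:
            raise ValueError(f"{name} histogram has duplicate labels: {dupes}")
    merged = {}
    for label, content in list(hist_a) + list(hist_b):
        merged[label] = merged.get(label, 0) + content
    return merged
```

Running `find_duplicate_labels` over the x-axis labels of a histogram such as eventsPerPath_all before harvesting would be one way to locate the module that books two bins with the same label.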

I think Lorenzo and Axel are the people responsible for histograms, so you can discuss this issue with them. Let me know if I can help as well.

fioriNTU commented 5 years ago

+1

Fixed by the PR above

cmsbuild commented 5 years ago

This issue is fully signed and ready to be closed.