Closed nsmith- closed 3 years ago
A new Issue was created by @nsmith- Nicholas Smith.
@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign core
New categories assigned: core
@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks
@nsmith- How urgently would this information be needed?
For now it supplies only additional, monitoring information. So, within a few months would be OK
@nsmith- The reporting of file closures in EmbeddedRootSource
is now implemented in #35074. Is it sufficient to add this feature in 12_1_X, or would you need it in some earlier release cycles as well?
I would appreciate it if we could backport this to some of the production releases, e.g. 10.6.
@nsmith- I just made a backport to 10_6_X in https://github.com/cms-sw/cmssw/pull/35180. If any other release cycle is needed, please ask.
10_6_X covers the digi step for UL so that should be OK for now
So it turns out (thanks @Dr15Jones) that https://github.com/cms-sw/cmssw/pull/35074 (and https://github.com/cms-sw/cmssw/pull/35180) are causing an ASAN failure:
==51632==ERROR: AddressSanitizer: attempting double-free on 0x60d0004c03a0 in thread T0:
#0 0x2b1d84988da7 in operator delete(void*) ../../../../libsanitizer/asan/asan_new_delete.cpp:160
#1 0x2b1d8837ff11 in __gnu_cxx::new_allocator<char>::deallocate(char*, unsigned long) /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_0_0_pre3-slc7_amd64_gcc10/build/CMSSW_12_0_0_pre3-build/BUILD/slc7_amd64_gcc10/external/gcc/10.3.0/gcc-10.3.0/obj/x86_64-unknown-linux-gnu/libstdc++-v3/include/ext/new_allocator.h:133
#2 0x2b1d8837ff11 in std::allocator_traits<std::allocator<char> >::deallocate(std::allocator<char>&, char*, unsigned long) /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_0_0_pre3-slc7_amd64_gcc10/build/CMSSW_12_0_0_pre3-build/BUILD/slc7_amd64_gcc10/external/gcc/10.3.0/gcc-10.3.0/obj/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/alloc_traits.h:492
#3 0x2b1d8837ff11 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_destroy(unsigned long) /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_0_0_pre3-slc7_amd64_gcc10/build/CMSSW_12_0_0_pre3-build/BUILD/slc7_amd64_gcc10/external/gcc/10.3.0/gcc-10.3.0/obj/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/basic_string.h:237
#4 0x2b1d8837ff11 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_dispose() /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_0_0_pre3-slc7_amd64_gcc10/build/CMSSW_12_0_0_pre3-build/BUILD/slc7_amd64_gcc10/external/gcc/10.3.0/gcc-10.3.0/obj/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/basic_string.h:232
#5 0x2b1d8837ff11 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_assign(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_0_0_pre3-slc7_amd64_gcc10/build/CMSSW_12_0_0_pre3-build/BUILD/slc7_amd64_gcc10/external/gcc/10.3.0/gcc-10.3.0/obj/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/basic_string.tcc:266
#6 0x2b1d8f8a2889 in edm::storage::StatisticsSenderService::filePreCloseEvent(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) (/cvmfs/cms-ib.cern.ch/nweek-02697/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_ASAN_X_2021-09-08-2300/lib/slc7_amd64_gcc10/libUtilitiesStorageFactory.so+0x3e889)
#7 0x2b1d9934e45b in edm::RootEmbeddedFileSequence::closeFile_() (/cvmfs/cms-ib.cern.ch/nweek-02697/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_ASAN_X_2021-09-08-2300/lib/slc7_amd64_gcc10/pluginIOPoolInput.so+0xd645b)
#8 0x2b1d993f474f in edm::RootInputFileSequence::initTheFile(bool, bool, edm::InputSource*, char const*, edm::InputType) (/cvmfs/cms-ib.cern.ch/nweek-02697/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_ASAN_X_2021-09-08-2300/lib/slc7_amd64_gcc10/pluginIOPoolInput.so+0x17c74f)
#9 0x2b1d993475b4 in edm::RootEmbeddedFileSequence::readOneRandom(edm::EventPrincipal&, unsigned long&, CLHEP::HepRandomEngine*, edm::EventID const*, bool) (/cvmfs/cms-ib.cern.ch/nweek-02697/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_ASAN_X_2021-09-08-2300/lib/slc7_amd64_gcc10/pluginIOPoolInput.so+0xcf5b4)
#10 0x2b1ddce4ba4c in unsigned long edm::VectorInputSource::loopOverEvents<edm::RecordEventID<std::_Bind<bool (edm::PreMixingModule::*(std::reference_wrapper<edm::PreMixingModule>, std::_Placeholder<1>, int, std::_Placeholder<2>, std::reference_wrapper<edm::EventSetup const>, edm::ModuleCallingContext const*, edm::PreMixingModule::AdjustPileupDistribution const*))(edm::EventPrincipal const&, int, int, edm::EventSetup const&, edm::ModuleCallingContext const*, edm::PreMixingModule::AdjustPileupDistribution const*)> > >(edm::EventPrincipal&, unsigned long&, unsigned long, edm::RecordEventID<std::_Bind<bool (edm::PreMixingModule::*(std::reference_wrapper<edm::PreMixingModule>, std::_Placeholder<1>, int, std::_Placeholder<2>, std::reference_wrapper<edm::EventSetup const>, edm::ModuleCallingContext const*, edm::PreMixingModule::AdjustPileupDistribution const*))(edm::EventPrincipal const&, int, int, edm::EventSetup const&, edm::ModuleCallingContext const*, edm::PreMixingModule::AdjustPileupDistribution const*)> >, CLHEP::HepRandomEngine*, edm::EventID const*, bool) (/cvmfs/cms-ib.cern.ch/nweek-02697/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_ASAN_X_2021-09-08-2300/lib/slc7_amd64_gcc10/pluginSimGeneralPreMixingModulePlugins.so+0x57a4c)
#11 0x2b1ddce4c6b0 in void edm::PileUp::readPileUp<std::_Bind<bool (edm::PreMixingModule::*(std::reference_wrapper<edm::PreMixingModule>, std::_Placeholder<1>, int, std::_Placeholder<2>, std::reference_wrapper<edm::EventSetup const>, edm::ModuleCallingContext const*, edm::PreMixingModule::AdjustPileupDistribution const*))(edm::EventPrincipal const&, int, int, edm::EventSetup const&, edm::ModuleCallingContext const*, edm::PreMixingModule::AdjustPileupDistribution const*)> >(edm::EventID const&, std::vector<edm::SecondaryEventIDAndFileInfo, std::allocator<edm::SecondaryEventIDAndFileInfo> >&, std::_Bind<bool (edm::PreMixingModule::*(std::reference_wrapper<edm::PreMixingModule>, std::_Placeholder<1>, int, std::_Placeholder<2>, std::reference_wrapper<edm::EventSetup const>, edm::ModuleCallingContext const*, edm::PreMixingModule::AdjustPileupDistribution const*))(edm::EventPrincipal const&, int, int, edm::EventSetup const&, edm::ModuleCallingContext const*, edm::PreMixingModule::AdjustPileupDistribution const*)>, int, edm::StreamID const&) (/cvmfs/cms-ib.cern.ch/nweek-02697/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_ASAN_X_2021-09-08-2300/lib/slc7_amd64_gcc10/pluginSimGeneralPreMixingModulePlugins.so+0x586b0)
#12 0x2b1ddce32c83 in edm::PreMixingModule::doPileUp(edm::Event&, edm::EventSetup const&) (/cvmfs/cms-ib.cern.ch/nweek-02697/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_ASAN_X_2021-09-08-2300/lib/slc7_amd64_gcc10/pluginSimGeneralPreMixingModulePlugins.so+0x3ec83)
#13 0x2b1ddc413d08 in edm::BMixingModule::produce(edm::Event&, edm::EventSetup const&) (/cvmfs/cms-ib.cern.ch/nweek-02697/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_ASAN_X_2021-09-08-2300/lib/slc7_amd64_gcc10/libMixingBase.so+0x4cd08)
#14 0x2b1d85baedde in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) (/cvmfs/cms-ib.cern.ch/nweek-02697/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_ASAN_X_2021-09-08-2300/lib/slc7_amd64_gcc10/libFWCoreFramework.so+0x909dde)
#15 0x2b1d85b1b492 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) (/cvmfs/cms-ib.cern.ch/nweek-02697/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_ASAN_X_2021-09-08-2300/lib/slc7_amd64_gcc10/libFWCoreFramework.so+0x876492)
#16 0x2b1d85809564 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) (/cvmfs/cms-ib.cern.ch/nweek-02697/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_ASAN_X_2021-09-08-2300/lib/slc7_amd64_gcc10/libFWCoreFramework.so+0x564564)
#17 0x2b1d85809daa in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) (/cvmfs/cms-ib.cern.ch/nweek-02697/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_ASAN_X_2021-09-08-2300/lib/slc7_amd64_gcc10/libFWCoreFramework.so+0x564daa)
#18 0x2b1d85815806 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() (/cvmfs/cms-ib.cern.ch/nweek-02697/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_ASAN_X_2021-09-08-2300/lib/slc7_amd64_gcc10/libFWCoreFramework.so+0x570806)
#19 0x2b1d8650b231 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) (/cvmfs/cms-ib.cern.ch/nweek-02697/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_ASAN_X_2021-09-08-2300/lib/slc7_amd64_gcc10/libFWCoreConcurrency.so+0x11231)
#20 0x2b1d87b467fe in tbb::detail::d1::task* tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter>(tbb::detail::d1::task*, tbb::detail::r1::external_waiter&) /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/slc7_amd64_gcc10/external/tbb/v2021.3.0-2fe36bef039e4e6ef0dfc9f6fbb735c9/tbb-v2021.3.0/src/tbb/task_dispatcher.h:321
#21 0x2b1d87b467fe in tbb::detail::d1::task* tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter>(tbb::detail::d1::task*, tbb::detail::r1::external_waiter&) /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/slc7_amd64_gcc10/external/tbb/v2021.3.0-2fe36bef039e4e6ef0dfc9f6fbb735c9/tbb-v2021.3.0/src/tbb/task_dispatcher.h:462
#22 0x2b1d87b467fe in tbb::detail::r1::task_dispatcher::execute_and_wait(tbb::detail::d1::task*, tbb::detail::d1::wait_context&, tbb::detail::d1::task_group_context&) /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/slc7_amd64_gcc10/external/tbb/v2021.3.0-2fe36bef039e4e6ef0dfc9f6fbb735c9/tbb-v2021.3.0/src/tbb/task_dispatcher.cpp:168
#23 0x2b1d8555553b in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) (/cvmfs/cms-ib.cern.ch/nweek-02697/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_ASAN_X_2021-09-08-2300/lib/slc7_amd64_gcc10/libFWCoreFramework.so+0x2b053b)
#24 0x2b1d85585002 in edm::EventProcessor::runToCompletion() (/cvmfs/cms-ib.cern.ch/nweek-02697/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_ASAN_X_2021-09-08-2300/lib/slc7_amd64_gcc10/libFWCoreFramework.so+0x2e0002)
#25 0x40e961 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const (/cvmfs/cms-ib.cern.ch/nweek-02697/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_ASAN_X_2021-09-08-2300/bin/slc7_amd64_gcc10/cmsRun+0x40e961)
#26 0x2b1d87b2ef8e in tbb::detail::r1::task_arena_impl::execute(tbb::detail::d1::task_arena_base&, tbb::detail::d1::delegate_base&) /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/slc7_amd64_gcc10/external/tbb/v2021.3.0-2fe36bef039e4e6ef0dfc9f6fbb735c9/tbb-v2021.3.0/src/tbb/arena.cpp:696
#27 0x411e56 in main::{lambda()#1}::operator()() const (/cvmfs/cms-ib.cern.ch/nweek-02697/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_ASAN_X_2021-09-08-2300/bin/slc7_amd64_gcc10/cmsRun+0x411e56)
#28 0x40c605 in main (/cvmfs/cms-ib.cern.ch/nweek-02697/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_ASAN_X_2021-09-08-2300/bin/slc7_amd64_gcc10/cmsRun+0x40c605)
#29 0x2b1d88975554 in __libc_start_main (/lib64/libc.so.6+0x22554)
#30 0x40c948 (/cvmfs/cms-ib.cern.ch/nweek-02697/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_ASAN_X_2021-09-08-2300/bin/slc7_amd64_gcc10/cmsRun+0x40c948)
On a quick inspection it is rather evident that StatisticsSenderService::filePreCloseEvent()
is not thread safe. I made an attempt to quickly make it re-entrant, but I'm afraid the StatisticsSenderService
would need a more complete overhaul.
In fact, I suspect it may already be reporting incorrect servers. The server names are used in fillUDP()
https://github.com/cms-sw/cmssw/blob/6f84b7b545d851cee15a6bdefff659548ac11002/Utilities/StorageFactory/src/StatisticsSenderService.cc#L241-L245
and they are being set in setCurrentServer()
https://github.com/cms-sw/cmssw/blob/6f84b7b545d851cee15a6bdefff659548ac11002/Utilities/StorageFactory/src/StatisticsSenderService.cc#L165-L169
which is called from
https://github.com/cms-sw/cmssw/blob/6f84b7b545d851cee15a6bdefff659548ac11002/Utilities/XrdAdaptor/src/XrdRequestManager.cc#L241-L256
But if pileup files are also being read over xrootd, what guarantees that at the time of the primary file close m_serverhost
would still point to the server providing the primary file and not to a server for one of the pileup files?
An overhaul of StatisticsSenderService
will take some time, so unfortunately for now we need to revert the two PRs. I'll open those reverts shortly.
@nsmith- (or @smuzaffar) Could you reopen this issue?
Is this somehow not a concern for the existing PoolSource usage of StatisticsSenderService? I think even in this case we could have multiple files open thanks to secondary inputs?
All sources do their work serially. It seems to me that the secondary input file closings are not reported to the StatisticsSenderService, given that the preFileClose signal is emitted here (via FileCloseSentry)
https://github.com/cms-sw/cmssw/blob/40d49e2537b51e42451082b51076f0b5a7f1634f/IOPool/Input/src/RootPrimaryFileSequence.cc#L117-L126
but not here
https://github.com/cms-sw/cmssw/blob/40d49e2537b51e42451082b51076f0b5a7f1634f/IOPool/Input/src/RootSecondaryFileSequence.cc#L57-L63
Then, given that the PoolSource
operates on one primary file at a time, reporting those without explicit protections against other threads is fine (except for my earlier concern that the server may get messed up if secondary files or pileup files are read over xrootd).
Ok, and I guess this brings up another feature request: to have secondary input file closes also be reported to StatisticsSenderService :)
Here is another issue in addition to the server name. The size of the reported file is set here
https://github.com/cms-sw/cmssw/blob/40d49e2537b51e42451082b51076f0b5a7f1634f/Utilities/StorageFactory/src/StatisticsSenderService.cc#L172
(m_size is atomic) and that gets called in TStorageFactoryFile::Initialize
https://github.com/cms-sw/cmssw/blob/40d49e2537b51e42451082b51076f0b5a7f1634f/IOPool/TFileAdaptor/src/TStorageFactoryFile.cc#L196-L199
which I guess gets called for every ROOT file opened in CMSSW (so also those opened for writing).
@nsmith- Could you reopen the issue?
I don't seem to have permission to do so
Ok :( Maybe that's because it was closed automatically when the linked PR was merged. @smuzaffar Could you reopen this issue?
reopened now
@nsmith- I just did a deep dive into our super archaic storage code (left over from before CMSSW existed, when we were using CERN's POOL system). Turns out the statistics being sent are not about the primary file but instead come from every interaction with every ROOT file opened via a call to TFile::Open anywhere in the cmsRun job. This is baked into the heart of that system with a series of singletons and global variables (ick).
I THINK I can change the system without too much upheaval to report each Framework-controlled TFile close (that is, primary, secondary and embedded files), but I don't know what you'd want to do with the statistics. Would you want to always send the aggregate, or do you want different kinds of messages, with the primary giving the aggregate and the others just reporting the file closing?
@nsmith- we need further feedback from you to understand what changes could be done to still work with your use case.
I'm a bit puzzled how it is all files, because empirically we never see any pileup files in our monitoring that receives these messages. Is it possibly related to the thread safety issues of the existing implementation? For the sake of rucio popularity we really only need a record of each file use (i.e. one record per LFN per job); the single and vector read statistics would likely still be interesting for future analyses (actually they are summarized here, but without much drill-down support). Can you clarify a bit what you mean by 'aggregate' here?
@nsmith-
I'm a bit puzzled how it is all files, because empirically we never see any pileup files in our monitoring that receives these messages. Is it possibly related to the thread safety issues of the existing implementation?
Although only 1 file name is reported, the statistics (e.g. bytes read) are actually the accumulated statistics for ALL ROOT files being used in the job.
And I assume reporting the isolated statistics per file is more challenging?
And I assume reporting the isolated statistics per file is more challenging?
It would be a near complete rewrite since it is full of singletons/global variables to do the accounting.
As a historical note, the code in question predates CMSSW and was taken from CERN's original 'SEAL' project which was trying to write a C++ toolkit.
And I assume reporting the isolated statistics per file is more challenging?
It would be a near complete rewrite since it is full of singletons/global variables to do the accounting.
Let me add that while a "near complete rewrite" of the StatisticsSenderService
would be straightforward to backport to production release cycles, I'm really hesitant about backporting a "near complete rewrite" of the storage codebase (even if it is archaic).
I think part of the problem is a misinterpretation of the sent information. The info is not a 'per file' message; it is instead a snapshot of the accumulated state of the job at that instant in time. For the UDP message, that snapshot is taken at an arbitrary time. For the StatisticsSenderService it is taken when the primary file was closed.
It seems now the request is to also send information when secondary/embedded files are closed. That could still send the present 'full job statistics summary' information along.
I see, I didn't realize it was an accumulated statistic over the duration of the job. I guess since there is a unique_id field, one can track job progress from these and only analyze the last such record received, to avoid double-counting in any analysis of these statistics. In this case, it should be fine to just send messages more often, i.e. when any secondary/embedded file is closed. For file popularity, of course, these byte statistics do not really matter.
@bbockelm what was your intent when you added the CondorStatusService and StatisticsSenderService?
So it looks like the StatisticsSenderService has three pieces of information that would need to be handled differently if we report more than just the primary file. These are _filelfn, _serverhost, and _serverdomain.
It seems like we already have a problem. Right now the XrdAdaptor::RequestManager appears to change the server_* information whenever ANY file changes its server. So right now the reports have the LFN of the primary file but might be reporting the server_* of the secondary/embedded file that last changed servers.
@nsmith- #35505 makes secondary and embedded file closures cause a UDP packet to be sent. The statistics are reset only on primary file closures (as currently). Can you comment on whether this is OK for the downstream that interprets the statistics? (The packet has timestamps for the beginning and end of the statistics collection period.)
Is there a marker in the packet to tell whether the current file is a primary file? If so, then at least it is possible to reconstruct the correct statistics, if I understand correctly.
Is there a marker to tell whether the current file is a primary file in the packet?
I left the packet info exactly as it had been, so as to not cause any problems with code reading those packets. Though I'd think it would be pretty easy to differentiate primary from pileup, given pileup LFNs tend to have 'MinimumBias' in their name.
If you want something else done, please let us know. This information is not used in any way by the framework, so you can basically say what you want.
For pileup I agree, but if we also now catch secondary inputs (which, granted, do not ever get used in production jobs... at least not yet) then it may be more difficult. It should be harmless to add a new field to the JSON, but I'll ping @vkuznet to confirm, since IIRC he wrote the server daemon that parses these packets and injects them into the MONIT infrastructure.
@nsmith- @vkuznet we are still waiting for a response about our questions.
@nsmith-, yes, the new UDP server I wrote just reads incoming UDP packets; these packets contain JSON, and if the JSON has invalid data or data-types the server drops it, and therefore we lose this information in our accounting. Each valid JSON is injected into the MONIT infrastructure, while all invalid JSONs are dropped.
@vkuznet the question is, do you need to keep the same content of the JSON, or can it be modified (e.g. adding info about whether this is the primary file)?
Since the JSON is fed into the MONIT infrastructure it is desirable to keep the same schema. But since we push data into an ElasticSearch DB, which is schemaless, it is possible to change the schema. I suggest that we discuss and settle on a schema for the document. For clarity, here is the current schema (I took a single document from ES):
{
"site_name": "T2_ES_CIEMAT",
"user_dn": "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=amaltaro/CN=718748/CN=Alan Malta Rodrigues/CN=1623138246/CN=1922216409/CN=2104111278/CN=1595834568",
"client_host": "gaew0199",
"client_domain": "ciemat.es",
"server_host": "gaeds014",
"server_domain": "ciemat.es",
"unique_id": "78537596-5E7E-0C42-8DAE-B041909DE5D4-0",
"file_lfn": "/store/mc/RunIISummer20UL17MiniAOD/ST_t-channel_antitop_4f_hdampup_InclusiveDecays_TuneCP5_13TeV-powheg-madspin-pythia8/MINIAODSIM/106X_mc2017_realistic_v6-v3/270000/5CC220C2-111D-7144-A91F-4A3EAFAE5FF4.root",
"file_size": 6502578,
"read_single_sigma": 19570.1,
"read_single_average": 1917.9,
"read_vector_average": 17335800,
"read_vector_sigma": 15408700,
"read_vector_count_average": 16.5444,
"read_vector_count_sigma": 18.5605,
"read_bytes": 1569091504,
"read_bytes_at_close": 1569091504,
"read_single_operations": 4625,
"read_single_bytes": 8870294,
"read_vector_operations": 90,
"read_vector_bytes": 1560221210,
"start_time": 1633969170,
"end_time": 1633971565
}
If you add new key-value pairs it will be easier to accommodate them in ES and the monitoring infrastructure (in MONIT dashboards), but if you change data-types or drop attributes this will lead to conflicts which must be resolved. And we advise keeping the JSON flat, i.e. only key-value pairs, since that simplifies indexing and visualization queries.
In this case we just propose to add a new key is_primary: bool
(or whatever name sounds good) to the record. While on the subject, one thing we've noticed is that the field fallback
is either true or not present (null). It would be nice to populate the record with false when fallback is not used to open the file.
The most important thing is that you should validate your JSON. We often see that people confuse Python dicts with valid JSON, and even here, where we're discussing C++ code that generates JSON, it is possible to supply an invalid data-type which may cause issues (a typical error we saw many times: a number (int/float) supplied as a string in JSON). That said, if you define an attribute with some data-type you should ensure that the generated JSON is fully compatible with that data-type.
@nsmith- from the code change I'm making, it is actually easy to say if this is a primary file, a 'secondary' file (i.e. the parent file when using the two-file solution), or if this is from a 'secondary source' (i.e. from a pileup file or some other module using the EmbeddedRootSource objects). Do you still only want is_primary: bool, or would you want something more inclusive like file_usage: string with specified names, e.g. primary, secondary, embedded?
Specified names sound even better!
One question: so if I understand, the read_bytes
and related fields are only an accumulator for the entire job and get reset each time a new primary file is opened? What is read_bytes_at_close
for?
One question, so if I understand the read_bytes and related fields are only an accumulator for the entire job and get reset each time a new primary file is opened? What is read_bytes_at_close for?
I don't know. This was not code written by the framework team; we just 'inherited' it. From the implementation
https://github.com/cms-sw/cmssw/blob/2ae6e5ddf11fbf5593e598253b2214e65e3186db/Utilities/StorageFactory/src/StatisticsSenderService.cc#L110-L113
read_bytes and read_bytes_at_close are identical. Going all the way back to the very first commit of that file by Brian, those two values have always been identical.
@nsmith- I double-checked with the MONIT team and there would be no problem if we add additional attributes to the existing schema. Therefore, you may add as many as you want. I also checked the UDP collector server, and it is transparent to the JSON itself, i.e. it does not care about its content. Therefore, I would assume that once you add one or more attributes they will simply pop up in ES, and we will only need to adjust our dashboards to use them as filters or attributes for plots.
Thanks Valentin, this is great news! In fact I think this gives us license to completely redesign the reporting if we so choose (and perhaps continue to send the existing data, just adding a new packet with modified data).
Chris, given what you were saying about global variables and when things are updated, do you think server_domain
is trustworthy for the current records? Also, in view of our multi-source algorithm, is it likely the server domain is only one of many?
For several years now, any cmsRun process at a site with a statistics destination configured in their
site-local-config.xml
would send a small UDP packet on every primary file close, reporting the LFN and some summary statistics. The code responsible is found at https://github.com/cms-sw/cmssw/blob/master/Utilities/StorageFactory/src/StatisticsSenderService.cc which registers an event watcher. However, empirically this event seems to only fire for files opened through InputSource, and if you search for
postCloseFileSignal_
, only https://github.com/cms-sw/cmssw/blob/7e1bd4eeba8079d871a9dce744c32997c27ce53e/FWCore/Framework/src/InputSource.cc#L505-L508 shows up. It seems that the mixing module uses a different class chain, EmbeddedRootSource : VectorInputSource,
for file handling. Can we add a postCloseFileSignal_
to this as well?