cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

concurrent histogram filling #19395

Open fwyzard opened 7 years ago

fwyzard commented 7 years ago

The current DQM framework deals with concurrency by keeping a copy of each MonitorElement for each stream. This scales well, but incurs a high memory usage (or rather, it results in no memory reduction from multithreading).

@jpivarski has studied a possible approach:

I did a study of dense, concurrent histograms https://indico.cern.ch/event/607830/contributions/2613395/ to entice someone to implement them as a part of ROOT:

What I found is that copy-and-swap and std::atomic algorithms scale extremely well (the horizontal axis is dozens of threads), and that the failure to scale is dominated by something other than contention (two threads wanting to increment the same bin at the same time). We know this because the scaling of a naive implementation (hist[i]++) is not much better. At these GHz rates, it's probably a hardware limitation: serializing on the memory bus or copying from CPU cache to main memory.

Sparse histograms would be a different story, because that's more like a hash map than an array. But given the success of std::atomic[NUM_BINS], it would be worth trying std::map<long, std::atomic>. Or the copy-and-swap method, because then you can see what's going on and measure the collision rate. (std::atomic mysteriously does better than the naive implementation in some cases, and not understanding that makes me wary.)

Oh, and I've put all the code for these studies on https://github.com/diana-hep/parallel-histogram-metrics . Feel free to fork it.

I have looked at some alternatives as well, based on a toy histogram class with float content and int counter

From a first look, both with high contention (i.e. a test job doing only histogram filling with N threads) and with lower contention (i.e. burning CPU on unrelated operations between filling the histograms), std::atomic<> gives the best performance.
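
For illustration, a minimal sketch of such a toy class (not the actual test code; the names are made up, and with C++14 the float sum needs a compare-exchange loop, since std::atomic<float> only gains fetch_add in C++20):

    #include <atomic>
    #include <cstddef>
    #include <vector>

    // Toy fixed-binning histogram with per-bin atomic content and entry counter.
    // Fill() can be called concurrently from many threads without locking.
    class ToyAtomicHisto {
    public:
      ToyAtomicHisto(std::size_t nbins, float xlow, float xhigh)
          : sum_(nbins), count_(nbins), xlow_(xlow), width_((xhigh - xlow) / nbins) {}

      void Fill(float x, float w = 1.f) {
        std::size_t bin = index(x);
        // no std::atomic<float>::fetch_add before C++20: use a compare-exchange loop
        float old = sum_[bin].load(std::memory_order_relaxed);
        while (!sum_[bin].compare_exchange_weak(old, old + w, std::memory_order_relaxed)) {
        }
        count_[bin].fetch_add(1, std::memory_order_relaxed);
      }

    private:
      std::size_t index(float x) const {
        if (x < xlow_) return 0;                              // fold underflow into the first bin
        std::size_t bin = static_cast<std::size_t>((x - xlow_) / width_);
        return bin < sum_.size() ? bin : sum_.size() - 1;     // fold overflow into the last bin
      }

      std::vector<std::atomic<float>> sum_;    // per-bin sum of weights
      std::vector<std::atomic<int>> count_;    // per-bin number of entries
      float xlow_, width_;
    };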

cmsbuild commented 7 years ago

A new Issue was created by @fwyzard Andrea Bocci.

@davidlange6, @Dr15Jones, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

dmitrijus commented 7 years ago

Hi @fwyzard,

Atomics are problematic, unless they are eventually supported by ROOT. The tests mentioned above don't track statistics / standard errors, which users often rely on.

One solution would be to implement a 'special' kind of monitor element, bookAtomicTH1F, for which any direct access to the TH1F would be disabled (getTH1 would throw an exception); Fill() would be implemented by the DQM and do an atomic compare-and-swap.
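
A minimal sketch of what such a monitor element could look like (all names are hypothetical, a plain fetch_add on integer bins stands in for the compare-and-swap, and std::logic_error stands in for whatever exception the DQM would actually throw):

    #include <atomic>
    #include <cstddef>
    #include <stdexcept>
    #include <vector>

    class TH1F;  // the ROOT histogram would only be materialized once filling is over

    class AtomicMonitorElement {
    public:
      AtomicMonitorElement(std::size_t nbins, double xlow, double xhigh)
          : bins_(nbins + 2), xlow_(xlow), width_((xhigh - xlow) / nbins) {}

      // lock-free concurrent filling: each bin is a separate atomic counter
      void Fill(double x) { bins_[findBin(x)].fetch_add(1, std::memory_order_relaxed); }

      // direct access to the underlying ROOT object is forbidden while filling
      TH1F* getTH1() const {
        throw std::logic_error("direct TH1 access is disabled for atomic monitor elements");
      }

    private:
      std::size_t findBin(double x) const {
        if (x < xlow_) return 0;                                     // underflow bin
        std::size_t bin = 1 + static_cast<std::size_t>((x - xlow_) / width_);
        return bin < bins_.size() - 1 ? bin : bins_.size() - 1;      // overflow bin
      }

      std::vector<std::atomic<long>> bins_;  // nbins plus under/overflow
      double xlow_, width_;
    };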

I had a quick chat with @VinInn and he presented another idea, which I really like: instead of playing with atomics or explicit locking, why not simply let the framework handle the parallel execution / scheduling. In short, remove parallelism from individual modules, but exploit "module level" concurrency.

In this case, we would have a special "one" kind of module. Not the current "one" module as defined in the framework, but "one" as in "analyze() is called for every event", while multiple such "one" modules are allowed to run in parallel. So a DQMEDAnalyzer would only be responsible for filling its own histograms and would never have to deal with concurrency, as it would always run within its own thread. However, the other threads would not be idle or waiting for that module to finish: they would be busy running other DQMEDAnalyzers. Currently, DQMEDAnalyzers do not depend on each other; they share nothing (except booking, but that goes through a mutex anyway) and thus can be run in parallel. There are tons of DQM modules, each subsystem has hundreds of them, so the threads would always be occupied anyway.

I don't know if this is possible; it is something to be discussed with @davidlange6 and @Dr15Jones. But it would simplify the entire DQM: there would simply be no need to merge histograms or keep per-thread copies of them.

Obviously, this is not efficient if you have a single "heavy" module filling one histogram, but that is not the case in DQM: there are lots of "light" modules, each filling multiple histograms.

Dmitrijus

davidlange6 commented 7 years ago

I'm not sure I follow what differs w.r.t. a current "one" module. Changing some modules to one modules has been suggested a few times recently without reaction from DQM - so maybe you can give it a try and report back about what event throughput is lost?


Dr15Jones commented 7 years ago

@dmitrijus multiple one modules can already run concurrently. However, this still limits the concurrency of the framework once the number of threads grows to be within the same order of magnitude as the number of one modules. Actually, one really slow one module can slow down the entire application (such as what happens with the PoolOutputModule).

So I agree that to deal with excessive memory use by a module, we could convert it to a one module (assuming that doesn't hurt our concurrency). However, converting all of them to one modules will limit our ability to scale.

fwyzard commented 7 years ago

Re-reading the comments, it's not clear if the proposal is to try one modules with individual resources, or one modules sharing a single "DQM" resource. The latter would indeed prevent any cross-talk between modules, and should be easy enough to prototype, but might hit scalability problems.

Dr15Jones commented 7 years ago

Each one module effectively has its own resource. As for sharing a MonitorElement across modules, I believe that would already break the present system.

fwyzard commented 7 years ago

OK, let me try again after a glass of wine...

I agree that using something like

class DQMAnalyzer : public edm::one::EDAnalyzer<edm::one::SharedResources>
{
public:
    DQMAnalyzer(edm::ParameterSet const& config)
    {
        usesResource("DQM");
    }

    ...
};

should be 100% safe, at a heavy cost in concurrency, as the framework would run only a single DQM module at a given time.

Something like

class DQMAnalyzer : public edm::one::EDAnalyzer<>
{
public:
    DQMAnalyzer(edm::ParameterSet const& config)
    { }
    ...
};

should be much better: it would guarantee that only one copy of each DQMAnalyzer module is active at a time, so the DQM framework could drop some of the support for concurrency (i.e. multiple copies of each histogram, merging, etc.). It would still need to support having multiple runs and lumisections "open" at a given time, though.

An alternative, as suggested by @dmitrijus, could be to make the MonitorElement a locking wrapper around the underlying TH1 objects. For performance reasons I think we should still allow for some concrete-type-aware construct (e.g. getTH1F() could return a wrapper templated on TH1F). This approach could work well as long as there is little or no contention among different instances of the same module wanting to fill the same ME.
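
A minimal sketch of such a concrete-type-aware locking wrapper (names invented for illustration; getTH1F() would have to be changed to return it):

    #include <mutex>
    #include <utility>

    // Thin, type-aware handle handed out instead of the bare ROOT pointer: every
    // call grabs the MonitorElement's mutex, so concurrent fills from different
    // stream instances serialize instead of corrupting the underlying histogram.
    template <typename H>
    class LockedHisto {
    public:
      LockedHisto(H* histogram, std::recursive_mutex& mutex) : histogram_(histogram), mutex_(mutex) {}

      template <typename... Args>
      void Fill(Args&&... args) {
        std::lock_guard<std::recursive_mutex> guard(mutex_);
        histogram_->Fill(std::forward<Args>(args)...);
      }

    private:
      H* histogram_;
      std::recursive_mutex& mutex_;
    };

With something like this, getTH1F() would return a LockedHisto<TH1F> and user code could keep calling Fill() unchanged.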

These proposals would likely break the FastTimerService as it is written now, but I think this can be worked around by rewriting (parts of) it for the 4th time. Whoever takes the decision to make the change should take care of that, too.

Last option: we roll our own concurrent histogram class. For this, it looks like (some numbers to follow) using atomic types is preferable to locking. If we only care about filling histograms, that should work fine. If we want concurrent filling and reading (of statistics, etc.), some extra care will be needed; again it could be done either with atomics or with locks, the former likely being more performant.

Final comment: I think ROOT 7 is considering some form of concurrent histograms (based on a hopefully-concurrent queue approach). I don't have more details than https://root.cern.ch/doc/v608/concurrentfill_8cxx_source.html . And I think the implementation follows the horrible approach used for histograms so far, so I wouldn't touch it with a 10-foot wooden pole.

So, about root histograms - what I don't like is

Did I leave anything out ?

VinInn commented 7 years ago

On 21 Jun, 2017, at 6:35 PM, Andrea Bocci notifications@github.com wrote:

I had a quick chat with @VinInn and he presented another idea, which I really like: instead of playing with atomics or explicit locking, why not simply let the framework handle the parallel execution / scheduling. In short, remove parallelism from individual modules, but exploit "module level" concurrency.

Actually, I was thinking of a more general, flexible scheme in which the framework does not pre-emptively create N stream modules, but waits and creates one more only "if needed". In this mode the scheduler will essentially try to fill the threads with in-event concurrency, and will move to good old event-level parallelism only if forced to run the same module for more than one event. Excluding some long components of the iterative tracking (and even there one must see at which level), I bet we can reduce the number of instantiated modules by an order of magnitude or so...

v.


fwyzard commented 7 years ago

@Dr15Jones, @dmitrijus, can you (both) confirm that you would rather use edm::one modules with no mutexes/locks in the MonitorElement class than edm::stream modules with a mutex/lock around MonitorElement::Fill() ?

@Dr15Jones, if one uses an edm::one module, does the framework guarantee that it will never see intermixed events from different lumisections or runs ?

fwyzard commented 7 years ago

Ah, I should have read FWMultithreadedFrameworkOneModuleInterface before asking...

So, can you confirm that you would rather have the DQM analyzer modules inherit from edm::one::EDAnalyzer<edm::one::WatchRuns, edm::one::WatchLuminosityBlocks> (thus forcing the whole job to analyse a single run and lumisection at a time) than stick to edm::stream::EDAnalyzer modules and add a lock to MonitorElement::Fill() ?
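
For reference, such a module would look roughly like the sketch below (the names are placeholders; the method signatures follow the standard edm::one interface):

    #include "FWCore/Framework/interface/one/EDAnalyzer.h"
    #include "FWCore/Framework/interface/Event.h"
    #include "FWCore/Framework/interface/EventSetup.h"
    #include "FWCore/Framework/interface/LuminosityBlock.h"
    #include "FWCore/Framework/interface/Run.h"
    #include "FWCore/ParameterSet/interface/ParameterSet.h"

    class MyDQMAnalyzer : public edm::one::EDAnalyzer<edm::one::WatchRuns, edm::one::WatchLuminosityBlocks> {
    public:
      explicit MyDQMAnalyzer(edm::ParameterSet const& config) {}

      void beginRun(edm::Run const&, edm::EventSetup const&) override {}                          // book per-run histograms
      void beginLuminosityBlock(edm::LuminosityBlock const&, edm::EventSetup const&) override {}  // book per-lumi histograms
      void analyze(edm::Event const&, edm::EventSetup const&) override {}                         // fill, no locking needed
      void endLuminosityBlock(edm::LuminosityBlock const&, edm::EventSetup const&) override {}
      void endRun(edm::Run const&, edm::EventSetup const&) override {}
    };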

Dr15Jones commented 7 years ago

Given that the DQM modules all reside on the same EndPath, and modules in a path must be processed in the order they are on the path, making the DQM modules one modules would lead to many more stalls across streams than a mutex. However, changing the DQM system to return the same histogram to all stream instances may take a significant amount of re-engineering.

fwyzard commented 7 years ago

This may be off topic, but if the one modules do not have any dependencies on each other, is there no way to schedule them to run in parallel ?

VinInn commented 7 years ago

On 25 Jul, 2017, at 10:02 PM, Chris Jones notifications@github.com wrote:

"Given that the DQM modules all reside on the same EndPath"

Why? I schedule them as any other analyzer... (and I am not the only one.)

"and modules in a path must be processed in the order they are on the path"

Why? Modules are independent; they are connected only by consumes and produces. Analyzers can be run in any order.

"making the DQM modules one modules would lead to many more stalls across streams than a mutex. However, changing the DQM system to return the same histogram to all stream instances may take a significant amount of re-engineering."

I think we need a valid long-term solution: I do not see any role for streams in DQM modules; they should move to global and learn how to deal with parallel programming. The CMSSW migration has been very successful, but it is now pampering developers a bit too much. I think modules should be disconnected from streams, and the scheduler should instantiate the minimum number of them needed to make concurrency effective: I bet that most of the time the streams seldom run the same module in parallel, and for sure this can be easily minimized with a more flexible schedule.

vincenzo


fwyzard commented 7 years ago

Hi, moving the DQM part of the DQMEDAnalyzers to edm::global would not be all that complicated:

However this makes the assumption that the users' code in the DQMEDAnalyzer is global-friendly, which in general may not be the case.

This is why I was proposing an intermediate approach:

Either way we need to protect the underlying ROOT object from concurrent access (which was the point of the whole thread here). The only "quick and dirty" way Marco and I could think of was simply to add a lock in the call to MonitorElement::Fill(), and declare any other usage as non-safe.

Moving instead to edm::one modules does away with the need for the lock, at the price of lower overall concurrency.

Back to the question: assuming the complexity to implement the change is roughly the same, do people prefer to have the DQM analyzer modules inherit from edm::one::EDAnalyzer<edm::one::WatchRuns, edm::one::WatchLuminosityBlocks> (thus forcing the whole job to analyse a single run and lumisection at a time), or to keep using edm::stream::EDAnalyzer modules and add a lock to MonitorElement::Fill() ?

If we decide to try the latter, I have a few follow-up questions/comments...

VinInn commented 7 years ago

On 26 Jul, 2017, at 9:30 AM, Andrea Bocci notifications@github.com wrote:

If we decide to try the latter, I have a few follow-up questions/comments...

• is there a way to query the framework about the concurrency level of the job (number of streams), and avoid the locking for single-stream jobs ? I know how to check the concurrency level in a Service, but not in an EDAnalyzer; also the most common idiom (std::lock_guard lock(mutex);) would not work inside an if () {...} block :-)

Is there any real cost in lock/unlock? Has anybody measured it? BTW, for an example have a look at https://github.com/VinInn/cmssw/blob/90d013fbf403f1ff25bd6c4ea8f582a6dbd7c4e3/RecoPixelVertexing/PixelLowPtUtilities/bin/PixelClusterShapeExtractor.cc#L220 ; we can always implement a faster in-line lock using atomics, spinning, std::yield, etc. to avoid a system call.

• @rovere was proposing to add a second DQMGlobalEDAnalyzer class that inherits from edm::global::EDAnalyzer instead. This would allow experimenting with it, while the majority of the modules remain edm::stream::EDAnalyzer


fwyzard commented 7 years ago

On my laptop, running with a single thread:

• locking with a mutex takes 17 ns
• locking with a spinlock on an atomic_flag takes 7 ns, but scales worse if there is contention

VinInn commented 7 years ago

On 26 Jul, 2017, at 10:04 AM, Andrea Bocci notifications@github.com wrote:

On my laptop, running with a single thread

• locking with a mutex takes 17 ns
• locking with a spinlock on an atomic_flag takes 7 ns, but scales worse if there is contention

You need to yield after some few 100K iterations... see the OMP implementation for instance in this long thread: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706

v.


fwyzard commented 7 years ago

OK, I'll have to review my simple spinlock... is there any C++14 spinlock around ?

VinInn commented 7 years ago

On 26 Jul, 2017, at 10:39 AM, Andrea Bocci notifications@github.com wrote:

OK, I'll have to review my simple spinlock... is there any C++14 spinlock around ?

No, it is a detail of the lock_guard....

v.
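
For reference, a minimal C++14 sketch of such a hand-rolled lock (it satisfies the BasicLockable requirements, so std::lock_guard works with it unchanged; the yield threshold is purely illustrative):

    #include <atomic>
    #include <thread>

    // Spin on an atomic_flag; after a while back off with a yield, as suggested above.
    class SpinLock {
    public:
      void lock() {
        unsigned spins = 0;
        while (flag_.test_and_set(std::memory_order_acquire)) {
          if (++spins > 100000) {          // purely illustrative threshold
            std::this_thread::yield();
            spins = 0;
          }
        }
      }

      bool try_lock() { return !flag_.test_and_set(std::memory_order_acquire); }

      void unlock() { flag_.clear(std::memory_order_release); }

    private:
      std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
    };

    // usage: SpinLock lock;  std::lock_guard<SpinLock> guard(lock);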

Dr15Jones commented 7 years ago

I'm for testing the lock mechanism since it is the right direction for eventually using a histogram which uses atomics for its bins.

dmitrijus commented 7 years ago

The locking is a terrible approach.

It is way too easy to leak the TH1 object. Debugging histogram filling has been the single worst nightmare for the DQM, and this makes it worse, as you can't really protect access to the TH1.

Currently, the MonitorElement interface is not enough to cover all the ways people use TH1.

... And it will involve a huge migration of the entire DQM. Last time this happened, it took 3 years and we are still not done with it.

My vote is to really use:

edm::one::EDAnalyzer<edm::one::WatchRuns, edm::one::WatchLuminosityBlocks>

and keep the current interface/output/merging/logic the same.

thus forcing the whole job to analyse a single run and lumisection at a time

The only place I know this would be detrimental for is online HLT, which is a special case, and we maintain separate pathways for it anyway. For online HLT, the modules should continue to use the current ::stream:: modules.

The CMSSW migration has been very successful, but it is now pampering developers a bit too much.

There are way too many non-DQM-core DQM developers; a third of them have issues with git rebase. We will be unable to enforce anything but the simplest interface and most fool-proof way to fill the histograms.

Dmitrijus

Dr15Jones commented 7 years ago

Using ::one modules would require the following to allow good concurrency

Dr15Jones commented 7 years ago

@fwyzard

This may be off topic but, if the one modules do not have any dependency on each other, is there no way to schedule them to run in parallel ?

My original plan was to do just that. However, we had too many examples of modules having hidden dependencies on other modules (e.g. do you know if any two DQM modules are not filling the same histogram?), so keeping Paths/EndPaths sequential was the easiest way forward.

fwyzard commented 7 years ago

However, we had too many examples of modules having hidden dependencies on other modules (e.g. do you know if any two DQM modules are not filling the same histogram?), so keeping Paths/EndPaths sequential was the easiest way forward.

Still off topic... would it make sense to try that in an experimental branch, and check how many (other) things break ?

dmitrijus commented 7 years ago

Perhaps there should be a way to mark a cms.Path as "parallel"?

DQMEDAnalyzer modules do not need to run in sequence; they have no dependencies on each other. This is enforced as is: it is impossible for these modules to fill the same histogram, because the interface does not allow booking such a histogram!

This is in contrast with DQMEDHarvesters: these modules must run single-threaded, in sequence, and share histograms with each other.

fwyzard commented 7 years ago

DQMEDAnalyzer modules do not need to run in sequence; they have no dependencies on each other. This is enforced as is: it is impossible for these modules to fill the same histogram, the interface does not allow that!

That's interesting to know, because this has been one of the sore points. How is that actually disallowed ? What happens if two modules try to book the same histogram in the same directory ?

dmitrijus commented 7 years ago

They each get a separate ME object: the streamId / moduleId will be unique even if the histogram path is the same, so even if two modules book a histogram with the same path, they will each get their own unique TH1 objects.

Once the modules reach mergeAndResetMEsSummaryCache, each module's own copy is merged into a single global histogram. But that happens within a lock, and the merge order is undefined (as is).

A DQMEDAnalyzer cannot access this global element (streamId, moduleId = 0); only DQMEDHarvesters can.

fwyzard commented 7 years ago

OK, so the problem of a possible inconsistency happens only at the endlumi/endrun merging step. Thanks, now I understand better why my fix at #19912 actually works.

fwyzard commented 7 years ago

@Dr15Jones @dmitrijus I have a first implementation of locking all accesses to the MonitorElements' ROOT objects: https://github.com/cms-sw/cmssw/pull/20158 .

Can you have a look and send me your comments ?

It is still incomplete (see the TODO list) and I haven't tried to update all the client code yet.

fwyzard commented 7 years ago

...and as soon as I started looking at user code to fix, I found a use case we had not discussed:

    auto theMEHistogram = theMonitor->getTH1F();
    [...]
    Double_t theBinContent = theMEHistogram->GetBinContent(digi->channel()) + digi->adc();
    theMEHistogram->SetBinContent(digi->channel(), theBinContent);

I don't think any kind of concurrent histogram can support this out of the box: even if GetBinContent() and SetBinContent() are fully atomic, the combination of the two is not.

If I stick to an std::recursive_mutex I can expose the locking and do something like

    auto theMEHistogram = theMonitor->getTH1F();
    [...]
    theMEHistogram->lock();
    Double_t theBinContent = theMEHistogram->GetBinContent(digi->channel()) + digi->adc();
    theMEHistogram->SetBinContent(digi->channel(), theBinContent);
    theMEHistogram->unlock();

or maybe

    auto theMEHistogram = theMonitor->getTH1F();
    [...]
    std::lock_guard<...> lock(theMEHistogram->mutex());
    Double_t theBinContent = theMEHistogram->GetBinContent(digi->channel()) + digi->adc();
    theMEHistogram->SetBinContent(digi->channel(), theBinContent);

Does anyone have a better idea ?

Dr15Jones commented 7 years ago

A templated function that takes a function as an argument. The function takes the lock, calls the functor, and then releases the lock. That avoids exposing the lock.
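
Something along these lines, as a sketch (the method name and the recursive_mutex member are assumptions, not the actual MonitorElement code):

    #include <mutex>
    #include <utility>

    class TH1F;  // ROOT type, forward-declared for the sketch

    class MonitorElement {
    public:
      // Take the lock, hand the ROOT object to the caller's functor, release the
      // lock when the functor returns; the lock itself never escapes.
      template <typename F>
      auto applyLocked(F&& func) -> decltype(func(std::declval<TH1F&>())) {
        std::lock_guard<std::recursive_mutex> guard(mutex_);
        return func(*object_);
      }

    private:
      TH1F* object_ = nullptr;
      std::recursive_mutex mutex_;
    };

The user code above would then become:

    theMonitor->applyLocked([&](TH1F& histogram) {
      double content = histogram.GetBinContent(digi->channel()) + digi->adc();
      histogram.SetBinContent(digi->channel(), content);
    });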

fwyzard commented 7 years ago

OK, let me go through all the user code and check if anybody would need to lock multiple histograms at the same time...

dmitrijus commented 7 years ago

Same suggestion as @Dr15Jones said.

Another issue I came up with: with this concurrent filling, what happens to lumi histograms? How are they "integrated"; does it become "a snapshot" of the DQMStore at the end of some global end lumi? Since modules will see events in parallel, the end-lumi transition is no longer consistent.

I guess you could still continue to merge each module at its own end lumi... actually, no you can't, it is still inconsistent: by the time any module reaches its own end lumi, other instances will already have pre-filled the histogram for the next lumi...

I don't mind getting rid of lumi histograms, but you are the major user :)

rovere commented 7 years ago

Lumi histograms should be booked at the right begin-lumi transition and labelled with the correct lumi number inside the DQMStore, possibly with module and stream number set to 0 if they have to be filled concurrently as globals.


dmitrijus commented 7 years ago

@rovere Correct. But this does involve a bit of engineering...

So then it means we either split bookHistograms into two methods, bookHistograms4Run and bookHistograms4Lumi, or split at a higher level, into DQMEDAnalyzerRun and DQMEDAnalyzerLumi.

Dr15Jones commented 7 years ago

@fwyzard could the strange code you found work by using a weight instead of explicitly getting the bin content?
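
That is, if the channel number maps directly onto the histogram's axis (an assumption that would need checking for this detector), the read-modify-write above would collapse into a single weighted fill, which a locked Fill() already covers:

    theMonitor->Fill(digi->channel(), digi->adc());   // adds adc() as a weight to the bin containing channel()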

smuzaffar commented 11 months ago

@fwyzard , is this still an open issue?

fwyzard commented 11 months ago

ehm... I'll have to check!

bendavid commented 11 months ago

We're doing this (using boost histograms) for the mW analysis FWIW since our histograms are too big to keep one copy in memory for each thread...
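
For reference, a minimal sketch of concurrent filling with Boost.Histogram (assuming a recent Boost where accumulators::count takes a thread-safety flag; the axis and types are illustrative, not the actual mW setup):

    #include <boost/histogram.hpp>

    namespace bh = boost::histogram;

    int main() {
      // Dense storage whose bins are atomic counters: many threads can fill the
      // same histogram object concurrently, with no per-thread copies to merge.
      auto hist = bh::make_histogram_with(
          bh::dense_storage<bh::accumulators::count<unsigned, /* thread-safe = */ true>>(),
          bh::axis::regular<>(100, 0.0, 1.0));

      hist(0.5);  // safe to call from multiple threads
    }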

makortel commented 11 months ago

assign dqm

cmsbuild commented 11 months ago

New categories assigned: dqm

@rvenditti,@syuvivida,@tjavaid,@nothingface0,@antoniovagnerini you have been requested to review this Pull request/Issue and eventually sign? Thanks

jpivarski commented 11 months ago

This is a different context—histogram filling on GPUs—but in that context, atomic-bin filling is much better than copy-and-merge: https://github.com/jpivarski-talks/2023-11-02-atlas-gpu-python-tutorial/blob/main/solutions/project-1-fill-histograms.ipynb

In CPU contexts, I think it's a toss-up; it depends on how big (in memory) the histograms are and the distribution of data filling the histograms, which in turn determines how likely they are to contend. (Remember that each bin is a separate atomic counter.)