cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.08k stars 4.3k forks source link

Raw data object rarely goes missing in multi-threaded HLT jobs #29209

Closed smorovic closed 4 years ago

smorovic commented 4 years ago

During recent output module tests which involve a lot of CPU time spent in output modules (for 2021 Heavy Ion running), I have been encountering spurious errors with 4-thread and 4-stream cmsRun jobs and would like to report it.

The issue appears when a module can not find product inserted by the input source (invalid handle). This causes following error message in the L1 HLT module (found in VirginRaw HLT menu):

%MSG-e L1T:  L1TRawToDigi:hltGtStage2Digis 03-Mar-2020 18:59:42 CET  Run: 1000001680 Event: 6996208
Cannot unpack: no FEDRawDataCollection found
%MSG

(followed by other fatal exception).

It happens here: https://github.com/cms-sw/cmssw/blob/master/EventFilter/L1TRawToDigi/plugins/L1TRawToDigi.cc#L140

It is relatively rare and hard to reproduce. I was running on a scaled-down HLT farm of 12 nodes (32 cores per node, 1 proces assigned per 4 cores). In this case problem appears every 30 minutes. It also only appears when CPU load is near maximum (which is achieved using expensive LZMA compression in the output module or doing other equivalent work with similar CPU time, as I tested). It did not appear with 1-stream/thread setup. While I first saw this with CMSSW_11_0_1, further testing showed also the same problem with CMSSW_10_6_8.

I did some debugging, and it appears code in FedRawDataInputSource which inserts RawDataCollection does run (according to logged message) before this issue appears in the same event, i.e. it seems that the product goes missing in between.

cmsbuild commented 4 years ago

A new Issue was created by @smorovic Srecko Morovic.

@Dr15Jones, @smuzaffar, @silviodonato, @makortel, @davidlange6, @fabiocos can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 4 years ago

assign core, daq

cmsbuild commented 4 years ago

New categories assigned: core,daq

@Dr15Jones,@smuzaffar,@makortel,@mommsen,@emeschi you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 4 years ago

@Dr15Jones Could you take a look?

Dr15Jones commented 4 years ago

@smorovic can you post the other fatal exceptions you see? Also, is 'delete early' feature turned on in the job you are running?

smorovic commented 4 years ago

I think delete early is not used (no "canDeleteEarly" option set anywhere in configuration). For reference, here is a copy of the configuration file: /afs/cern.ch/user/s/smorovic/public/HltConfig.py_run1000001680_slc7_amd64_gcc700_CMSSW_10_6_8_TIER0_TRANSFER_OFF

All the errors and exceptions that appear:

%MSG-e L1T:  L1TRawToDigi:hltGtStage2Digis 03-Mar-2020 18:59:42 CET  Run: 1000001680 Event: 6996208
Cannot unpack: no FEDRawDataCollection found
%MSG
----- Begin Fatal Exception 03-Mar-2020 18:59:42 CET-----------------------
An exception of category 'InsertFailure' occurred while
   [0] Processing  Event run: 1000001680 lumi: 7406 event: 6996208 stream: 0
   [1] Running path 'RatesMonitoring'
   [2] Calling method for module L1TRawToDigi/'hltGtStage2Digis'
Exception Message:
Attempt to insert more than one product on branch l1tMuonBXVector_hltGtStage2Digis_Muon_HLT.
----- End Fatal Exception -------------------------------------------------
%MSG-w FastMonitoringService:  PostProcessPath 03-Mar-2020 18:59:42 CET  Run: 1000001680 Event: 6996208
 STREAM 0 earlyTermination -: ID:run: 1000001680 lumi: 7406 event: 6996208 LS:7406  FromThisContext
%MSG
Dr15Jones commented 4 years ago

@smorovic OK that exception is very weird. Do you have other ones as well? This is looking more and more like a memory overwrite.

Dr15Jones commented 4 years ago

@smorovic can you give me a recipe to reproduce the problem?

Dr15Jones commented 4 years ago

What is the configuration of L1TRawToDigi that is being used in this job?

smorovic commented 4 years ago

Here is the configuration of L1TRawToDigi.

process.hltGtStage2Digis = cms.EDProducer( "L1TRawToDigi",
    lenSlinkTrailer = cms.untracked.int32( 8 ),
    lenAMC13Header = cms.untracked.int32( 8 ),
    CTP7 = cms.untracked.bool( False ),
    lenAMC13Trailer = cms.untracked.int32( 8 ),
    Setup = cms.string( "stage2::GTSetup" ),
    MinFeds = cms.uint32( 0 ),
    InputLabel = cms.InputTag( "rawDataCollector" ),
    lenSlinkHeader = cms.untracked.int32( 8 ),
    MTF7 = cms.untracked.bool( False ),
    FWId = cms.uint32( 0 ),
    TMTCheck = cms.bool( True ),
    lenAMCTrailer = cms.untracked.int32( 0 ),
    debug = cms.untracked.bool( False ),
    FedIds = cms.vint32( 1404 ),
    lenAMCHeader = cms.untracked.int32( 8 ),
    DmxFWId = cms.uint32( 0 ),
    FWOverride = cms.bool( False )
)

I can prepare a file sample to run using a single process if you would like to use it to investigate any memory issues. However it will not be easily reproducible in this way.

The frequency of this is around 1 case in 30 minutes using 12 machines (each with 32-cores). it is a BU-FU setup. I use custom script to generate event data in a filesystem (ramdisk) and then a number of client processes are accessing it over NFS. If needed I can also provide you instructions and arrange access to the needed machines in P5 (and an account if needed).

I should mention that also a few other crashes or exception types were appearing in the original11_0_1 run with the full farm ( we did this during MWGR1 with Tracker), maybe it is also an indication of memory overwrite (but those could not be reproduced). I saved some of the dumps.

In one case:

#4  <signal handler called>
#5  0x00007f924ff9a53c in je_extent_heap_remove (ph=ph@entry=0x7f91b0208120, phn=phn@entry=0x7f91b0209800) at src/extent.c:289
#6  0x00007f924ff670e3 in arena_bin_slabs_nonfull_remove (slab=0x7f91b0209800, bin=0x7f91b02080a8) at src/arena.c:1606
#7  arena_dissociate_bin_slab (bin=0x7f91b02080a8, slab=0x7f91b0209800, arena=0x7f91b02008c0) at src/arena.c:1606
#8  arena_dalloc_bin_locked_impl (junked=true, ptr=<optimized out>, slab=0x7f91b0209800, binind=0, bin=0x7f91b02080a8, arena=0x7f91b02008c0, tsdn=0x7f91b0cf8ab8, tsdn@entry=0xffffff
ffffffffff) at src/arena.c:1666
#9  je_arena_dalloc_bin_junked_locked (tsdn=tsdn@entry=0x7f91b0cf8ab8, arena=arena@entry=0x7f91b02008c0, bin=bin@entry=0x7f91b02080a8, binind=binind@entry=16, extent=0x7f91b0209800,
 ptr=<optimized out>) at src/arena.c:1682
#10 0x00007f924ffbf9aa in je_tcache_bin_flush_small (tsd=tsd@entry=0x7f91b0cf8ab8, tcache=tcache@entry=0x7f91b0cf8ca8, tbin=tbin@entry=0x7f91b0cf8e38, binind=binind@entry=16, rem=0) at include/jemalloc/internal/tsd.h:225
#11 0x00007f924ffc1b74 in tcache_flush_cache (tcache=0x7f91b0cf8ca8, tsd=0x7f91b0cf8ab8) at src/tcache.c:517
#12 tcache_destroy (tsd=tsd@entry=0x7f91b0cf8ab8, tcache=tcache@entry=0x7f91b0cf8ca8, tsd_tcache=tsd_tcache@entry=true) at src/tcache.c:547
#13 0x00007f924ffc39b7 in je_tcache_cleanup (tsd=tsd@entry=0x7f91b0cf8ab8) at src/tcache.c:594
#14 0x00007f924ffc46d9 in tsd_do_data_cleanup (tsd=0x7f91b0cf8ab8) at src/tsd.c:368
#15 je_tsd_cleanup () at src/tsd.c:393
#16 0x00007f924ed7dbd2 in __nptl_deallocate_tsd () from /lib64/libpthread.so.0
#17 0x00007f924ed7dde3 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f924eaa5ead in clone () from /lib64/libc.so.6

in the other case:

cmsRun: /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_11_0_1-slc7_amd64_gcc820/build/CMSSW_11_0_1-build/tmp/BUILDROOT/bc61c9e1c5073953c4c02af6e508045a/opt/cmssw/slc7_amd64_gcc820
/cms/cmssw/CMSSW_11_0_1/src/FWCore/Framework/src/EventProcessor.cc:1476: void edm::EventProcessor::readLuminosityBlock(edm::LuminosityBlockProcessingStatus&): Assertion `lbp' failed
.

But, surprisingly, in replaying input data (which I did in attempts to reproduce), I just always get the first one I reported, not any other crash type.

Dr15Jones commented 4 years ago

@smorovic could you make the full log of a failed job avaiable somewhere? I do not see how the failure to get the FEDRawDataCollection leads to the module then throwing that exception.

smorovic commented 4 years ago

@Dr15Jones, I copied log files for the two occurrences:

  1. run during the MWGR 1
  2. reproduced replay run (there be additional errors due to lumisection mismatch, but normal in this case as the number is faked. Log files are are found here:

In /afs/cern.ch/user/s/smorovic/public/HLT-errors/

old_hlt_run335523_pid143608.log

old_hlt_run1000001680_pid89272.log.tar.gz (compressed)

HltConfig.py_run1000001680_slc7_amd64_gcc700_CMSSW_10_6_8_TIER0_TRANSFER_OFF (configuration file, I moved if here)

Dr15Jones commented 4 years ago

So the uncompressed log has odd messages

%MSG-w EvFDaqDirector:   HLTriggerJSONMonitoring:hltHLTriggerJSONMonitoring@beginRun  20-Feb-2020 10:59:19 CET Run: 335523
Transfer system mode definitions missing for -: streamHLTRates (permissive mode)
%MSG
%MSG-e L1T:  L1TRawToDigi:hltGtStage2Digis 20-Feb-2020 11:15:09 CET  Run: 335523 Lumi: 39 Stream: 1
Cannot unpack: no FEDRawDataCollection found
%MSG
%MSG-e L1T:  L1TRawToDigi:hltGtStage2Digis 20-Feb-2020 11:15:09 CET  Run: 335523 Event: 10167644
Cannot unpack: no FEDRawDataCollection found
%MSG
----- Begin Fatal Exception 20-Feb-2020 11:15:09 CET-----------------------
An exception of category 'InsertFailure' occurred while
   [0] Processing  Event run: 335523 lumi: 40 event: 10167644 stream: 1
   [1] Running path 'RatesMonitoring'
   [2] Calling method for module L1TRawToDigi/'hltGtStage2Digis'
Exception Message:
Attempt to insert more than one product on branch l1tMuonBXVector_hltGtStage2Digis_Muon_HLT.
----- End Fatal Exception -------------------------------------------------
%MSG-w FastMonitoringService:  PostProcessPath 20-Feb-2020 11:15:09 CET  Run: 335523 Event: 10167644
 STREAM 1 earlyTermination -: ID:run: 335523 lumi: 40 event: 10167644 LS:40  FromThisContext 
%MSG

Note that the first error says it comes during processing of lumi 39 by a stream, not during the processing of an event! Do you know what the transition order that was to be expected? {E.g. what run, lumis and events.)

Dr15Jones commented 4 years ago

The compressed log file shows similar behavior before the exception

%MSG-w AuxiliaryMakers:  source 03-Mar-2020 18:59:42 CET PostProcessEvent
Lumisection mismatch, external : 7405, record : 29
%MSG
%MSG-e L1T:  L1TRawToDigi:hltGtStage2Digis 03-Mar-2020 18:59:42 CET  Run: 1000001680 Lumi: 7405 Stream: 0
Cannot unpack: no FEDRawDataCollection found
%MSG
%MSG-w AuxiliaryMakers:  source 03-Mar-2020 18:59:42 CET PostStreamBeginLumi
6996205
%MSG
%MSG-w AuxiliaryMakers:  source 03-Mar-2020 18:59:42 CET PostStreamBeginLumi
Lumisection mismatch, external : 7406, record : 29
%MSG
%MSG-w AuxiliaryMakers:  source 03-Mar-2020 18:59:42 CET AfterSource
6996206
%MSG
%MSG-w AuxiliaryMakers:  source 03-Mar-2020 18:59:42 CET AfterSource
Lumisection mismatch, external : 7406, record : 29
%MSG
%MSG-w AuxiliaryMakers:  source 03-Mar-2020 18:59:42 CET AfterSource
6996207
%MSG
%MSG-w AuxiliaryMakers:  source 03-Mar-2020 18:59:42 CET AfterSource
Lumisection mismatch, external : 7406, record : 29
%MSG
%MSG-w AuxiliaryMakers:  source 03-Mar-2020 18:59:42 CET AfterSource
6996208
%MSG
%MSG-w AuxiliaryMakers:  source 03-Mar-2020 18:59:42 CET AfterSource
Lumisection mismatch, external : 7406, record : 29
%MSG
%MSG-e L1T:  L1TRawToDigi:hltGtStage2Digis 03-Mar-2020 18:59:42 CET  Run: 1000001680 Event: 6996208
Cannot unpack: no FEDRawDataCollection found
%MSG
----- Begin Fatal Exception 03-Mar-2020 18:59:42 CET-----------------------
An exception of category 'InsertFailure' occurred while
   [0] Processing  Event run: 1000001680 lumi: 7406 event: 6996208 stream: 0
   [1] Running path 'RatesMonitoring'
   [2] Calling method for module L1TRawToDigi/'hltGtStage2Digis'
Exception Message:
Attempt to insert more than one product on branch l1tMuonBXVector_hltGtStage2Digis_Muon_HLT.
----- End Fatal Exception -------------------------------------------------

So again, what looks like the module being called during a lumi transition!

Dr15Jones commented 4 years ago

In both cases, the message logger is saying the stream is processing a transition for the previous Lumi, but the Event produce method for the module is being called!

smorovic commented 4 years ago

Yes, it can only be printed by produce, but I'm not sure how that can happen for Lumi transition?

I did not understand your question about order that was to be expected? We have one run per process, and then lumis and events (only). Currently no multiple lumis are allowed at the same time.

Dr15Jones commented 4 years ago

By order, what order of transition numbers (run number, then which lumi number and then which event numbers) the source is supposed to generate.

smorovic commented 4 years ago

I will have to run a couple more tests to see what is executed by the input source prior to the error, then I will get back.

Dr15Jones commented 4 years ago

So I'm working on the hypothesis that the Event prior to the new Lumi is marked as 'finished' before hltGtStage2Digis is run for that Event. The Event data is then cleared and only after that is hltGtStage2Digis called. The hltGtStage2Digis fails to find the data product which results in the first message which includes the missing data product and it being labeled as happening during a Lumi transition. In addition hltGtStage2Digis puts empty items in to the Event. Then once the next Event happens for that stream, attempts by hltGtStage2Digis to put the data product again would cause the exception.

HOWEVER, the above hypothesis does not explain why the FEDRawDataCollection is again missing in the Event after the Lumi transition.

Dr15Jones commented 4 years ago

The hypothesis could explain the second missing product message IF

This seems highly unlikely especially if the error message seen always mentions 'Muon' as the data product which is being added again. 'Muon' is the first data product added by the module and therefore the one to always see the message if ALL of the module's data products are already in the Event. If this were happening while data is being deleted, we might expect to see some of the other data products be mentioned. It is possible that because the deleting of data products in the Event is not intended to be thread-safe that the module's data products could have been deleted but the CPU cache has not been updated yet. Still, seems unlikely.

Dr15Jones commented 4 years ago

There is one puzzle in the output. The messages from AuxiliaryMakers sometime say they are during PostProcessEvent and sometimes during AfterSource and when there was a problem it says PostStreamBeginLumi.

smorovic commented 4 years ago

I am unable to resume tests at the moment because one of machines needed was shut down (due to coronavirus situation). I asked if possible to bring it back, so we'll see if that materializes.

update: the machine is back on, so I can continue.

This was running on dual-socket machines. Maybe it would make sense to run each processes pinned to a particular CPU socket to remove any migration over the separate L3 caches. It would likely reduce the probability of cache issues (It will require some work in the software to get the pinning part done).

Auxiliary makers message can appear only from makeEventAuxiliary, which can run only from FedRawDataInputSource::read.

Dr15Jones commented 4 years ago

So I understand about the PostProcessEvent and AfterSource behavior. The message logger does not set that field when the 'preSourceEvent` state happens. So if a message is printed during that time, what ever the last message which set that field on that particular thread will be printed. Not so helpful so I'll probably go ahead and change that in the MessageLogger.

Dr15Jones commented 4 years ago

So the

cmsRun: /root/buildArea/CMSSW_11_1_0_pre3/src/FWCore/Framework/src/EventProcessor.cc:1478: void edm::EventProcessor::readLuminosityBlock(edm::LuminosityBlockProcessingStatus&): Assertion `lbp' failed.

assert happened to me while running the test

cmsRun ${CMSSW_BASE}/src/FWCore/Framework/test/test_one_modules_cfg.py

on a machine that was fully loaded.

smorovic commented 4 years ago

I added more log information and ran over night. Got 6 errors. They all have input source "nextEvent" return new luminosity block, and immediately later there is first of the "cannot unpack" error:

%MSG-w FedRawDataInputSource:  PostProcessPath 18-Mar-2020 00:43:02 CET  PostProcessEvent
----------------NEW LUMI----------------
%MSG
%MSG-e L1T:  L1TRawToDigi:hltGtStage2Digis 18-Mar-2020 00:43:02 CET  Run: 1000001685 Lumi: 4505 Stream: 2
Cannot unpack: no FEDRawDataCollection found
%MSG

"New LUMI" log message was inserted here: https://github.com/cms-sw/cmssw/blob/master/EventFilter/Utilities/src/FedRawDataInputSource.cc#L269

Then the next "Cannot unpack" error appears on the next event handled by the input source, and finally also exception is thrown on it.

Logs of one case: /afs/cern.ch/user/s/smorovic/public/HLT-errors/20200318/err.log

I didn't print which lumi or event event in all of the messages, will add that in the next iteration.

Dr15Jones commented 4 years ago

I found a way to semi reliably cause the Assertion 'lbp' failed. So I will concentrate there for now.

Dr15Jones commented 4 years ago

I found out that if I replace the assert with a loop that re-requests the LuminosityBlock principal that eventually the job succeeds. So this is definitely a race condition based on when TBB tasks are run.

smorovic commented 4 years ago

I can try a similar trick in L1TRawToDigi to see if product is found if in a loop (maybe with limited cycles to print some error eventually).

Dr15Jones commented 4 years ago

So I can cause the assert failure 100% of the time by overloading the machine and slowing down this particular lambda https://github.com/cms-sw/cmssw/blob/master/FWCore/Framework/src/EventProcessor.cc#L1135-L1163

Do not know why yet, but working on it.

Dr15Jones commented 4 years ago

So I added printouts at the very end of that lambda containing StreamID, Run# and Lumi #. Here is what I get

beginStreamTransitionAsync called 3 8 2
beginStreamTransitionAsync called 2 8 2
Begin processing the 61st record. Run 8, Event 5, LumiSection 2 on stream 3 at 19-Mar-2020 11:34:27.778 CDT
Begin processing the 62nd record. Run 8, Event 6, LumiSection 2 on stream 2 at 19-Mar-2020 11:34:27.778 CDT
Begin processing the 63rd record. Run 8, Event 7, LumiSection 2 on stream 1 at 19-Mar-2020 11:34:27.778 CDT
Begin processing the 64th record. Run 8, Event 8, LumiSection 2 on stream 0 at 19-Mar-2020 11:34:27.778 CDT
beginStreamTransitionAsync called 0 8 2
beginStreamTransitionAsync called 1 8 2
beginStreamTransitionAsync called 0 9 1
beginStreamTransitionAsync called 1 9 1
beginStreamTransitionAsync called 2 9 1
Begin processing the 65th record. Run 9, Event 1, LumiSection 1 on stream 3 at 19-Mar-2020 11:34:27.784 CDT
Begin processing the 66th record. Run 9, Event 2, LumiSection 1 on stream 0 at 19-Mar-2020 11:34:27.784 CDT
Begin processing the 67th record. Run 9, Event 3, LumiSection 1 on stream 1 at 19-Mar-2020 11:34:27.785 CDT
Begin processing the 68th record. Run 9, Event 4, LumiSection 1 on stream 2 at 19-Mar-2020 11:34:27.785 CDT
cmsRun: /uscms_data/d2/cdj/build/temp/bug/CMSSW_11_1_0_pre4/src/FWCore/Framework/src/EventProcessor.cc:1480: void edm::EventProcessor::readLuminosityBlock(edm::LuminosityBlockProcessingStatus&): Assertion `lbp' failed.

A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.
...
Current Modules:

Module: none (crashed)beginStreamTransitionAsync called 3 9 1

Module: none
Module: none
Module: none

So we see that events and then end lumi are actually called before that lambda is finished. That condition doesn't always cause a failure (see Run 8 lumi 2 printouts) but every failure does have that call.

smorovic commented 4 years ago

From my side, I experimented with adding the loop containing getByToken and validity check. I also disabled gcc optimization for that code and ended adding __sync_synchronize barrier within the loop.

Loop would break if it eventually finds the product, and would then throw a special exception if it wasn't found the first time. But it never finds it and just keeps looping infinitely (it prints something every 10m loops to catch it happening).

Maybe thus the race that causes the "no FedRawData" issue happens earlier.

Dr15Jones commented 4 years ago

The assert was fixed with #29248

Dr15Jones commented 4 years ago

@smorovic if you want to try the fix, all you need to do is check out FWCore/Framework and change FWCore/Framework/src/EventProcessor.cc line 1152-3 from

 auto lp = status->lumiPrincipal();
 event.setLuminosityBlockPrincipal(lp.get());

to

 auto lp = status->lumiPrincipal().get();
 event.setLuminosityBlockPrincipal(lp);

I believe that is sufficient to fix the assert. You also do not have to rebuild the entire CMSSW code base with just that small change.

smorovic commented 4 years ago

Thanks @Dr15Jones . I already applied full patch to FWCore/Framework (only one reject which was trivial to fix with CMSSW_11_0_1). It seems to work without having to recompile any other libraries.

It has already been running for 2 hours and no exceptions seen so far (but this is still not very improbable even with the bug). I will leave it running for a day to be certain.

Dr15Jones commented 4 years ago

I'm assuming this test would also uncover the

Cannot unpack: no FEDRawDataCollection found

problem as well?

smorovic commented 4 years ago

Unfortunately, I had two occurrences of "no FEDRawData" problem (and eventually other exception, same as previously) in last 6 hours.

After applying the patch, this is how the relevant code looks:

        auto eventSetupImpls = &status->eventSetupImpls();
        auto lp = status->lumiPrincipal().get();
        streamLumiStatus_[i] = std::move(status);
        ++streamLumiActive_;
        event.setLuminosityBlockPrincipal(lp);

and I still got errors with it.

Now I managed to increase frequency of cases to a few per minute by having 4x streams and threads, likely because of increased preemption of threads. With this maybe it is possible to have a more reproducible test on a single host. I will try to prepare it.

Dr15Jones commented 4 years ago

Glad you could trigger it faster. Did you have any of the assert failures?

smorovic commented 4 years ago

No, I could never reproduce assert problem (after original runs where they appeared).

I managed to trigger 'no FEDRawData' one even faster by having more frequent LS transitions. Last combination: with one machine only, one process with 128 threads/streams (on 32-core dual Skylake) gives a case every 5-10 minutes. I think it was even faster when running with 8 processes (x 16 threads/streams) Although I didn't run generator job on the same host. Data is still read over NFS from the server, so I have to set up something simpler.

smorovic commented 4 years ago

I packaged a test which can be run on a standalone machine.

/afs/cern.ch/user/s/smorovic/public/HLT-errors/test-hlt.tar.gz

On dual Skylake (Gold 6130 CPUs) I ran 4 processes with 8 threads and streams each and with hyperthreading off on NUMA node-0, all on 16 cores. Error usually appears within 10 minutes.

What is interesting, I also tried it on an older Westmere X5650 with similar scaling of threads vs. cores, but I did not get an error even after prolonged running. But here I did get the other one, lbp assert error (when I run it with your patch no longer appears).

Both machines have same OS/kernel: CC7.6 and kernel 3.10.0-957.1.3.el7.x86_64. I will see tomorrow if I can run it on a few more generations, although most other machines are off at the moment.

So it looks like it depends on the CPU...

Dr15Jones commented 4 years ago

Do the events where the problem occur seem to have any relationship with LuminosityBlock transitions? E.g. happen right before or right after a LuminosityBlock transition happens? Just trying to find some pattern to help guide the debugging.

Dr15Jones commented 4 years ago

I tried running your script on cmsdev31 and got a failure in the server.sh script:

scripts/fulltest/localrun_dv.sh: line 115: numactl: command not found
Dr15Jones commented 4 years ago

After reviewing previous comments in this issue, it looks like the exception always happens right after a LuminosityBlock transition. Does that still seem to be the case after adding my one fix?

smorovic commented 4 years ago

It seems numactl is not installed. But you could also try by editing the script and try to run without numactl (and parameters), unless you need to pin process to a specific node or core.

From my understanding (from a few days ago), the no "FedRawData" error appears for event that was processed before input source returned new Lumi (if module runs after input source does this). And later also appears for first event processed after the new Lumi. Exception also appears from the second event.

smorovic commented 4 years ago

I rechecked once again. Yes, this appears with your fix (the complete patch applied to CMSSW_11_0_1).

Dr15Jones commented 4 years ago

So I added

process.add_(cms.Service("Tracer", dumpPathsAndConsumes=cms.untracked.bool(True)) )

To the job to better understand what was happening. I had done the following

The Tracer job then showed a number of lumis which contained no events. Is that expected?

From the Tracer output when using CMSSW_11_0_1 I see that hltGtStage2Digis is usually the first module run in an event since the first path to start is RatesMonitoring.

So I was wondering if you could try seeing if you can cause the problem just by having one Path in the job and that Path just being hltGtStage2Digis?

Dr15Jones commented 4 years ago

So I was wondering if you could try seeing if you can cause the problem just by having one Path in the job and that Path just being hltGtStage2Digis?

I tried running such a job on cmsdev31 but I can't come close to saturating the CPUs with 6 jobs as the source can't delver transitions fast enough to the jobs. (each job is hovering at around 100% CPU althought it is using 8 threads/streams).

Dr15Jones commented 4 years ago

Hiere is another thought. Change https://github.com/cms-sw/cmssw/blob/master/EventFilter/L1TRawToDigi/plugins/L1TRawToDigi.cc#L140

to be assert(false). That way when the problem occurs, cmsRun will pause all the threads and do a stack trace. That could help determine what is happening in the job as a whole.

smorovic commented 4 years ago

Last parameter is the number of writer threads, it can be increased if needed. Although even one is sufficient in my case. I am surprised that jobs can't saturate CPU given the high cost of compression, maybe output disk writing is slow?

For other parameters, 3 => files/LS and 30 => events / file. With this it is normal that some LS-es are completely empty because not each process gets a file (I didn't do detailed tuning in this case).

scripts/fulltest/localrun_dv.sh 1000001704 ramdisk 3 30 node0 node0 1

Also there is 0.1 threshold in localrun_dv.sh which can be changed. This is % of partition to fill with files before throttling back if that creates a problem.

I will try both suggestions. Here is already assertion trace:

cmsRun: /tmp/smorovic/CMSSW_11_0_1/src/EventFilter/L1TRawToDigi/plugins/L1TRawToDigi.cc:141: virtual void l1t::L1TRawToDigi::produce(edm::Event&, const edm::EventSetup&): Assertion `0' failed.

A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.

Mon Mar 23 17:12:29 CET 2020
Thread 19 (Thread 0x7fc2dffff700 (LWP 291938)):
#0  0x00007fc3759d7e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007fc3759d7cc4 in sleep () from /lib64/libc.so.6
#2  0x00007fc371b59de0 in sig_pause_for_stacktrace () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007fc375a0b1c9 in syscall () from /lib64/libc.so.6
#5  0x00007fc376c917a5 in tbb::internal::futex_wait (comparand=2, futex=0x7fc3732b6d2c) at ../../include/tbb/machine/linux_common.h:81
#6  tbb::internal::binary_semaphore::P (this=0x7fc3732b6d2c) at ../../src/tbb/semaphore.h:205
#7  rml::internal::thread_monitor::commit_wait (c=..., this=0x7fc3732b6d20) at ../../src/tbb/../rml/server/thread_monitor.h:255
#8  tbb::internal::rml::private_worker::run (this=0x7fc3732b6d00) at ../../src/tbb/private_server.cpp:273
#9  0x00007fc376c917d9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#10 0x00007fc375ce8dd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fc375a10ead in clone () from /lib64/libc.so.6
Thread 18 (Thread 0x7fc2e477e700 (LWP 291937)):
#0  0x00007fc3759d7e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007fc3759d7cc4 in sleep () from /lib64/libc.so.6
#2  0x00007fc371b59de0 in sig_pause_for_stacktrace () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007fc375a0b1c9 in syscall () from /lib64/libc.so.6
#5  0x00007fc376c917a5 in tbb::internal::futex_wait (comparand=2, futex=0x7fc3732b6cac) at ../../include/tbb/machine/linux_common.h:81
#6  tbb::internal::binary_semaphore::P (this=0x7fc3732b6cac) at ../../src/tbb/semaphore.h:205
#7  rml::internal::thread_monitor::commit_wait (c=..., this=0x7fc3732b6ca0) at ../../src/tbb/../rml/server/thread_monitor.h:255
#8  tbb::internal::rml::private_worker::run (this=0x7fc3732b6c80) at ../../src/tbb/private_server.cpp:273
#9  0x00007fc376c917d9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#10 0x00007fc375ce8dd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fc375a10ead in clone () from /lib64/libc.so.6
Thread 17 (Thread 0x7fc2e1f7a700 (LWP 291936)):
#0  0x00007fc375a0b1c9 in syscall () from /lib64/libc.so.6
#1  0x00007fc376c917a5 in tbb::internal::futex_wait (comparand=2, futex=0x7fc3732b6c2c) at ../../include/tbb/machine/linux_common.h:81
#2  tbb::internal::binary_semaphore::P (this=0x7fc3732b6c2c) at ../../src/tbb/semaphore.h:205
#3  rml::internal::thread_monitor::commit_wait (c=..., this=0x7fc3732b6c20) at ../../src/tbb/../rml/server/thread_monitor.h:255
#4  tbb::internal::rml::private_worker::run (this=0x7fc3732b6c00) at ../../src/tbb/private_server.cpp:273
#5  0x00007fc376c917d9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#6  0x00007fc375ce8dd5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007fc375a10ead in clone () from /lib64/libc.so.6
Thread 16 (Thread 0x7fc2e297b700 (LWP 291935)):
#0  0x00007fc3759d7e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007fc3759d7cc4 in sleep () from /lib64/libc.so.6
#2  0x00007fc371b59de0 in sig_pause_for_stacktrace () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007fc375a0b1c9 in syscall () from /lib64/libc.so.6
#5  0x00007fc376c917a5 in tbb::internal::futex_wait (comparand=2, futex=0x7fc3732b6b2c) at ../../include/tbb/machine/linux_common.h:81
#6  tbb::internal::binary_semaphore::P (this=0x7fc3732b6b2c) at ../../src/tbb/semaphore.h:205
#7  rml::internal::thread_monitor::commit_wait (c=..., this=0x7fc3732b6b20) at ../../src/tbb/../rml/server/thread_monitor.h:255
#8  tbb::internal::rml::private_worker::run (this=0x7fc3732b6b00) at ../../src/tbb/private_server.cpp:273
#9  0x00007fc376c917d9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#10 0x00007fc375ce8dd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fc375a10ead in clone () from /lib64/libc.so.6
Thread 15 (Thread 0x7fc2e3d7d700 (LWP 291934)):
#0  0x00007fc375a0b1c9 in syscall () from /lib64/libc.so.6
#1  0x00007fc376c917a5 in tbb::internal::futex_wait (comparand=2, futex=0x7fc3732b6bac) at ../../include/tbb/machine/linux_common.h:81
#2  tbb::internal::binary_semaphore::P (this=0x7fc3732b6bac) at ../../src/tbb/semaphore.h:205
#3  rml::internal::thread_monitor::commit_wait (c=..., this=0x7fc3732b6ba0) at ../../src/tbb/../rml/server/thread_monitor.h:255
#4  tbb::internal::rml::private_worker::run (this=0x7fc3732b6b80) at ../../src/tbb/private_server.cpp:273
#5  0x00007fc376c917d9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#6  0x00007fc375ce8dd5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007fc375a10ead in clone () from /lib64/libc.so.6
Thread 14 (Thread 0x7fc2e337c700 (LWP 291933)):
#0  0x00007fc375a0b1c9 in syscall () from /lib64/libc.so.6
#1  0x00007fc376c917a5 in tbb::internal::futex_wait (comparand=2, futex=0x7fc3732b6e2c) at ../../include/tbb/machine/linux_common.h:81
#2  tbb::internal::binary_semaphore::P (this=0x7fc3732b6e2c) at ../../src/tbb/semaphore.h:205
#3  rml::internal::thread_monitor::commit_wait (c=..., this=0x7fc3732b6e20) at ../../src/tbb/../rml/server/thread_monitor.h:255
#4  tbb::internal::rml::private_worker::run (this=0x7fc3732b6e00) at ../../src/tbb/private_server.cpp:273
#5  0x00007fc376c917d9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#6  0x00007fc375ce8dd5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007fc375a10ead in clone () from /lib64/libc.so.6
Thread 13 (Thread 0x7fc2e517f700 (LWP 291932)):
#0  0x00007fc3759d7e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007fc3759d7cc4 in sleep () from /lib64/libc.so.6
#2  0x00007fc371b59de0 in sig_pause_for_stacktrace () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007fc375a0b1c9 in syscall () from /lib64/libc.so.6
#5  0x00007fc376c917a5 in tbb::internal::futex_wait (comparand=2, futex=0x7fc3732b6dac) at ../../include/tbb/machine/linux_common.h:81
#6  tbb::internal::binary_semaphore::P (this=0x7fc3732b6dac) at ../../src/tbb/semaphore.h:205
#7  rml::internal::thread_monitor::commit_wait (c=..., this=0x7fc3732b6da0) at ../../src/tbb/../rml/server/thread_monitor.h:255
#8  tbb::internal::rml::private_worker::run (this=0x7fc3732b6d80) at ../../src/tbb/private_server.cpp:273
#9  0x00007fc376c917d9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#10 0x00007fc375ce8dd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fc375a10ead in clone () from /lib64/libc.so.6
Thread 12 (Thread 0x7fc2e7fff700 (LWP 291893)):
#0  0x00007fc375cecd12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc3713e39aa in FedRawDataInputSource::readSupervisor() () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libEventFilterUtilities.so
#2  0x00007fc3762d5ccf in execute_native_thread_routine () at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#3  0x00007fc375ce8dd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fc375a10ead in clone () from /lib64/libc.so.6
Thread 11 (Thread 0x7fc34a9fe700 (LWP 291834)):
#0  0x00007fc3759d7e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007fc3759d7cc4 in sleep () from /lib64/libc.so.6
#2  0x00007fc371b59de0 in sig_pause_for_stacktrace () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007fc375a0b1c9 in syscall () from /lib64/libc.so.6
#5  0x00007fc376c917a5 in tbb::internal::futex_wait (comparand=2, futex=0x7fc3732b6eac) at ../../include/tbb/machine/linux_common.h:81
#6  tbb::internal::binary_semaphore::P (this=0x7fc3732b6eac) at ../../src/tbb/semaphore.h:205
#7  rml::internal::thread_monitor::commit_wait (c=..., this=0x7fc3732b6ea0) at ../../src/tbb/../rml/server/thread_monitor.h:255
#8  tbb::internal::rml::private_worker::run (this=0x7fc3732b6e80) at ../../src/tbb/private_server.cpp:273
#9  0x00007fc376c917d9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#10 0x00007fc375ce8dd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fc375a10ead in clone () from /lib64/libc.so.6
Thread 10 (Thread 0x7fc34c5fd700 (LWP 291830)):
#0  0x00007fc3759d7e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007fc3759d7cc4 in sleep () from /lib64/libc.so.6
#2  0x00007fc371b59de0 in sig_pause_for_stacktrace () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007fc375a0b1c9 in syscall () from /lib64/libc.so.6
#5  0x00007fc376c917a5 in tbb::internal::futex_wait (comparand=2, futex=0x7fc3732b70ac) at ../../include/tbb/machine/linux_common.h:81
#6  tbb::internal::binary_semaphore::P (this=0x7fc3732b70ac) at ../../src/tbb/semaphore.h:205
#7  rml::internal::thread_monitor::commit_wait (c=..., this=0x7fc3732b70a0) at ../../src/tbb/../rml/server/thread_monitor.h:255
#8  tbb::internal::rml::private_worker::run (this=0x7fc3732b7080) at ../../src/tbb/private_server.cpp:273
#9  0x00007fc376c917d9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#10 0x00007fc375ce8dd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fc375a10ead in clone () from /lib64/libc.so.6
Thread 9 (Thread 0x7fc34b7ff700 (LWP 291829)):
#0  0x00007fc3759d7e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007fc3759d7cc4 in sleep () from /lib64/libc.so.6
#2  0x00007fc371b59de0 in sig_pause_for_stacktrace () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007fc375a0b1c9 in syscall () from /lib64/libc.so.6
#5  0x00007fc376c917a5 in tbb::internal::futex_wait (comparand=2, futex=0x7fc3732b6fac) at ../../include/tbb/machine/linux_common.h:81
#6  tbb::internal::binary_semaphore::P (this=0x7fc3732b6fac) at ../../src/tbb/semaphore.h:205
#7  rml::internal::thread_monitor::commit_wait (c=..., this=0x7fc3732b6fa0) at ../../src/tbb/../rml/server/thread_monitor.h:255
#8  tbb::internal::rml::private_worker::run (this=0x7fc3732b6f80) at ../../src/tbb/private_server.cpp:273
#9  0x00007fc376c917d9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#10 0x00007fc375ce8dd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fc375a10ead in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x7fc34cffe700 (LWP 291823)):
#0  0x00007fc3759d7e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007fc3759d7cc4 in sleep () from /lib64/libc.so.6
#2  0x00007fc371b59de0 in sig_pause_for_stacktrace () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007fc375a0b1c9 in syscall () from /lib64/libc.so.6
#5  0x00007fc376c917a5 in tbb::internal::futex_wait (comparand=2, futex=0x7fc3732b6f2c) at ../../include/tbb/machine/linux_common.h:81
#6  tbb::internal::binary_semaphore::P (this=0x7fc3732b6f2c) at ../../src/tbb/semaphore.h:205
#7  rml::internal::thread_monitor::commit_wait (c=..., this=0x7fc3732b6f20) at ../../src/tbb/../rml/server/thread_monitor.h:255
#8  tbb::internal::rml::private_worker::run (this=0x7fc3732b6f00) at ../../src/tbb/private_server.cpp:273
#9  0x00007fc376c917d9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#10 0x00007fc375ce8dd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fc375a10ead in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7fc34d9ff700 (LWP 291822)):
#0  0x00007fc3759d7e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007fc3759d7cc4 in sleep () from /lib64/libc.so.6
#2  0x00007fc371b59de0 in sig_pause_for_stacktrace () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007fc375a0b1c9 in syscall () from /lib64/libc.so.6
#5  0x00007fc376c917a5 in tbb::internal::futex_wait (comparand=2, futex=0x7fc3732b702c) at ../../include/tbb/machine/linux_common.h:81
#6  tbb::internal::binary_semaphore::P (this=0x7fc3732b702c) at ../../src/tbb/semaphore.h:205
#7  rml::internal::thread_monitor::commit_wait (c=..., this=0x7fc3732b7020) at ../../src/tbb/../rml/server/thread_monitor.h:255
#8  tbb::internal::rml::private_worker::run (this=0x7fc3732b7000) at ../../src/tbb/private_server.cpp:273
#9  0x00007fc376c917d9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#10 0x00007fc375ce8dd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fc375a10ead in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7fc34edfe700 (LWP 291819)):
#0  0x00007fc3759d7e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007fc3759d7cc4 in sleep () from /lib64/libc.so.6
#2  0x00007fc371b59de0 in sig_pause_for_stacktrace () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007fc375a0b1c9 in syscall () from /lib64/libc.so.6
#5  0x00007fc376c917a5 in tbb::internal::futex_wait (comparand=2, futex=0x7fc3732b71ac) at ../../include/tbb/machine/linux_common.h:81
#6  tbb::internal::binary_semaphore::P (this=0x7fc3732b71ac) at ../../src/tbb/semaphore.h:205
#7  rml::internal::thread_monitor::commit_wait (c=..., this=0x7fc3732b71a0) at ../../src/tbb/../rml/server/thread_monitor.h:255
#8  tbb::internal::rml::private_worker::run (this=0x7fc3732b7180) at ../../src/tbb/private_server.cpp:273
#9  0x00007fc376c917d9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#10 0x00007fc375ce8dd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fc375a10ead in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7fc34f7ff700 (LWP 291817)):
#0  0x00007fc3759d7e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007fc3759d7cc4 in sleep () from /lib64/libc.so.6
#2  0x00007fc371b59de0 in sig_pause_for_stacktrace () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007fc375a0b1c9 in syscall () from /lib64/libc.so.6
#5  0x00007fc376c917a5 in tbb::internal::futex_wait (comparand=2, futex=0x7fc3732b712c) at ../../include/tbb/machine/linux_common.h:81
#6  tbb::internal::binary_semaphore::P (this=0x7fc3732b712c) at ../../src/tbb/semaphore.h:205
#7  rml::internal::thread_monitor::commit_wait (c=..., this=0x7fc3732b7120) at ../../src/tbb/../rml/server/thread_monitor.h:255
#8  tbb::internal::rml::private_worker::run (this=0x7fc3732b7100) at ../../src/tbb/private_server.cpp:273
#9  0x00007fc376c917d9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#10 0x00007fc375ce8dd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fc375a10ead in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7fc35058b700 (LWP 291700)):
#0  0x00007fc3759d7e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007fc3759d7cc4 in sleep () from /lib64/libc.so.6
#2  0x00007fc371405869 in evf::FastMonitoringService::dowork() () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libEventFilterUtilities.so
#3  0x00007fc3762d5ccf in execute_native_thread_routine () at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#4  0x00007fc375ce8dd5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc375a10ead in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7fc353fff700 (LWP 291659)):
#0  0x00007fc375cec965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc3762d05ec in __gthread_cond_wait (__mutex=<optimized out>, __cond=<optimized out>) at /build/cmsbld/auto-builds/CMSSW_10_6_0_pre4-slc7_amd64_gcc820/build/CMSSW_10_6_0_pre4-build/BUILD/slc7_amd64_gcc820/external/gcc/8.2.0-pafccj/gcc-8.2.0/obj/x86_64-unknown-linux-gnu/libstdc++-v3/include/x86_64-unknown-linux-gnu/bits/gthr-default.h:864
#2  std::condition_variable::wait (this=<optimized out>, __lock=...) at ../../../../../libstdc++-v3/src/c++11/condition_variable.cc:53
#3  0x00007fc3713e0519 in FedRawDataInputSource::readWorker(unsigned int) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libEventFilterUtilities.so
#4  0x00007fc3762d5ccf in execute_native_thread_routine () at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#5  0x00007fc375ce8dd5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007fc375a10ead in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fc35f91a700 (LWP 291624)):
#0  0x00007fc375cf0179 in waitpid () from /lib64/libpthread.so.0
#1  0x00007fc371b59f97 in edm::service::cmssw_stacktrace_fork() () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#2  0x00007fc371b5aa5a in edm::service::InitRootHandlers::stacktraceHelperThread() () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3  0x00007fc3762d5ccf in execute_native_thread_routine () at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#4  0x00007fc375ce8dd5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc375a10ead in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fc373ede4c0 (LWP 291606)):
#0  0x00007fc375a0620d in poll () from /lib64/libc.so.6
#1  0x00007fc371b5a3ff in full_read.constprop () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#2  0x00007fc371b5ab3c in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3  0x00007fc371b5ba19 in sig_dostack_then_abort () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007fc375949207 in raise () from /lib64/libc.so.6
#6  0x00007fc37594a8f8 in abort () from /lib64/libc.so.6
#7  0x00007fc375942026 in __assert_fail_base () from /lib64/libc.so.6
#8  0x00007fc3759420d2 in __assert_fail () from /lib64/libc.so.6
#9  0x00007fc351eaa1a8 in l1t::L1TRawToDigi::produce(edm::Event&, edm::EventSetup const&) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginEventFilterL1TRawToDigiAuto.so
#10 0x00007fc3785eb956 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#11 0x00007fc37850cb83 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ModuleCallingContext const*) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#12 0x00007fc3784b4c8a in decltype ({parm#1}()) edm::convertException::wrap<bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#13 0x00007fc3784b4e5e in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#14 0x00007fc3784b665b in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#15 0x00007fc3784b7674 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#16 0x00007fc376c9c27d in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0x7fc373580e00, context_guard=..., t=t@entry=0x7fc37347e040, isolation=isolation@entry=0) at ../../include/tbb/task.h:992
#17 0x00007fc376c9c575 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7fc373580e00, parent=..., child=<optimized out>) at ../../include/tbb/task.h:992
#18 0x00007fc37855eb85 in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#19 0x00007fc3785664de in edm::EventProcessor::runToCompletion() () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#20 0x0000000000410c7b in main::{lambda()#1}::operator()() const ()
#21 0x000000000040f2e2 in main ()

Current Modules:

Module: L1TRawToDigi:hltGtStage2Digis (crashed)
Module: none
Module: none
Module: none
Module: none
Module: none
Module: none
Module: none
Module: none
smorovic commented 4 years ago

I repeated it (this time also I removed the log):

cmsRun: /tmp/smorovic/CMSSW_11_0_1/src/EventFilter/L1TRawToDigi/plugins/L1TRawToDigi.cc:141: virtual void l1t::L1TRawToDigi::produce(edm::Event&, const edm::EventSetup&): Assertion `0' failed.

A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.

Mon Mar 23 17:20:54 CET 2020
Thread 12 (Thread 0x7f89a367f700 (LWP 294725)):
#0  0x00007f8a35dcad12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f8a314c19aa in FedRawDataInputSource::readSupervisor() () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libEventFilterUtilities.so
#2  0x00007f8a363b2ccf in execute_native_thread_routine () at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#3  0x00007f8a35dc6dd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f8a35aeeead in clone () from /lib64/libc.so.6
Thread 11 (Thread 0x7f8a0abff700 (LWP 294621)):
#0  0x00007f8a35ab5e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007f8a35ab5cc4 in sleep () from /lib64/libc.so.6
#2  0x00007f8a31c37de0 in sig_pause_for_stacktrace () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007f8a387f2c68 in edm::TypeWithDict::typeInfo() const () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreReflection.so
#5  0x00007f8a3140d8fc in edm::StreamSerializer::serializeEvent(SerializeDataBuffer&, edm::EventForOutput const&, edm::Hash<1> const&, edm::StreamerCompressionAlgo, int, unsigned int) const () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libIOPoolStreamer.so
#6  0x00007f8a31419c13 in edm::StreamerOutputModuleCommon::serializeEvent(SerializeDataBuffer&, edm::EventForOutput const&, edm::Handle<edm::TriggerResults> const&, edm::Hash<1> const&) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libIOPoolStreamer.so
#7  0x00007f8a314cb65b in evf::EvFOutputModule::write(edm::EventForOutput const&) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libEventFilterUtilities.so
#8  0x00007f8a386c4c9f in edm::one::OutputModuleBase::doEvent(edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#9  0x00007f8a385e8543 in edm::WorkerT<edm::one::OutputModuleBase>::implDo(edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ModuleCallingContext const*) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#10 0x00007f8a38590c8a in decltype ({parm#1}()) edm::convertException::wrap<bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#11 0x00007f8a38590e5e in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#12 0x00007f8a3859265b in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#13 0x00007f8a38593cc5 in void edm::SerialTaskQueueChain::actionToRun<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#14 0x00007f8a38593d81 in edm::SerialTaskQueue::QueuedTask<void edm::SerialTaskQueueChain::push<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&)::{lambda()#1}>::execute() () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#15 0x00007f8a36d7827d in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0x7f8a33473e00, context_guard=..., t=0x7f8a3347e640, t@entry=0x7f8a33475c40, isolation=isolation@entry=0) at ../../include/tbb/task.h:992
#16 0x00007f8a36d78575 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7f8a33473e00, parent=..., child=<optimized out>) at ../../include/tbb/task.h:992
#17 0x00007f8a36d7235f in tbb::internal::arena::process (this=0x7f8a335c3480, s=...) at ../../src/tbb/arena.cpp:152
#18 0x00007f8a36d70ed3 in tbb::internal::market::process (this=0x7f8a335e3580, j=...) at ../../src/tbb/market.cpp:667
#19 0x00007f8a36d6d5cc in tbb::internal::rml::private_worker::run (this=0x7f8a33306e80) at ../../src/tbb/private_server.cpp:266
#20 0x00007f8a36d6d7d9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#21 0x00007f8a35dc6dd5 in start_thread () from /lib64/libpthread.so.0
#22 0x00007f8a35aeeead in clone () from /lib64/libc.so.6
Thread 10 (Thread 0x7f8a0bdfc700 (LWP 294620)):
#0  0x00007f8a35ab5e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007f8a35ab5cc4 in sleep () from /lib64/libc.so.6
#2  0x00007f8a31c37de0 in sig_pause_for_stacktrace () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007f8a35ae91c9 in syscall () from /lib64/libc.so.6
#5  0x00007f8a36d6d7a5 in tbb::internal::futex_wait (comparand=2, futex=0x7f8a33306fac) at ../../include/tbb/machine/linux_common.h:81
#6  tbb::internal::binary_semaphore::P (this=0x7f8a33306fac) at ../../src/tbb/semaphore.h:205
#7  rml::internal::thread_monitor::commit_wait (c=..., this=0x7f8a33306fa0) at ../../src/tbb/../rml/server/thread_monitor.h:255
#8  tbb::internal::rml::private_worker::run (this=0x7f8a33306f80) at ../../src/tbb/private_server.cpp:273
#9  0x00007f8a36d6d7d9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#10 0x00007f8a35dc6dd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f8a35aeeead in clone () from /lib64/libc.so.6
Thread 9 (Thread 0x7f8a0c7fd700 (LWP 294619)):
#0  0x00007f8a35ab5e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007f8a35ab5cc4 in sleep () from /lib64/libc.so.6
#2  0x00007f8a31c37de0 in sig_pause_for_stacktrace () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007f8a35ad3d47 in sched_yield () from /lib64/libc.so.6
#5  0x00007f8a36d77465 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0x7f8a334b3e00, completion_ref_count=<optimized out>, isolation=0) at ../../src/tbb/mailbox.h:214
#6  0x00007f8a36d722be in tbb::internal::arena::process (this=0x7f8a335c3480, s=...) at ../../include/tbb/task.h:992
#7  0x00007f8a36d70ed3 in tbb::internal::market::process (this=0x7f8a335e3580, j=...) at ../../src/tbb/market.cpp:667
#8  0x00007f8a36d6d5cc in tbb::internal::rml::private_worker::run (this=0x7f8a33307080) at ../../src/tbb/private_server.cpp:266
#9  0x00007f8a36d6d7d9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#10 0x00007f8a35dc6dd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f8a35aeeead in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x7f8a0d1fe700 (LWP 294618)):
#0  0x00007f8a35ab5e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007f8a35ab5cc4 in sleep () from /lib64/libc.so.6
#2  0x00007f8a31c37de0 in sig_pause_for_stacktrace () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007f8a36d0ae5a in ?? () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/external/slc7_amd64_gcc820/lib/libz.so.1
#5  0x00007f8a36d0c962 in ?? () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/external/slc7_amd64_gcc820/lib/libz.so.1
#6  0x00007f8a36d0d532 in deflate () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/external/slc7_amd64_gcc820/lib/libz.so.1
#7  0x00007f8a36d16755 in compress2 () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/external/slc7_amd64_gcc820/lib/libz.so.1
#8  0x00007f8a3140bc9e in edm::StreamSerializer::compressBuffer(unsigned char*, unsigned int, std::vector<unsigned char, std::allocator<unsigned char> >&, int, unsigned int) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libIOPoolStreamer.so
#9  0x00007f8a3140de0b in edm::StreamSerializer::serializeEvent(SerializeDataBuffer&, edm::EventForOutput const&, edm::Hash<1> const&, edm::StreamerCompressionAlgo, int, unsigned int) const () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libIOPoolStreamer.so
#10 0x00007f8a31419c13 in edm::StreamerOutputModuleCommon::serializeEvent(SerializeDataBuffer&, edm::EventForOutput const&, edm::Handle<edm::TriggerResults> const&, edm::Hash<1> const&) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libIOPoolStreamer.so
#11 0x00007f8a314cb65b in evf::EvFOutputModule::write(edm::EventForOutput const&) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libEventFilterUtilities.so
#12 0x00007f8a386c4c9f in edm::one::OutputModuleBase::doEvent(edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#13 0x00007f8a385e8543 in edm::WorkerT<edm::one::OutputModuleBase>::implDo(edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ModuleCallingContext const*) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#14 0x00007f8a38590c8a in decltype ({parm#1}()) edm::convertException::wrap<bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#15 0x00007f8a38590e5e in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#16 0x00007f8a3859265b in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#17 0x00007f8a38593cc5 in void edm::SerialTaskQueueChain::actionToRun<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#18 0x00007f8a38593d81 in edm::SerialTaskQueue::QueuedTask<void edm::SerialTaskQueueChain::push<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&)::{lambda()#1}>::execute() () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#19 0x00007f8a36d7827d in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0x7f8a334c3e00, context_guard=..., t=0x7f8a3347e540, t@entry=0x7f8a0e5a5b40, isolation=isolation@entry=0) at ../../include/tbb/task.h:992
#20 0x00007f8a36d78575 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7f8a334c3e00, parent=..., child=<optimized out>) at ../../include/tbb/task.h:992
#21 0x00007f8a36d7235f in tbb::internal::arena::process (this=0x7f8a335c3480, s=...) at ../../src/tbb/arena.cpp:152
#22 0x00007f8a36d70ed3 in tbb::internal::market::process (this=0x7f8a335e3580, j=...) at ../../src/tbb/market.cpp:667
#23 0x00007f8a36d6d5cc in tbb::internal::rml::private_worker::run (this=0x7f8a33306f00) at ../../src/tbb/private_server.cpp:266
#24 0x00007f8a36d6d7d9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#25 0x00007f8a35dc6dd5 in start_thread () from /lib64/libpthread.so.0
#26 0x00007f8a35aeeead in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7f8a0ddff700 (LWP 294617)):
#0  0x00007f8a35ab5e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007f8a35ab5cc4 in sleep () from /lib64/libc.so.6
#2  0x00007f8a31c37de0 in sig_pause_for_stacktrace () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007f8a35ad3d47 in sched_yield () from /lib64/libc.so.6
#5  0x00007f8a36d77465 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0x7f8a3347be00, completion_ref_count=<optimized out>, isolation=0) at ../../src/tbb/mailbox.h:214
#6  0x00007f8a36d722be in tbb::internal::arena::process (this=0x7f8a335c3480, s=...) at ../../include/tbb/task.h:992
#7  0x00007f8a36d70ed3 in tbb::internal::market::process (this=0x7f8a335e3580, j=...) at ../../src/tbb/market.cpp:667
#8  0x00007f8a36d6d5cc in tbb::internal::rml::private_worker::run (this=0x7f8a33307000) at ../../src/tbb/private_server.cpp:266
#9  0x00007f8a36d6d7d9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#10 0x00007f8a35dc6dd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f8a35aeeead in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7f8a0effe700 (LWP 294616)):
#0  0x00007f8a35ab5e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007f8a35ab5cc4 in sleep () from /lib64/libc.so.6
#2  0x00007f8a31c37de0 in sig_pause_for_stacktrace () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007f8a35ae91c9 in syscall () from /lib64/libc.so.6
#5  0x00007f8a36d6d7a5 in tbb::internal::futex_wait (comparand=2, futex=0x7f8a333071ac) at ../../include/tbb/machine/linux_common.h:81
#6  tbb::internal::binary_semaphore::P (this=0x7f8a333071ac) at ../../src/tbb/semaphore.h:205
#7  rml::internal::thread_monitor::commit_wait (c=..., this=0x7f8a333071a0) at ../../src/tbb/../rml/server/thread_monitor.h:255
#8  tbb::internal::rml::private_worker::run (this=0x7f8a33307180) at ../../src/tbb/private_server.cpp:273
#9  0x00007f8a36d6d7d9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#10 0x00007f8a35dc6dd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f8a35aeeead in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7f8a0f9ff700 (LWP 294615)):
#0  0x00007f8a35ae420d in poll () from /lib64/libc.so.6
#1  0x00007f8a31c383ff in full_read.constprop () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#2  0x00007f8a31c38b3c in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3  0x00007f8a31c39a19 in sig_dostack_then_abort () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f8a35a27207 in raise () from /lib64/libc.so.6
#6  0x00007f8a35a288f8 in abort () from /lib64/libc.so.6
#7  0x00007f8a35a20026 in __assert_fail_base () from /lib64/libc.so.6
#8  0x00007f8a35a200d2 in __assert_fail () from /lib64/libc.so.6
#9  0x00007f8a11f890c7 in l1t::L1TRawToDigi::produce(edm::Event&, edm::EventSetup const&) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginEventFilterL1TRawToDigiAuto.so
#10 0x00007f8a386c7956 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#11 0x00007f8a385e8b83 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ModuleCallingContext const*) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#12 0x00007f8a38590c8a in decltype ({parm#1}()) edm::convertException::wrap<bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#13 0x00007f8a38590e5e in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#14 0x00007f8a3859265b in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#15 0x00007f8a38593674 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#16 0x00007f8a36d7827d in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0x7f8a334bbe00, context_guard=..., t=t@entry=0x7f8a334cd640, isolation=isolation@entry=0) at ../../include/tbb/task.h:992
#17 0x00007f8a36d78575 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7f8a334bbe00, parent=..., child=<optimized out>) at ../../include/tbb/task.h:992
#18 0x00007f8a36d7235f in tbb::internal::arena::process (this=0x7f8a335c3480, s=...) at ../../src/tbb/arena.cpp:152
#19 0x00007f8a36d70ed3 in tbb::internal::market::process (this=0x7f8a335e3580, j=...) at ../../src/tbb/market.cpp:667
#20 0x00007f8a36d6d5cc in tbb::internal::rml::private_worker::run (this=0x7f8a33307100) at ../../src/tbb/private_server.cpp:266
#21 0x00007f8a36d6d7d9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#22 0x00007f8a35dc6dd5 in start_thread () from /lib64/libpthread.so.0
#23 0x00007f8a35aeeead in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7f8a1066a700 (LWP 294614)):
#0  0x00007f8a35ab5e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007f8a35ab5cc4 in sleep () from /lib64/libc.so.6
#2  0x00007f8a314e3869 in evf::FastMonitoringService::dowork() () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libEventFilterUtilities.so
#3  0x00007f8a363b2ccf in execute_native_thread_routine () at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#4  0x00007f8a35dc6dd5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f8a35aeeead in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f8a141ff700 (LWP 294613)):
#0  0x00007f8a35dca965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f8a363ad5ec in __gthread_cond_wait (__mutex=<optimized out>, __cond=<optimized out>) at /build/cmsbld/auto-builds/CMSSW_10_6_0_pre4-slc7_amd64_gcc820/build/CMSSW_10_6_0_pre4-build/BUILD/slc7_amd64_gcc820/external/gcc/8.2.0-pafccj/gcc-8.2.0/obj/x86_64-unknown-linux-gnu/libstdc++-v3/include/x86_64-unknown-linux-gnu/bits/gthr-default.h:864
#2  std::condition_variable::wait (this=<optimized out>, __lock=...) at ../../../../../libstdc++-v3/src/c++11/condition_variable.cc:53
#3  0x00007f8a314be519 in FedRawDataInputSource::readWorker(unsigned int) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libEventFilterUtilities.so
#4  0x00007f8a363b2ccf in execute_native_thread_routine () at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#5  0x00007f8a35dc6dd5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f8a35aeeead in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f8a1f9f8700 (LWP 294606)):
#0  0x00007f8a35dce179 in waitpid () from /lib64/libpthread.so.0
#1  0x00007f8a31c37f97 in edm::service::cmssw_stacktrace_fork() () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#2  0x00007f8a31c38a5a in edm::service::InitRootHandlers::stacktraceHelperThread() () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3  0x00007f8a363b2ccf in execute_native_thread_routine () at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#4  0x00007f8a35dc6dd5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f8a35aeeead in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f8a33fbc4c0 (LWP 294564)):
#0  0x00007f8a35ab5e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007f8a35ab5cc4 in sleep () from /lib64/libc.so.6
#2  0x00007f8a31c37de0 in sig_pause_for_stacktrace () from /opt/offline/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_0_1/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007f8a35ae91c9 in syscall () from /lib64/libc.so.6
#5  0x00007f8a36d6d411 in tbb::internal::futex_wakeup_one (futex=0x7f8a3330712c) at ../../include/tbb/machine/linux_common.h:90
#6  tbb::internal::binary_semaphore::V (this=0x7f8a3330712c) at ../../src/tbb/semaphore.h:214
#7  tbb::internal::binary_semaphore::V (this=0x7f8a3330712c) at ../../src/tbb/semaphore.h:211
#8  rml::internal::thread_monitor::notify (this=0x7f8a33307120) at ../../src/tbb/../rml/server/thread_monitor.h:240
#9  tbb::internal::rml::private_worker::wake_or_launch (this=0x7f8a33307100) at ../../src/tbb/private_server.cpp:312
#10 tbb::internal::rml::private_server::wake_some(int) () at ../../src/tbb/private_server.cpp:395
#11 0x00007f8a36d75512 in tbb::internal::generic_scheduler::local_spawn(tbb::task*, tbb::task*&) () at ../../src/tbb/scheduler.cpp:712
#12 0x00007f8a3859042a in void edm::Worker::doWorkAsync<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::WaitingTask*, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#13 0x00007f8a386957aa in edm::Path::runNextWorkerAsync(unsigned int, edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ServiceToken const&, edm::StreamID const&, edm::StreamContext const*) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#14 0x00007f8a38696b64 in edm::Path::workerFinished(std::__exception_ptr::exception_ptr const*, unsigned int, edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ServiceToken const&, edm::StreamID const&, edm::StreamContext const*) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#15 0x00007f8a38696c8c in edm::FunctorWaitingTask<edm::Path::runNextWorkerAsync(unsigned int, edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ServiceToken const&, edm::StreamID const&, edm::StreamContext const*)::{lambda(std::__exception_ptr::exception_ptr const*)#1}>::execute() () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#16 0x00007f8a36d7827d in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0x7f8a335d0e00, context_guard=..., t=t@entry=0x7f8a334cd340, isolation=isolation@entry=0) at ../../include/tbb/task.h:992
#17 0x00007f8a36d78575 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7f8a335d0e00, parent=..., child=<optimized out>) at ../../include/tbb/task.h:992
#18 0x00007f8a3863ab85 in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#19 0x00007f8a386424de in edm::EventProcessor::runToCompletion() () from /tmp/smorovic/CMSSW_11_0_1/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#20 0x0000000000410c7b in main::{lambda()#1}::operator()() const ()
#21 0x000000000040f2e2 in main ()

Current Modules:

Module: L1TRawToDigi:hltGtStage2Digis (crashed)
Module: none
Module: EvFOutputModule:hltOutputPhysicsVirginRaw7
Module: none
Module: none
Module: none
Module: EvFOutputModule:hltOutputPhysicsVirginRaw4
Module: none

A fatal system signal has occurred: abort signal
Dr15Jones commented 4 years ago

Unfortunately that particular traceback doesn't show anything else going on of interest. It often takes many attempts with tracebacks for one of them to successfully catch another thread at the right time. So if you could continue, that would be great.

One thing that looks weird to me is why does TBB have so many threads? It is managing thread 1, 5-11, 13-19. That is 7 more than it should since this job only is supposed to use 8 threads?