HCALRunControl / levelOneHCALFM

HCAL Function Manager
https://twiki.cern.ch/twiki/bin/view/CMS/HCALFunctionManager

Protect runinfoPublish during resetAction #377

Closed kakwok closed 6 years ago

kakwok commented 6 years ago

When one of the xDAQs fails to init and the user then tries to use resetAction, the failed xDAQ gives a null pointer here:

     // find all RunInfoServers controlled by this FM and acquire the information
          for (QualifiedResource qr : functionManager.containerhcalRunInfoServer.getApplications() ) {
            RISR.acquire((XdaqApplication)qr);
          }

A simple isEmpty() check should be able to protect against this.


```
ERROR EventHandler: callback method resetAction failure
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at rcms.fm.fw.user.UserEventHandler.invokeActions(UserEventHandler.java:237)
at rcms.fm.fw.user.UserEventHandler.handleEvent(UserEventHandler.java:220)
at rcms.fm.fw.EventHandlerManager.handleEvent(EventHandlerManager.java:131)
at rcms.fm.fw.EventProcessor.processEvent(EventProcessor.java:100)
at rcms.fm.fw.FunctionManager.processEvent(FunctionManager.java:1253)
at rcms.fm.fw.FunctionManager.access$500(FunctionManager.java:51)
at rcms.fm.fw.FunctionManager$ConsumerThread.run(FunctionManager.java:1200)
Caused by: java.lang.NullPointerException
at rcms.fm.app.level1.HCALEventHandler$RunInfoServerReader.acquire(HCALEventHandler.java:1831)
at rcms.fm.app.level1.HCALEventHandler.publishRunInfoSummaryfromXDAQ(HCALEventHandler.java:1102)
at rcms.fm.app.level1.HCALlevelTwoEventHandler.resetAction(HCALlevelTwoEventHandler.java:317)
```

kakwok commented 6 years ago

An emptiness check was already in place, so this needs more investigation. The offending line is https://github.com/HCALRunControl/levelOneHCALFM/blob/18.1.1/src/rcms/fm/app/level1/HCALEventHandler.java#L1831 and it looks like an XDAQ query problem.

jhakala commented 6 years ago

I've tried reproducing this in every way I could think of, and I can't reproduce it. Without knowing exactly the steps that led to this situation, it will be tough to diagnose. Since this is unlikely to occur again and is by no means a "show-stopper," I will change the label to low priority.

kakwok commented 6 years ago

Adding more information about the bug:

1) The runinfo server executive was the executive that crashed.
2) The chain of events that led to it was: configured -> Reset -> InitXDAQ() -> failed [port conflict] -> Reset -> null pointer

The root cause is that the QR containers are reset at the end of initXDAQ, which is never reached if qg.init() fails. Therefore, in the second reset, the runinfo container is not null (it is left over from the configured state) but points to a dead xDAQ.
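
To make the sequence concrete, here is a minimal, self-contained sketch of the stale-container situation and one possible way around it (clearing the container before the init attempt rather than only after it succeeds). Everything in it is a hypothetical stand-in: `App`, `Fm`, `initXdaq`, `reset`, and the `clearFirst` flag are not RCMS or levelOneHCALFM names, and the dead application is modeled with an `IllegalStateException` rather than the `NullPointerException` seen in the real trace. It is not the fix that was applied in the repository, only an illustration of the ordering problem.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the bug; none of these classes exist in RCMS.
class StaleContainerSketch {

    /** Stand-in for an XDAQ application handle. */
    static class App {
        boolean alive = true;

        String query() {
            // The real code hit a NullPointerException; an explicit exception models it here.
            if (!alive) throw new IllegalStateException("application is not reachable");
            return "ok";
        }
    }

    /** Stand-in for the function manager that owns the RunInfo server container. */
    static class Fm {
        List<App> runInfoServers = new ArrayList<>();

        /** Models initXDAQ: the container is only refreshed at the end, unless clearFirst is set. */
        void initXdaq(boolean initSucceeds, boolean clearFirst) {
            if (clearFirst) runInfoServers = new ArrayList<>();   // assumed mitigation
            if (!initSucceeds) throw new IllegalStateException("init failed (e.g. port conflict)");
            List<App> fresh = new ArrayList<>();
            fresh.add(new App());
            runInfoServers = fresh;                               // reached only on success
        }

        /** Models resetAction: publish run info, tear down the old applications, then re-init. */
        void reset(boolean initSucceeds, boolean clearFirst) {
            for (App app : runInfoServers) app.query();           // throws on a dead application
            for (App app : runInfoServers) app.alive = false;     // old executives are destroyed
            initXdaq(initSucceeds, clearFirst);
        }
    }

    /** Replays configured -> Reset -> failed init -> Reset and reports whether the second reset works. */
    static boolean secondResetSurvives(boolean clearFirst) {
        Fm fm = new Fm();
        fm.initXdaq(true, clearFirst);                            // reach the configured state
        try {
            fm.reset(false, clearFirst);                          // first reset: init fails
        } catch (IllegalStateException expected) {
            // the FM would go to an error state here
        }
        try {
            fm.reset(true, clearFirst);                           // second reset
            return true;
        } catch (IllegalStateException stale) {
            return false;                                         // container pointed at a dead app
        }
    }

    public static void main(String[] args) {
        System.out.println("container refreshed at end only: " + secondResetSurvives(false));
        System.out.println("container cleared before init:   " + secondResetSurvives(true));
    }
}
```

In this model the second reset fails when the container is only refreshed at the end of a successful init, and survives when the container is cleared up front. Note that an isEmpty() check alone does not help in the first case, because the leftover container is non-empty.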