Closed qzhang234 closed 3 years ago
I have a question: is it possible to introduce an additional attribute in the EpicsSignal class called 'Essential'? If a PV is listed as non-essential, that PV's disconnection/malfunction WILL NOT crash a scan. As useful as scalers are, I don't think failing to reach a scaler PV 1 time out of 10,000 should be the cause of a long-term stability problem.
This will also allow staff scientists at different facilities to customize their own Bluesky install so that it hits everyone's own sweet spot on the entire spectrum, anywhere from 'Crash Early Crash Hard' to 'Run Even If the Sky Falls'.
What do you think, Pete @prjemian ?
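To make the proposal concrete, here is a minimal sketch of the intended behavior in plain Python. It does not use or modify ophyd; `SignalError`, `guarded`, and the `essential`, `retries`, and `default` parameters are hypothetical names chosen only for illustration, and `action` stands in for whatever call talks to a single PV.

```python
import logging

logger = logging.getLogger(__name__)


class SignalError(Exception):
    """Stand-in for a failed PV operation (hypothetical, not an ophyd class)."""


def guarded(action, *, essential=True, retries=0, default=None):
    """Run ``action()`` and tolerate failure only when ``essential`` is False.

    action    : zero-argument callable that talks to one PV (hypothetical)
    essential : if True, the last failure is re-raised and the scan stops
    retries   : number of extra attempts before giving up
    default   : value returned when a non-essential action keeps failing
    """
    attempts = max(1, retries + 1)
    last_exc = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except SignalError as exc:
            # Record every failure so it can still be investigated later.
            logger.warning("attempt %d of %d failed: %s", attempt, attempts, exc)
            last_exc = exc
    if essential:
        # 'Crash Early Crash Hard': propagate the failure to the caller.
        raise last_exc
    # 'Run Even If the Sky Falls': swallow the failure and carry on.
    return default
```

With this sketch, a non-essential call that fails is logged and skipped, while an essential one still stops the scan. Logging each failure keeps the root cause visible even when the scan continues.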
`8idi:scaler1.CNT` is used to tell the scaler to count and to detect when the scaler has finished counting. If this is not needed in the scan, why is it being called at all?
Adding such an attribute to `EpicsSignal` is possible on a local level. Getting it used by all possible Devices that use `EpicsSignal` is a monumental task. The standard response is a question: why is this PV causing a `FailedStatus` exception? That is the problem to fix, instead of spending weeks implementing an `essential` attribute.
So, I do not believe that adding such an attribute is the proper way to fix this problem.
I totally agree. I'm starting to ask myself WHY the scaler is even being called in the first place. I believe we (or more like I) need to go over `AD_Acquire` line by line to make sure I know EXACTLY which PVs are called and remove the ones that are not necessary.
That should provide a more fundamental solution and also improve my Bluesky debugging skills.
This is the line that caused the crash. It's consistent with the error in the screen output: https://github.com/aps-8id-dys/ipython-8idiuser/blob/ad96dd733f2fbb478bbb1156cb683fea92475731/profile_bluesky/startup/instrument/plans/xpcs_acquire.py#L281 And this is the PV in the crash line. It's literally the 'Count' button on the scaler. None of the PVs on scaler1 are essential for the analysis, and none are grabbed by the DM workflow.
For future debugging, I would like to propose the following:
I think I now know Bluesky well enough to identify which line causes the crash, but I don't know Bluesky well enough to remove the corresponding PV in a clean manner. @prjemian Could you teach me?
Sent you a Teams invite for the morning at 10 am.
Closing this for now. Will wait for `AD_Acquire` for Rigaku to be completed, then resume the Bluesky stability test. If the problem recurs, I will submit another issue.
Bluesky crashed at 05:48 am on 01/21 after ~ 12 hours of continuous operation (11,155 out of 20,000 measurements).
The crash is caused by `8idi:scaler1.CNT`. This PV is not used by the DM workflow, to the best of my knowledge. The error message from the terminal is attached.
Error message from Bluesky
```
HDF5 workflow file name: /home/8ididata/2021-1/demo202101/F107_11154_att00_Test/F107_11154_att00_Test_0001-100000.hdf
F107_11155_att00_Test
HDF5 workflow file name: /home/8ididata/2021-1/demo202101/F107_11155_att00_Test/F107_11155_att00_Test_0001-100000.hdf
F107_11156_att00_Test
---------------------------------------------------------------------------
FailedStatus                              Traceback (most recent call last)
```

Also attached is a screenshot from the DM workflow monitor: