BS died and core dumped when WAXS detector IOC crashed

jilavsky commented 2 years ago

This may (or may not) have been provoked by WAXS detector needing restart. SOmehow Dectirs software died.

But this is what I see in command line:

In [1]: CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "usaxsserver.xray.aps.anl.gov:41882"
    Source File: ../cac.cpp line 1237
epicsThread: Unexpected C++ exception "epicsTime_localtime failed" with type "St11logic_error" in thread "CAC-TCP-send" at <UKN DATE>
terminate called after throwing an instance of 'std::logic_error'
  what():  epicsTime_localtime failed
/home/beams/USAXS/bin/blueskyUSAXS: line 16: 37440 Aborted                 (core dumped) ipython ${IPYTHON_OPTIONS}
(/share1/bluesky/conda_envs/bluesky_2022_1) usaxs@usaxscontrol .../USAXS_data/2022-02 $ (/share1/bluesky/conda_envs/bluesky_2022_1) usaxs@usaxscontrol .../USAXS_data/2022-02 $ (/share1/bluesky/conda_envs/bluesky_2022_1) usaxs@usaxscontrol .../USAXS_data/2022-02 $ (/share1/bluesky/conda_envs/bluesky_2022_1) usaxs@usaxscontrol .../USAXS_data/2022-02 $ (/share1/bluesky/conda_envs/bluesky_2022_1) usaxs@usaxscontrol .../USAXS_data/2022-02 $ 
/home/beams/USAXS/bin/blueskyUSAXS: line 16: 37440 Aborted                 (core dumped) ipython ${IPYTHON_OPTIONS}
(/share1/bluesky/conda_envs/bluesky_2022_1) usaxs@usaxscontrol .../USAXS_data/2022-02 $ (/share1/bluesky/conda_envs/bluesky_2022_1) usaxs@usaxscontrol .../USAXS_data/2022-02 $ (/share1/bluesky/conda_envs/bluesky_2022_1) usaxs@usaxscontrol .../USAXS_data/2022-02 $ (/share1/bluesky/conda_envs/bluesky_2022_1) usaxs@usaxscontrol .../USAXS_data/2022-02 $ (/share1/bluesky/conda_envs/bluesky_2022_1) usaxs@usaxscontrol .../USAXS_data/2022-02 $ fatal: Not a git repository (or any parent up to mount point /share1)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
(/share1/bluesky/conda_envs/bluesky_2022_1) usaxs@usaxscontrol .../USAXS_data/2022-02 $ (/share1/bluesky/conda_envs/bluesky_2022_1) usaxs@usaxscontrol .../USAXS_data/2022-02 $ (/share1/bluesky/conda_envs/bluesky_2022_1) usaxs@usaxscontrol .../USAXS_data/2022-02 $ (/share1/bluesky/conda_envs/bluesky_2022_1) usaxs@usaxscontrol .../USAXS_data/2022-02

prjemian commented 2 years ago

What software version? The error came from usaxsserver so this might not be the WAXS detector, but rather some IOC running on usaxsserver.

prjemian commented 2 years ago

Asked for EPICS help from Andrew Johnson, this reply:

On 2/1/22 4:15 PM, Pete Jemian wrote:

While using a Pilatus 200kw detector (for WAXS) with Bluesky today, the python application crashed with a core dump and this message to the Python process:

epicsThread: Unexpected C++ exception "epicsTime_localtime failed" with type "St11logic_error" in thread "CAC-TCP-send"

Do you understand this? Not off the top of my head, but after doing some research I think I can explain it...

The "full" report is on GitHub: https://github.com/APS-USAXS/ipython-usaxs/issues/546

So the part of the message before the exception in the "full report" is a notification printed by the CA client library that a server has disconnected. The client library uses exceptions internally and it evidently generated one at line 1237 in the cac.cpp source file when that disconnect happened, and the code which caught that exception was in the middle of printing out the details, but while trying to print the current date/time an error was returned by the glibc routine localtime_r() which the epicsTime_localtime() routine calls. This actually occurred twice, the EPICS last-chance exception handler tries to print the current date/time and it also failed, resulting in the "" at the end of the exception message.

My guess is that the machine running this CA client program had a problem with its daylight savings timezone settings, or with its idea of the current wall-clock time, since the localtime_r() routine converts a timestamp into its individual parts for the local timezone and relies on the timezone database to know how to do that.

I'm confident that EPICS wasn't the original cause of whatever problem was occurring, this was just a symptom of something else happening in the OS. There is probably still be a bug in libCa, the CA library getting an error from an OS routine shouldn't cause the application to get terminated like it did, but restarting the machine would probably have been needed anyway to recover.

HTH,

Andrew

-- Complexity comes for free, Simplicity you have to work for.

anjohnson commented 2 years ago

I filed a bug report against EPICS Base, but this isn't going to be a high priority to get fixed, and as I said above I think your main problem was elsewhere, the EPICS crash was just a symptom caused by other issues.

prjemian commented 2 years ago

Can you check if the Pilatus computer has a battery inside it? Might need to be replaced. It is old enough. Might be a CR2032. Could this be a root cause of abnormal behavior of that computer?

jilavsky commented 2 years ago

BS died again and dumped the core, it really should not be this picky. BS is NOT doing anything ! It has not been used at all since we finished running on Monday morning.
We powered down Pilatus300kw computer and removed it from hutch for repairs. Soft ioc disappeared.

In [37]: CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "s9-pilatus300k.xray.aps.anl.gov:5064"
    Source File: ../cac.cpp line 1237
epicsThread: Unexpected C++ exception "epicsTime_localtime failed" with type "St11logic_error" in thread "CAC-TCP-send" at <UKN DATE>
terminate called after throwing an instance of 'std::logic_error'
  what():  epicsTime_localtime failed
/home/beams/USAXS/bin/blueskyUSAXS: line 16: 11527 Aborted                 (core dumped) ipython ${IPYTHON_OPTIONS}
usaxs@usaxscontrol .../USAXS_data/2022-02 $

Again, BS is NOT doing anything, users are running spec all week long. Yet for some reason BS finds out soft ioc is removed and that causes core dump? Is BS using time from pIlatus 300kw computer?

prjemian commented 2 years ago

Keep the core file. It could help diagnose.

On Sun, Feb 13, 2022, 8:54 PM Jan Ilavsky @.***> wrote:

BS died again and dumped the core, it really should not be this picky. BS is NOT doing anything ! It has not been used at all since we finished running on Monday morning. We powered down Pilatus300kw computer and removed it from hutch for repairs. Soft ioc disappeared.

In [37]: CA.Client.Exception............................................... Warning: "Virtual circuit disconnect" Context: "s9-pilatus300k.xray.aps.anl.gov:5064" Source File: ../cac.cpp line 1237 epicsThread: Unexpected C++ exception "epicsTime_localtime failed" with type "St11logic_error" in thread "CAC-TCP-send" at terminate called after throwing an instance of 'std::logic_error' what(): epicsTime_localtime failed /home/beams/USAXS/bin/blueskyUSAXS: line 16: 11527 Aborted (core dumped) ipython ${IPYTHON_OPTIONS} @.*** .../USAXS_data/2022-02 $

Again, BS is NOT doing anything, users are running spec all week long. Yet for some reason it finds out soft ioc is removed and that causes core dump? Is BS using time from pIlatus 300kw computer? Why does ti care that the sofioc is not there? That is common that soft ioc are dead.

— Reply to this email directly, view it on GitHub https://github.com/APS-USAXS/ipython-usaxs/issues/546#issuecomment-1038572544, or unsubscribe https://github.com/notifications/unsubscribe-auth/AARMUMCYKB5WCQDTBJUHST3U3BVGJANCNFSM5NJ3XA6A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were assigned.Message ID: @.***>

jilavsky commented 2 years ago

Observation: I started BS in the same terminal again and all seemed to run... EXCEPT, when I tried to collect SAXS/WAXS data, folders and locations which were suppose to be created by newUser were wrong. Instead to 02_15_Rabe/... data were saved on 2022-02_saxs and _waxs. This was AFTER I restarted BS, so something in environment is hanging and corruption future BS sessions. I created new terminal and started BS there and all was OK.

jilavsky commented 2 years ago

BTW: do not see core in current running directory, was it dumped elsewhere?

prjemian commented 2 years ago

MIA:

bash-4.2$ pwd
/share1/USAXS_data
bash-4.2$ ll -R | grep core
-rw------- 1 usaxs 3048574976 Jan 28  2021 core

prjemian commented 2 years ago

There is no core file for this situation.

prjemian commented 2 years ago

here's the log file from that session:

# IPython log file

# Tue, 15 Feb 2022 08:09:44
newUser("Newbloom")
#[Out]# '/share1/USAXS_data/2022-02/02_15_Newbloom'
# Tue, 15 Feb 2022 08:09:55
newUser("Rabe")
#[Out]# '/share1/USAXS_data/2022-02/02_15_Rabe'
# Tue, 15 Feb 2022 08:10:18
RE(preUSAXStune())
#[Out]# ('4ffba633-8887-4b27-aa21-3f8ba32eca46',
#[Out]#  '9f7303c7-acb4-4770-8327-f2e94173321c',
#[Out]#  '1f940ead-5811-41b1-bfe7-f2040e6733c9')
# Tue, 15 Feb 2022 08:11:02
RE(preUSAXStune())
#[Out]# ('2f78eebd-8086-48d4-9f08-78e25e281f7b',
#[Out]#  '66e549a1-2e74-41c0-9e8a-949b67de8ef9',
#[Out]#  '4f5d7109-1f1c-4fc3-ab3e-ea1c1d6ca41b')
# Tue, 15 Feb 2022 08:25:54
RE(preUSAXStune())
#[Out]# ('ca064747-b4e3-4d1b-b0f8-9554d652f505',
#[Out]#  'e1df2a2a-fcf9-4711-bdf1-735f250c7543',
#[Out]#  '4024a3f9-9ac0-4bff-9798-5c512ed74a92')
# Tue, 15 Feb 2022 08:34:37
RE(mode_Radiography())
#[Out]# ()
# Tue, 15 Feb 2022 08:39:12
RE(run_command_file("usaxs.mac"))
#[Out]# ('aa081c6f-5241-4d02-b192-462978faab28',
#[Out]#  'e367e787-d61c-4ccb-b2db-b6190c1ea1ef',
#[Out]#  'd5465074-46f3-493e-abac-854878e520da',
#[Out]#  '7c74970c-0ffd-43ce-9c1e-f81e7257c27b',
#[Out]#  'cc923d9c-5c25-474b-b05a-daf96bdbb8e5',
#[Out]#  '9cffd79e-2ec9-4421-9812-b7cca1d49924')
# Tue, 15 Feb 2022 08:42:35
newUser("Rabe")
#[Out]# '/share1/USAXS_data/2022-02/02_15_Rabe'
# Tue, 15 Feb 2022 08:43:04
RE(run_command_file("usaxs.mac"))
#[Out]# ('efab8a7d-e7d9-4015-87c4-0d733f3eb924',)
# Tue, 15 Feb 2022 08:43:42
exit()

New session starts similarly:

bash-4.2$ more ipython_console.log
# IPython log file

# Tue, 15 Feb 2022 08:44:48
newUser("Rabe")
#[Out]# '/share1/USAXS_data/2022-02/02_15_Rabe'
# Tue, 15 Feb 2022 08:44:53
RE(run_command_file("usaxs.mac"))
#[Out]# ('498e3b82-435b-4032-9686-2dd031dc8a27',
#[Out]#  '62267289-1d5f-4ac0-b332-ce4cdf32aa33',
...

prjemian commented 2 years ago

Testing the simplest thing first: Does Bluesky crash if area detector IOC is stopped?

Start with a running instance of ADSimDetector, then start bluesky and connect to it.
Acquire image from non-bluesky GUI
Acquire image from bluesky

In [1]: RE(bp.count([adsimdet], md=dict(purpose="IOC testing")))

Transient Scan ID: 7267     Time: 2022-02-15 12:47:55
Persistent Unique Scan ID: '1c82aa20-4e1d-4497-8df5-acae255b3065'
New stream: 'baseline'
New stream: 'primary'
+-----------+------------+
|   seq_num |       time |
+-----------+------------+
|         1 | 12:47:56.3 |
+-----------+------------+
generator count ['1c82aa20'] (scan num: 7267)
Out[1]: ('1c82aa20-4e1d-4497-8df5-acae255b3065',)

stop the area detector IOC and see response in bluesky console:

In [2]: CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "zap.144.168.192.in-addr.arpa:36319"
    Source File: ../cac.cpp line 1237
    Current Time: Tue Feb 15 2022 12:49:10.983537957
..................................................................
In [2]: 

In [2]: adsimdet.connected
Out[2]: False

restart IOC, wait a minute, and see response in bluesky console:

In [4]: adsimdet.connected
Out[4]: True

Summary

Bluesky is robust to detect if IOC is not available, then later becomes available again. Reconnects automatically.

Problem is somewhere else than the code that was just tested.

prjemian commented 2 years ago

Again, BS is NOT doing anything, users are running spec all week long. Yet for some reason BS finds out soft ioc is removed and that causes core dump?

No.

Is BS using time from pIlatus 300kw computer?

No. BS uses the datetime from the BS computer (usaxs).

But it is obvious that BS was a casualty (it crashed) when the pilatus computer had a problem with time. The question here is: How did that happen and how can we keep it from happening again? Ever.

tacaswell commented 2 years ago

Which version of pyepics/epics-base do you have in this environment?

Can you run with faulthandler (https://docs.python.org/3/library/faulthandler.html)?

If uncatchable (python) exceptions make it to the Qt eventloop PyQt will take out the process (because something is very wrong).

Can you try pinning back to the previous version of pyepics (before the NOLIBCA code path was removed and hence it is correctly using epics-base to get the epics libraries).

prjemian commented 2 years ago

usaxs@usaxscontrol /share1/USAXS_data $ become_bluesky
(/share1/bluesky/conda_envs/bluesky_2022_1) usaxs@usaxscontrol /share1/USAXS_data $ conda list epics
# packages in environment at /share1/bluesky/conda_envs/bluesky_2022_1:
#
# Name                    Version                   Build  Channel
epics-base                7.0.5.0              h295b1ef_0    conda-forge
pyepics                   3.5.0            py39hf3d152e_1    conda-forge

prjemian commented 2 years ago

Which version is the NOLIBCA one? Perhaps https://github.com/pyepics/pyepics/releases/tag/3.4.2?

prjemian commented 2 years ago

Will try with PyEpics 3.4.3

tacaswell commented 2 years ago

https://github.com/pyepics/pyepics/commit/5d36b265890b4fb698cdc3b2bccc66a7e4daffd0 is the unreviewed (pushed directly to master) commit where Matt removed NOLIBCA (functionality that we rely on) with no warning, discussion, documentation, or API change note.

The first release it is included in in pyepics 3.5.0.

prjemian commented 2 years ago

Context from bluesky Nikea channel on Slack:

prjemian commented 2 years ago

@newville: Do you have any suggestions how to defend our code against being crashed by an IOC with such a problem as triggered this issue?

newville commented 2 years ago

@prjemian @anjohnson @jilavsky @tacaswell Hm, sorry for the trouble. I don't know why a "Virtual circuit disconnect" would crash Python. I think that should not happen. PVs disconnect all the time (especially from detectors!).

I kind of doubt that any changes in pyepics 3.5.0 vs 3.4.3 would cause that, but I would be interested to hear if it did!

You could try setting PYEPICS_LIBCA to load a different version of libca. The version used by default comes from epics base V3.16.2 and was built with CentOS 6.9 with gcc 4.4.7, glibc 2.12. I can definitely believe it is time to update to a version built from epics base v7 and CentoOS 7, though I would want to recheck the oldest supported Debian versions and make sure that CentOS 7 was actually "as old as needed".

FWIW, NOLIBCA was an install-time option (which is a bit of an oddity for python packages) that did not install any versions of libca.so. That left the exact location of the version of libca.so loaded at run time hard to predict. Setting PYEPICS_LIBCA is always clear, and the extra install shared libraries are not going to be accidentally found and loaded by any other applications. So, if you want to not use the supplied version of libca.so, you sort of need to set PYEPICS_LIBCA, in which case NOLIBCA didn't really do much good anyway. That is for any problem solved by using NOLIBCA, setting PYEPICS_LIBCA will also solve that problem, and in a more explicit manner.

anjohnson commented 2 years ago

@newville an email of mine that was quoted earlier in this thread said:

So the part of the message before the exception in the "full report" is a notification printed by the CA client library that a server has disconnected. The client library uses exceptions internally and it evidently generated one at line 1237 in the cac.cpp source file when that disconnect happened, and the code which caught that exception was in the middle of printing out the details, but while trying to print the current date/time an error was returned by the glibc routine localtime_r() which the epicsTime_localtime() routine calls. This actually occurred twice, the EPICS last-chance exception handler tries to print the current date/time and it also failed, resulting in the <UKN DATE> at the end of the exception message.

My guess is that the machine running this CA client program had a problem with its daylight savings timezone settings, or with its idea of the current wall-clock time, since the localtime_r() routine converts a timestamp into its individual parts for the local timezone and relies on the timezone database to know how to do that.

Thus I agree with klauer that there is a bug in the libca code that causes this uncaught C++ exception that kills the process, but it can only happen if the OS has a problem such that glibc's localtime_r() routine is unable to convert a timestamp into the appropriate broken-down time and instead returns an error. It might be that the process is configured for a locale which is incomplete — is it set for an unusual (non-US) locale, language or timezone? Look for the environment variable LANG or any env var's whose names start LC_ especially LC_TIME. A $HOME/.locale directory could also be a source of the problem.

prjemian commented 2 years ago

@newville, @anjohnson Thanks to both of you. We can investigate the Pilatus computer for these, and also for why we cannot find the core file.

newville commented 2 years ago

@anjohnson Thanks for the explanation!

prjemian commented 2 years ago

You could try setting PYEPICS_LIBCA to load a different version of libca. The version used by default comes from epics base V3.16.2 and was built with CentOS 6.9 with gcc 4.4.7, glibc 2.12. I can definitely believe it is time to update to a version built from epics base v7 and CentoOS 7, though I would want to recheck the oldest supported Debian versions and make sure that CentOS 7 was actually "as old as needed".

@newville: We've already been using a different version:

(/share1/bluesky/conda_envs/bluesky_2022_1) bash-4.2$ ipython
Python 3.9.7 (default, Sep 16 2021, 13:09:58) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.29.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import epics

In [2]: epics.ca.find_libca()
Out[2]: '/share1/bluesky/conda_envs/bluesky_2022_1/epics/lib/linux-x86_64/libca.so'

In [3]: epics.ca.version()
Out[3]: '4.13'

In [5]: os.environ["PYEPICS_LIBCA"]
Out[5]: '/share1/bluesky/conda_envs/bluesky_2022_1/epics/lib/linux-x86_64/libca.so'

The version of libca in use was installed from the epics-base package:

(/share1/bluesky/conda_envs/bluesky_2022_1) bash-4.2$ conda list epics
# packages in environment at /share1/bluesky/conda_envs/bluesky_2022_1:
#
# Name                    Version                   Build  Channel
epics-base                7.0.5.0              h295b1ef_0    conda-forge
pyepics                   3.5.0            py39hf3d152e_1    conda-forge

prjemian commented 2 years ago

So, at issue here is a problem started on the Pilatus host workstation which causes an uncaught C++ exception that kills the libca process (and the core dump is missing). In turn, this causes a process (on a different workstation) running PyEpics to fail which then causes ipython to fail.

newville commented 2 years ago

@prjemian did you come to any conclusion about what is happening here? Is this related to something with timezone or locale setting? That is, I couldn't tell if this was resolved or is still happening, and I'm wondering if there is something we can do in pyepics to make those crashes not happen.

prjemian commented 2 years ago

@newville: No conclusions yet. Since the client code we're using is all Python (except for libca) on a different workstation, it would be a major benefit if the Python code was more robust against this type of library problem. Is that possible?

newville commented 2 years ago

@prjemian I would sure hope so! Crashing a python CA client (or any 'host process') because of a problem with CA seems like a very serious problem to me, especially for something as common as a PV disconnect or IOC going down. I would call that "unacceptable". To be clear, it is also very unusual - like, I never see it, and detector iocs restart all the time. I don't think I've ever heard of this happening before (though I do sometimes see that clients for areaDetectors need a restart if the image size can be determined: probably a coding error for not handling all the metadata well - and that's not a crash of the process).

If libca is causing fatal errors internally, I don't know what checks can be added in Python code. But if there is something we could do, we definitely should do that. There was a hint that this might be related to timezone or reporting of the error messages from libca. Would it work to force the locale (maybe to "C") and/or the timezone?

prjemian commented 2 years ago

Today is the first day I can actually work with the detector's computer, now that users are done using it for operations. I'll look at that soon today.

anjohnson commented 2 years ago

Hi Pete, I don't understand how an uncaught C++ exception (from an internal libca thread, which will cause that process using libca to call std::terminate() and hence will abort that process) can cause another process running PyEpics on another workstation to crash. How are these two processes communicating? If PyEpics is involved on the second workstation doesn't that imply the first must be running a CA server which PyEpics has connected to?

Can you relate your description to the error message at the top of this thread please? That shows a CA connection (I assume libca there is running on the machine usaxscontrol ) to a PV on an IOC (or some other CA server) running on the machine usaxsserver. The time or locale problems that I think are triggering this would be on the usaxscontrol machine if I've read that properly.

prjemian commented 2 years ago

On this issue, we have two reports of the same type of failure (original message at the top, and this comment). A different pilatus detector workstation in each case (s9pil.xray.aps.anl.gov and s9-pilatus300k.xray.aps.anl.gov). Each of these runs its own EPICS area detector IOC. A single workstation, usaxscontrol.xray.aps.anl.gov runs a Python process with PyEpics and Bluesky libraries (amongst others). This Python process is what we call Bluesky.

In the second report, @jilavsky reports that the Pilatus computer was not powered on at the time of the process crash on usaxscontrol. (That pinpoints the problem to usaxscontrol, not the pilatus computers, right?)

Both crashes reported this text:

In [1]: CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "usaxsserver.xray.aps.anl.gov:41882"
    Source File: ../cac.cpp line 1237
epicsThread: Unexpected C++ exception "epicsTime_localtime failed" with type "St11logic_error" in thread "CAC-TCP-send" at <UKN DATE>
terminate called after throwing an instance of 'std::logic_error'
  what():  epicsTime_localtime failed

Are all these lines of output from the same code? Not finding Unexpected C++ exception (or code that formats that text) in PyEpics, bluesky, or ophyd.

prjemian commented 2 years ago

Locale on usaxsserver:

usaxs@usaxscontrol $ hostname
usaxscontrol.xray.aps.anl.gov
usaxs@usaxscontrol $ uname -a
Linux usaxscontrol.xray.aps.anl.gov 3.10.0-1160.49.1.el7.x86_64 #1 SMP Tue Nov 9 16:09:48 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
usaxs@usaxscontrol $ grep "model name" /proc/cpuinfo
model name  : Intel(R) Xeon(R) W-2123 CPU @ 3.60GHz
usaxs@usaxscontrol $ ll $HOME/.locale
ls: cannot access /home/beams/USAXS/.locale: No such file or directory
usaxs@usaxscontrol $ echo $LANG
en_US.UTF-8

prjemian commented 2 years ago

clarify the workstations and IOCs:

detector computers, each with an IOC s9pil (SAXS Pilatus, IOC=usaxs_pilatus1:) s9-pilatus300k (WAXS Pilatus, IOC=usaxs_pilatus2:)

client computer: running Python with PyEpics (which loads libca+PyEpics+ophyd+bluesky), ophyd, and bluesky

prjemian commented 2 years ago

@anjohnson : You wrote:

My guess is that the machine running this CA client program had a problem with its daylight savings timezone settings, or with its idea of the current wall-clock time, since the localtime_r() routine converts a timestamp into its individual parts for the local timezone and relies on the timezone database to know how to do that.

All of this would be on the client computer, usaxscontrol (since in the second case, the Pilatus workstation was powered off). How to test your guess? Such a problem would be easily discovered, it seems.

anjohnson commented 2 years ago

That error text definitely comes from libca, presumably running in the BlueSky process running on usaxscontrol. The "Virtual circuit disconnect" and the context value mean that a CA server (probably an IOC) on usaxsserver which it was connected to has died or stopped responding. However the error that caused the crash occurred on usaxscontrol. When the internal CA thread "CAC-TCP-send" called EpicsTime::strftime() the routine epicsTime_localtime() in libCom it got back an error, which the thread wasn't expecting to get and as a result it died.

Do you know what was happening that could have caused the crash? How heavily loaded is that machine?

I can create a version of libCa that is protected against that particular error, I'll send you a patch file for it shortly, but I don't recommend you just install that and stop there, you could see other problems if this is caused by some interaction between PyEpics, BlueSky and libCa.

Can you try to find out what PVs BlueSky was connecting to and see if you can replicate the problem with a camonitor running on usaxscontrol and connected to the same PVs (or to some of them at least) at the same time. If you can replicate the problem in another CA client that points to it being an EPICS problem; if not it might be an issue with how BlueSky and PyEpics are calling the CA library, or something else going on in BlueSky which is affecting the CA threads.

anjohnson commented 2 years ago

Here's the patch to EPICS 7, although the same changes should apply to the equivalent file in most earlier releases.

diff --git a/modules/ca/src/client/ca_client_context.cpp b/modules/ca/src/client/ca_client_context.cpp
index 671d1210b..1e830fd08 100644
--- a/modules/ca/src/client/ca_client_context.cpp
+++ b/modules/ca/src/client/ca_client_context.cpp
@@ -392,9 +392,19 @@ void ca_client_context :: vSignal (
     }

     epicsTime current = epicsTime::getCurrent ();
-    char date[64];
-    current.strftime ( date, sizeof ( date ), "%a %b %d %Y %H:%M:%S.%f");
-    this->printFormated ( "    Current Time: %s\n", date );
+    try {
+        char date[64];
+        current.strftime ( date, sizeof ( date ), "%a %b %d %Y %H:%M:%S.%f");
+        this->printFormated ( "    Current Time: %s\n", date );
+    }
+    catch ( std::exception & except ) {
+        errlogPrintf (
+            "CA client library thread \"%s\" caught C++ exception \"%s\"\n",
+            epicsThreadGetNameSelf (), except.what () );
+        epicsTimeStamp now = current;
+        this->printFormated ( "    Current Time: %u.%u\n",
+                now.secPastEpoch, now.nsec );
+    }

     /*
      *  Terminate execution if unsuccessful

prjemian commented 2 years ago

@anjohnson Thanks! I'll apply that to a new download of wget https://epics.anl.gov/download/base/base-7.0.6.1.tar.gz on a local build.

prjemian commented 2 years ago

Installed the patched libca. Before the first acquisition with the saxs_det:

In [3]: epics.ca.find_libca()
Out[3]: '/local/epics/base-7.0.6.1/lib/linux-x86_64/libca.so'

In [4]: len(epics.ca._cache[epics.ca.current_context()])
Out[4]: 2804

In [6]: [k for k in epics.ca._cache[epics.ca.current_context()].keys() if k.startswith(saxs_det.prefix)]
Out[6]: 
['usaxs_pilatus1:cam1:Acquire_RBV',
 'usaxs_pilatus1:cam1:Acquire',
 'usaxs_pilatus1:HDF1:PluginType_RBV']

prjemian commented 2 years ago

After acquisition with bluesky:

In [7]: RE(bp.count([saxs_det]))

Transient Scan ID: 670     Time: 2022-02-22 15:48:33
Persistent Unique Scan ID: '750fb127-103a-4a91-a749-3cee5a779a18'
New stream: 'baseline'
New stream: 'aps_current_monitor'
New stream: 'primary'                                                                                                                       
+-----------+------------+
|   seq_num |       time |
+-----------+------------+
|         1 | 15:48:38.3 |
+-----------+------------+
generator count ['750fb127'] (scan num: 670)
Out[7]: ('750fb127-103a-4a91-a749-3cee5a779a18',)

In [8]: [k for k in epics.ca._cache[epics.ca.current_context()].keys() if k.startswith(saxs_det.prefix)]
Out[8]: 
['usaxs_pilatus1:cam1:Acquire_RBV',
 'usaxs_pilatus1:cam1:Acquire',
 'usaxs_pilatus1:HDF1:PluginType_RBV',
 'usaxs_pilatus1:cam1:AcquirePeriod_RBV',
 'usaxs_pilatus1:cam1:AcquirePeriod',
 'usaxs_pilatus1:cam1:AcquireTime_RBV',
 'usaxs_pilatus1:cam1:AcquireTime',
 'usaxs_pilatus1:cam1:ImageMode_RBV',
 'usaxs_pilatus1:cam1:ImageMode',
 'usaxs_pilatus1:cam1:Manufacturer_RBV',
 'usaxs_pilatus1:cam1:Model_RBV',
 'usaxs_pilatus1:cam1:NumExposures_RBV',
 'usaxs_pilatus1:cam1:NumExposures',
 'usaxs_pilatus1:cam1:NumImages_RBV',
 'usaxs_pilatus1:cam1:NumImages',
 'usaxs_pilatus1:cam1:TriggerMode_RBV',
 'usaxs_pilatus1:cam1:TriggerMode',
 'usaxs_pilatus1:cam1:PortName_RBV',
 'usaxs_pilatus1:image1:PluginType_RBV',
 'usaxs_pilatus1:image1:EnableCallbacks_RBV',
 'usaxs_pilatus1:image1:EnableCallbacks',
 'usaxs_pilatus1:image1:BlockingCallbacks_RBV',
 'usaxs_pilatus1:image1:BlockingCallbacks',
 'usaxs_pilatus1:cam1:ArrayCallbacks_RBV',
 'usaxs_pilatus1:cam1:ArrayCallbacks',
 'usaxs_pilatus1:image1:PortName_RBV',
 'usaxs_pilatus1:image1:NDArrayPort_RBV',
 'usaxs_pilatus1:image1:NDArrayPort',
 'usaxs_pilatus1:HDF1:ArraySize2_RBV',
 'usaxs_pilatus1:HDF1:ArraySize1_RBV',
 'usaxs_pilatus1:HDF1:ArraySize0_RBV',
 'usaxs_pilatus1:HDF1:FileName_RBV',
 'usaxs_pilatus1:HDF1:FileName',
 'usaxs_pilatus1:HDF1:FilePath_RBV',
 'usaxs_pilatus1:HDF1:FilePath',
 'usaxs_pilatus1:HDF1:Capture_RBV',
 'usaxs_pilatus1:HDF1:Capture',
 'usaxs_pilatus1:HDF1:FileNumber_RBV',
 'usaxs_pilatus1:HDF1:FileNumber',
 'usaxs_pilatus1:HDF1:EnableCallbacks_RBV',
 'usaxs_pilatus1:HDF1:EnableCallbacks',
 'usaxs_pilatus1:HDF1:AutoIncrement_RBV',
 'usaxs_pilatus1:HDF1:AutoIncrement',
 'usaxs_pilatus1:HDF1:ArrayCounter_RBV',
 'usaxs_pilatus1:HDF1:ArrayCounter',
 'usaxs_pilatus1:HDF1:AutoSave_RBV',
 'usaxs_pilatus1:HDF1:AutoSave',
 'usaxs_pilatus1:HDF1:NumCapture_RBV',
 'usaxs_pilatus1:HDF1:NumCapture',
 'usaxs_pilatus1:HDF1:FileTemplate_RBV',
 'usaxs_pilatus1:HDF1:FileTemplate',
 'usaxs_pilatus1:HDF1:FileWriteMode_RBV',
 'usaxs_pilatus1:HDF1:FileWriteMode',
 'usaxs_pilatus1:HDF1:BlockingCallbacks_RBV',
 'usaxs_pilatus1:HDF1:BlockingCallbacks',
 'usaxs_pilatus1:HDF1:FilePathExists_RBV',
 'usaxs_pilatus1:HDF1:PortName_RBV',
 'usaxs_pilatus1:HDF1:NDArrayPort_RBV',
 'usaxs_pilatus1:HDF1:NDArrayPort',
 'usaxs_pilatus1:cam1:ArrayCounter_RBV',
 'usaxs_pilatus1:cam1:ArrayCounter',
 'usaxs_pilatus1:cam1:ArraySizeY_RBV',
 'usaxs_pilatus1:cam1:ArraySizeX_RBV']

Here, more PVs finish their lazy connection once actual data acquisition starts.

prjemian commented 2 years ago

These are the saxs_det attributes that are set during device configuration (staging) before data acquisition starts.

In [9]: saxs_det.stage_sigs
Out[9]: OrderedDict([('cam.acquire', 0), ('cam.image_mode', 1)])

In [10]: saxs_det.cam.stage_sigs
Out[10]: OrderedDict()

In [11]: saxs_det.hdf1.stage_sigs
Out[11]: 
OrderedDict([('enable', 1),
             ('auto_increment', 'Yes'),
             ('array_counter', 0),
             ('auto_save', 'Yes'),
             ('num_capture', 0),
             ('file_template', '%s%s_%4.4d.h5'),
             ('file_write_mode', 'Stream'),
             ('capture', 1),
             ('blocking_callbacks', 'Yes'),
             ('parent.cam.array_callbacks', 1)])

They are reset to prior values once data acquisition is complete

prjemian commented 2 years ago

Since one of the failures observed happened while the IOC had been off for some time, I suspect the failure somewhere in the bluesky:PyEpics stack, with code that did not handle the virtual circuit disconnect. Not quite sure where yet but camonitor on these PVs would help only if the IOC was actually involved at the time of the crash.

prjemian commented 2 years ago

To test (following the pattern of use),

start the detector IOC
start a bluesky session
acquire some number of images from saxs_det
stop the detector IOC
wait a few minutes (such that any hanging PV get() requests in bluesky will timeout)
begin another bluesky data acquisition on some device that is not saxs_det

This is where the error is expected to occur, if it occurs at all

anjohnson commented 2 years ago

There is a comment in the CA library code for the thread which crashed, saying that it might get stuck in a loop there if the IOC is unresponsive (or dead) but the library hasn't realized it yet. That might happen if PyEpics/BS happened to go for a long time without making any libCa calls, although I hope PyEpics is coded to have preemptive callbacks enabled, in which case I don't think that would be likely (but I'm not the author of the comment).

@newville can you answer the question about the use of preemptive callbacks in PyEpics?

prjemian commented 2 years ago

note taking ... recorded a few images from saxs_det, waited a few minutes and turned off IOC. As expected, thismessage appears on bluesky's console:

In [13]: CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "s9pil.xray.aps.anl.gov:5064"
    Source File: ../cac.cpp line 1237
    Current Time: Tue Feb 22 2022 16:45:49.358661456
..................................................................

nothing unusual so far.

In [14]: saxs_det.connected
Out[14]: False

anjohnson commented 2 years ago

If nothing else, the "Current Time" will appear as a pair of integers separated by a '.' when it fails as before. That might be followed by other failures though, not sure.

prjemian commented 2 years ago

No error observed now (with IOC still off):

In [17]: def a_plan():
    ...:     reading = aps.current.read()
    ...:     logger.info("APS storage ring current: %s", reading)
    ...:     yield from bps.null()
    ...: 

In [18]: RE(a_plan())
I Tue-17:04:10 - APS storage ring current: {'aps_current': {'value': 101.97164667028201, 'timestamp': 1645571050.157049}}
Out[18]: ()

newville commented 2 years ago

@anjohnson @prjemian Yes, pre-emptive callbacks are enabled by default. It can be turned off, but I am almost certain BlueSky would use pre-emptive callbacks.

I would doubt very much if anything BlueSky was doing could cause such crashes. I can believe that there is something in pyepics causing this, but I would be surprised that we haven't run into this on other machines. For sure, python scripts running on Linux boxes that talk to area detectors and restarting IOCs seems completely normal to me.

Maybe an isolated script that took BlueSky out of the equation would help? Like if you just took that "cache list" of PVs, and did a basic

   import epics
   import time 
   pvnames = ['usaxs_pilatus1:cam1:Acquire_RBV',  'usaxs_pilatus1:cam1:Acquire',...]

   def monitor_callback(pvname, value, **kws): 
        print('pv value updated: ', pvname, value)

   pvlist = [epics.get_pv(name, callback=monitor_callback) for name in pvnames]

   while True:
       time.sleep(0.01)

When I do that for an Eiger detector (that I am willing to restart today) on a Centos 7 box that is running the IOC, the script runs fine, reporting changes in the ArrayCounter, for example. And, if I restart that IOC, the script reports the disconnection and then reconnects and starts reporting changes again.

prjemian commented 2 years ago

I confirm the same behavior (in ipython session, no bluesky):

...
update: pvname = 'usaxs_pilatus1:cam1:ArrayCounter_RBV'   value = 0
update: pvname = 'usaxs_pilatus1:cam1:ArrayCounter'   value = 0
update: pvname = 'usaxs_pilatus1:cam1:ArraySizeY_RBV'   value = 195
update: pvname = 'usaxs_pilatus1:cam1:ArraySizeX_RBV'   value = 487

then turn off IOC:

In [2]: while True:
   ...:     time.sleep(0.01)
   ...: 
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "s9pil.xray.aps.anl.gov:5064"
    Source File: ../cac.cpp line 1237
    Current Time: Tue Feb 22 2022 17:40:56.497469931
..................................................................

then start IOC again:

update: pvname = 'usaxs_pilatus1:HDF1:FileNumber_RBV'   value = 103
update: pvname = 'usaxs_pilatus1:HDF1:FileNumber'   value = 103
update: pvname = 'usaxs_pilatus1:HDF1:EnableCallbacks_RBV'   value = 1
...

prjemian commented 2 years ago

This may (or may not) have been provoked by WAXS detector needing restart. SOmehow Dectirs software died.

@jilavsky What happened with the WAXS detector? Can we force that to happen again?

APS-USAXS / usaxs-bluesky-ended-2023

BS died and core dumped when WAXS detector IOC crashed #546

Summary