I need more information as I'm unable to replicate either a stall or a crash.
Here are the thread backtraces from the core file: callstacks.txt
Are you also able to reproduce using a debug build? (eg. linux-x86_64-debug).
fyi, you could run something like the following, where python3 is whatever interpreter executable you use.
gdb --args python3 ./bin/linux-x86_64-debug/pvagw ...
Which is equivalent to eg.
export PYTHONPATH=${TOP}/python3.9/linux-x86_64-debug
gdb --args python3 -m p4p.gw ...
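Once a crash is reproduced under gdb, something like the following should capture the per-thread backtraces (standard gdb commands, shown here only as a sketch):
(gdb) run
... wait for the crash ...
(gdb) thread apply all backtrace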
How do I create a debug build? I only see linux-x86_64
cat <<EOF >> epics-base/configure/CONFIG_SITE.local
CROSS_COMPILER_TARGET_ARCHS+=\$(EPICS_HOST_ARCH)-debug
EOF
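Then rebuild Base and the modules so the debug target architecture is produced (a sketch; the paths assume sibling checkouts, adjust to your layout):
make -C epics-base
make -C pvxs
make -C p4p
After that, bin/linux-x86_64-debug and python3.9/linux-x86_64-debug should appear alongside the regular linux-x86_64 directories.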
Yes I'm able to reproduce on a debug build. I attached the callstacks: callstack-debug.txt
Ok, there are two interesting threads.
Thread 4 (Thread 0x7fd089f2d700 (LWP 494001)):
#0 0x00007fd0a6183464 in __memchr_avx2 () from /lib64/libc.so.6
#1 0x00007fd09848c778 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::find_first_of (this=0x7fd02003cf30, __s=0x7fd0989b4089 "<[-", __pos=6, __n=3) at /home/builder/ktietz/cos6/ci_cos6/ctng-compilers_1622658800915/work/.build/x86_64-conda-linux-gnu/build/build-cc-gcc-final/x86_64-conda-linux-gnu/libstdc++-v3/include/bits/basic_string.h:186
#2 0x00007fd0988eb53a in pvxs::Value::traverse (this=0x7fd089f2b180, expr=..., modify=true, dothrow=false) at ../data.cpp:772
#3 0x00007fd0988ec656 in pvxs::Value::operator[] (this=0x7fd010001970, name=...) at ../data.cpp:905
#4 0x00007fd0988ea8bd in pvxs::Value::copyIn (this=0x7fd010001970, ptr=0x7fd089f2cc60, type=pvxs::StoreType::Compound) at ../data.cpp:693
#5 0x00007fd0988e78a9 in pvxs::Value::assign (this=0x7fd010001970, o=...) at ../data.cpp:160
#6 0x00007fd098953e0c in pvxs::impl::(anonymous namespace)::ServerMonitorControl::doPost (this=0x7fd01801f6d0, val=..., maybe=false, force=false) at ../servermon.cpp:239
#7 0x00007fd08c7c1d48 in pvxs::server::MonitorControlOp::post (this=0x7fd01801f6d0, val=...) at /home/phoebus/ECHANDLER/sandbox/epics/pvxs/include/pvxs/source.h:80
#8 0x00007fd08c7b931b in p4p::onSubEvent (sub=..., pv=...) at ../pvxs_gw.cpp:574
#9 0x00007fd08c7b9af9 in p4p::GWChan::<lambda(pvxs::client::Subscription&, const pvxs::Value&)>::<lambda(pvxs::client::Subscription&)>::<lambda()>::operator()(void) const (__closure=0x7fd0228bd690) at ../pvxs_gw.cpp:684
#10 0x00007fd08c7be9cb in std::_Function_handler<void(), p4p::GWChan::onSubscribe(const std::shared_ptr<p4p::GWChan>&, std::unique_ptr<pvxs::server::MonitorSetupOp>&&)::<lambda(pvxs::client::Subscription&, const pvxs::Value&)>::<lambda(pvxs::client::Subscription&)>::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/8/bits/std_function.h:297
#11 0x00007fd08c7c68f2 in std::function<void ()>::operator()() const (this=0x7fd089f2cda0) at /usr/include/c++/8/bits/std_function.h:687
#12 0x00007fd08c7bc1e3 in p4p::GWSource::run (this=0x56088437fd40) at ../pvxs_gw.cpp:912
#13 0x00007fd09814f0dc in epicsThreadCallEntryPoint (pPvt=0x56088437fe68) at ../osi/epicsThread.cpp:95
#14 0x00007fd09815606f in start_routine (arg=0x560884443020) at ../osi/os/posix/osdThread.c:418
#15 0x00007fd0a63f714a in start_thread () from /lib64/libpthread.so.0
#16 0x00007fd0a6126dc3 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fd087b2e700 (LWP 494002)):
#0 0x00007fd098cf73c2 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7fd0203b83b0) at /usr/include/c++/8/bits/shared_ptr_base.h:155
#1 0x00007fd098cf5509 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7fd087b2d188, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr_base.h:728
#2 0x00007fd0988fc8f0 in std::__shared_ptr<int const, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7fd087b2d180, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr_base.h:1167
#3 0x00007fd0988fc90c in std::shared_ptr<int const>::~shared_ptr (this=0x7fd087b2d180, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr.h:103
#4 0x00007fd0988fc928 in pvxs::detail::sa_base<int const>::~sa_base (this=0x7fd087b2d180, __in_chrg=<optimized out>) at ../pvxs/sharedArray.h:96
#5 0x00007fd0988fc944 in pvxs::shared_array<int const, void>::~shared_array (this=0x7fd087b2d180, __in_chrg=<optimized out>) at ../pvxs/sharedArray.h:238
#6 0x00007fd0988f89a4 in pvxs::impl::to_wire<int> (buf=..., varr=...) at ../pvaproto.h:416
#7 0x00007fd0988f5a41 in pvxs::impl::to_wire_field (buf=..., desc=0x7fd02003b230, store=...) at ../dataencode.cpp:343
#8 0x00007fd0988f5357 in pvxs::impl::to_wire_field (buf=..., desc=0x7fd0200322c0, store=...) at ../dataencode.cpp:235
#9 0x00007fd0988f63aa in pvxs::impl::to_wire_valid (buf=..., val=..., mask=0x7fd018035878) at ../dataencode.cpp:445
#10 0x00007fd098953358 in pvxs::impl::(anonymous namespace)::MonitorOp::doReply (this=0x7fd018035790) at ../servermon.cpp:138
#11 0x00007fd0989587e7 in std::__invoke_impl<void, void (pvxs::impl::(anonymous namespace)::MonitorOp::*&)(), std::shared_ptr<pvxs::impl::(anonymous namespace)::MonitorOp>&> (__f=@0x7fd022014dd0: (void (pvxs::impl::(anonymous namespace)::MonitorOp::*)(pvxs::impl::(anonymous namespace)::MonitorOp * const)) 0x7fd098952e62 <pvxs::impl::(anonymous namespace)::MonitorOp::doReply()>, __t=...) at /usr/include/c++/8/bits/invoke.h:73
#12 0x00007fd0989584d9 in std::__invoke<void (pvxs::impl::(anonymous namespace)::MonitorOp::*&)(), std::shared_ptr<pvxs::impl::(anonymous namespace)::MonitorOp>&> (__fn=@0x7fd022014dd0: (void (pvxs::impl::(anonymous namespace)::MonitorOp::*)(pvxs::impl::(anonymous namespace)::MonitorOp * const)) 0x7fd098952e62 <pvxs::impl::(anonymous namespace)::MonitorOp::doReply()>) at /usr/include/c++/8/bits/invoke.h:95
#13 0x00007fd098958272 in std::_Bind<void (pvxs::impl::(anonymous namespace)::MonitorOp::*(std::shared_ptr<pvxs::impl::(anonymous namespace)::MonitorOp>))()>::__call<void, 0>(std::tuple<> &&, std::_Index_tuple<0>) (this=0x7fd022014dd0, __args=...) at /usr/include/c++/8/functional:400
#14 0x00007fd098957e4a in std::_Bind<void (pvxs::impl::(anonymous namespace)::MonitorOp::*(std::shared_ptr<pvxs::impl::(anonymous namespace)::MonitorOp>))()>::operator()<>(void) (this=0x7fd022014dd0) at /usr/include/c++/8/functional:484
#15 0x00007fd098957428 in std::_Function_handler<void(), std::_Bind<void (pvxs::impl::(anonymous namespace)::MonitorOp::*(std::shared_ptr<pvxs::impl::(anonymous namespace)::MonitorOp>))()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/8/bits/std_function.h:297
#16 0x00007fd0988c439a in std::function<void ()>::operator()() const (this=0x7fd087b2db70) at /usr/include/c++/8/bits/std_function.h:687
#17 0x00007fd09893b76e in pvxs::impl::ServerConn::bevWrite (this=0x7fd018015490) at ../serverconn.cpp:370
#18 0x00007fd0989272eb in pvxs::impl::ConnBase::bevWriteS (bev=0x7fd01802cf70, ptr=0x7fd018015490) at ../conn.cpp:265
#19 0x00007fd097ee301a in bufferevent_run_deferred_callbacks_locked () from /lib64/libevent_core-2.1.so.6
#20 0x00007fd097eed788 in event_process_active_single_queue () from /lib64/libevent_core-2.1.so.6
#21 0x00007fd097eee3b7 in event_base_loop () from /lib64/libevent_core-2.1.so.6
#22 0x00007fd09890f8b9 in pvxs::impl::evbase::Pvt::run (this=0x5608844581e0) at ../evhelper.cpp:198
#23 0x00007fd09814f0dc in epicsThreadCallEntryPoint (pPvt=0x560884458278) at ../osi/epicsThread.cpp:95
#24 0x00007fd09815606f in start_routine (arg=0x5608840eb340) at ../osi/os/posix/osdThread.c:418
#25 0x00007fd0a63f714a in start_thread () from /lib64/libpthread.so.0
#26 0x00007fd0a6126dc3 in clone () from /lib64/libc.so.6
I can see two potential areas where bugs might be found: concurrency, and PVD type handling.
To help understand the first, I'd like to know whether both of these threads are operating on the same (downstream) Channel.
Compare the local mon in frame 6 of Thread 4 with this in frame 10 of Thread 1. If these point to the same MonitorOp, then either the line numbers are wrong, or something strange has happened, as MonitorOp::lock should be held at both servermon.cpp:239 and servermon.cpp:138.
Also, in frame 10 of Thread 1 the expression ch->name will give a PV name. I'd like to have the output of pvinfo for this PV.
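A sketch of the gdb commands to pull these out (frame numbers as in the backtraces above):
(gdb) thread 4
(gdb) frame 6
(gdb) print mon
(gdb) thread 1
(gdb) frame 10
(gdb) print this
(gdb) print ch->name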
Thread 4 (Thread 0x7fd089f2d700 (LWP 494001)):
...
#6 0x00007fd098953e0c in pvxs::impl::(anonymous namespace)::ServerMonitorControl::doPost (this=0x7fd01801f6d0, val=..., maybe=false, force=false) at ../servermon.cpp:239
Thread 1 (Thread 0x7fd087b2e700 (LWP 494002)):
...
#10 0x00007fd098953358 in pvxs::impl::(anonymous namespace)::MonitorOp::doReply (this=0x7fd018035790) at ../servermon.cpp:138
The pointer values are different.
(gdb) thread 4
[Switching to thread 4 (Thread 0x7fd089f2d700 (LWP 494001))]
#0 0x00007fd0a6183464 in __memchr_avx2 () from /lib64/libc.so.6
(gdb) frame 6
#6 0x00007fd098953e0c in pvxs::impl::(anonymous namespace)::ServerMonitorControl::doPost (this=0x7fd01801f6d0, val=..., maybe=false, force=false) at ../servermon.cpp:239
239 mon->queue.back().assign(val);
(gdb) print mon
$1 = {<std::__shared_ptr<pvxs::impl::(anonymous namespace)::MonitorOp, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<pvxs::impl::(anonymous namespace)::MonitorOp, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x7fd018006610, _M_refcount = {
_M_pi = 0x7fd018006600}}, <No data fields>}
(gdb) thread 1
[Switching to thread 1 (Thread 0x7fd087b2e700 (LWP 494002))]
#0 0x00007fd098cf73c2 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7fd0203b83b0) at /usr/include/c++/8/bits/shared_ptr_base.h:155
155 _M_dispose();
(gdb) frame 10
#10 0x00007fd098953358 in pvxs::impl::(anonymous namespace)::MonitorOp::doReply (this=0x7fd018035790) at ../servermon.cpp:138
138 to_wire_valid(R, ent, &pvMask);
(gdb) print this
$2 = (pvxs::impl::(anonymous namespace)::MonitorOp * const) 0x7fd018035790
Here's the pvinfo:
$ pvinfo S01-DAQTBT:Raw:Data
S01-DAQTBT:Raw:Data
Server: 164.54.48.193:45645
Type:
structure
uint objectId
double objectTime
time_t objectTimestamp
long secondsPastEpoch
int nanoseconds
int userTag
ushort doubleSector
structure[] events
structure
structure data
ushort value
time_t timestamp
long secondsPastEpoch
int nanoseconds
int userTag
uint nTurnsInObject
ulong firstTurnNumber
time_t firstTurnTimestamp
long secondsPastEpoch
int nanoseconds
int userTag
ulong lastTurnNumber
time_t turnReferenceTimestamp
long secondsPastEpoch
int nanoseconds
int userTag
structure[] invalidTurns
structure
structure data
uint value
time_t timestamp
long secondsPastEpoch
int nanoseconds
int userTag
structure[] missingTurns
structure
structure data
uint value
time_t timestamp
long secondsPastEpoch
int nanoseconds
int userTag
double[] time
uint[] secondsPastEpoch
uint[] nanoseconds
uint[] relativeTurn
structure s40ap5
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s40ap6
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s40bp6
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s40bp5
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s40bp4
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s40bp3
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s40bp2
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s40bp1
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s40bp0
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s01ap0
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s01ap1
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s01ap2
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s01ap3
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s01ap4
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s01ap5
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s01ap6
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s01bp6
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s01bp5
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s01bp4
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s01bp3
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s01bp2
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s01bp1
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s01bp0
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s02ap0
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s02ap1
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s02ap2
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s02ap3
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
structure s02ap4
int[] va
int[] vb
int[] vc
int[] vd
int[] sum
int[] q
int[] x
int[] y
I reran the scalability test on v4.0.0a3, and it performs better than the previous release, but the throughput is still worse than v3.5.
Here are the receiving data rates (average samples per second) for 4.0.0a3 and 3.5:
... Here's the pvinfo ...
Wow! That is quite a large struct. I count 278 fields. It is possible that there is an issue with structures with >64 fields, possibly related to changed bitmask handling.
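For scale: if each field gets one bit in the changed BitSet (plus, conventionally, bit 0 for the whole structure), 278 fields need 279 bits, i.e. ceil(279/64) = 5 64-bit words rather than a single word.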
Can you attach the output of the pvxmonitor CLI utility with this PV for a few updates? The default -F delta will show which fields are/aren't changing. If this doesn't give any hints, I may need to ask for a wireshark capture.
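Something like the following, with the PV name taken from the pvinfo above:
pvxmonitor -F delta S01-DAQTBT:Raw:Data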
... Here are the receiving data rates ...
While this is a nice graph, it doesn't give me any hints about why there is a difference. As long as I'm unable to replicate this scenario, I'll have to depend on you, who can, to investigate what is (or isn't) going on. eg. do you notice significant CPU load in any of the cases? Do you have access to, or experience with, any profiling tools?
Here's the output from pvxmonitor -F delta: pvxmon.out.zip
So I do track the CPU and memory usage for each run. v4.0 does use more CPU and significantly more memory.
For example, in the 5-monitor run with p4p 4.0, the average stats:
2021-10-25 13:19:50 INFO P4P PV Gateway: CPU percent AVG 104.118644 MAX 126.700000, MEM (MiB) AVG 7005.788758 MEM 12902.400000
The top CPU usage per thread:
2021-10-25 13:19:50 INFO P4P PV Gateway | GWQ (829722): CPU percent AVG 17.000000 MAX 99.900000
2021-10-25 13:19:50 INFO P4P PV Gateway | PVXTCP (829723): CPU percent AVG 86.235849 MAX 99.900000
2021-10-25 13:19:50 INFO P4P PV Gateway | PVXCTCP (829720): CPU percent AVG 20.903704 MAX 37.500000
2021-10-25 13:19:50 INFO P4P PV Gateway | pvagw (829475): CPU percent AVG 0.000000 MAX 0.000000
2021-10-25 13:19:50 INFO P4P PV Gateway | pvagw (829675): CPU percent AVG 0.000000 MAX 0.000000
2021-10-25 13:19:50 INFO P4P PV Gateway | pvagw (829676): CPU percent AVG 0.000000 MAX 0.000000
2021-10-25 13:19:50 INFO P4P PV Gateway | pvagw (829677): CPU percent AVG 0.000000 MAX 0.000000
2021-10-25 13:19:50 INFO P4P PV Gateway | pvagw (829678): CPU percent AVG 0.000000 MAX 0.000000
2021-10-25 13:19:50 INFO P4P PV Gateway | pvagw (829679): CPU percent AVG 0.000000 MAX 0.000000
By comparison here are the p4p 3.5 stats:
2021-10-25 14:13:15 INFO P4P PV Gateway: CPU percent AVG 65.112698 MAX 113.300000, MEM (MiB) AVG 251.745536 MEM 272.667969
2021-10-25 14:13:15 INFO P4P PV Gateway | TCP-rx (986221): CPU percent AVG 20.975000 MAX 40.000000
2021-10-25 14:13:15 INFO P4P PV Gateway | TCP-tx (986235): CPU percent AVG 11.451667 MAX 33.300000
2021-10-25 14:13:15 INFO P4P PV Gateway | TCP-tx (986231): CPU percent AVG 10.478333 MAX 20.000000
2021-10-25 14:13:15 INFO P4P PV Gateway | TCP-tx (986237): CPU percent AVG 9.280000 MAX 20.000000
2021-10-25 14:13:15 INFO P4P PV Gateway | TCP-tx (986229): CPU percent AVG 9.603333 MAX 13.300000
2021-10-25 14:13:15 INFO P4P PV Gateway | TCP-tx (986233): CPU percent AVG 9.713333 MAX 13.300000
2021-10-25 14:13:15 INFO P4P PV Gateway | TCP-rx (986228): CPU percent AVG 6.533333 MAX 6.700000
2021-10-25 14:13:15 INFO P4P PV Gateway | TCP-rx (986230): CPU percent AVG 6.533333 MAX 6.700000
2021-10-25 14:13:15 INFO P4P PV Gateway | TCP-rx (986232): CPU percent AVG 6.700000 MAX 6.700000
2021-10-25 14:13:15 INFO P4P PV Gateway | TCP-rx (986234): CPU percent AVG 6.450000 MAX 6.700000
2021-10-25 14:13:15 INFO P4P PV Gateway | TCP-rx (986236): CPU percent AVG 6.533333 MAX 6.700000
Two things stand out. First, v4.0 memory spiked up to 12GB and on average used over 20 times more memory. Second, the GWQ and PVXTCP threads can max out a CPU core, with PVXTCP running at full capacity most of the time. This may indicate a bottleneck in those threads.
I attached the graphs from a few runs here: graphs.zip
I hope that helps.
... significantly more memory ...
How did you come by this number? Is this allocated virtual memory? Currently resident memory? Something else?
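For reference, one way to distinguish these for a running process (the <pid> placeholder is the gateway process ID; field names per proc(5)):
grep -E 'VmSize|VmRSS|VmHWM' /proc/<pid>/status
VmSize is allocated virtual memory, VmRSS is currently resident, and VmHWM is the resident peak.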
... that the GWQ and PVXTCP can max out the CPU core, with PVXTCP running most of time at full capacity ...
Ok, now we are getting somewhere. Can you do some profiling to see if there are any obvious hot spots in the code? eg. using linux-perf or a similar tool that you or someone at APS has experience with.
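A sketch of one way to do this with linux-perf (the PID and duration are placeholders; the last step assumes the FlameGraph scripts are on the PATH):
perf record -g -p <pvagw pid> -- sleep 60
perf report -n --stdio > profile.out
perf script | stackcollapse-perf.pl | flamegraph.pl > profile.svg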
The memory is the resident memory from top.
I attached the profile from perf running the test with 8 PV monitors. The .out file is from running perf report -n --stdio, and the .svg is a flame graph from the profile. I noticed that for one of the runs it also crashed with the same "corrupt double-link list" error, and the memory usage went up to 20GB.
As a note, I've still been unable to reproduce a similar crash.
~1.5 years on, and I have not had any similar reports. Perhaps I have managed to fix whatever caused this crash incidentally?