epics-base / p4p

Python bindings for the PVAccess network client and server.
BSD 3-Clause "New" or "Revised" License

Issues found testing P4P 4.0.0 pre-release #76

Closed: mdavidsaver closed this issue 1 year ago

mdavidsaver commented 2 years ago
On 9/9/21 11:31 AM, Chandler, Elaine wrote:
Hi Michael,

We ran some load tests on the PV Gateway with the P4P pre-release (4.0) and found a couple of potential issues that we wanted to make sure you were aware of.

1. PV clients not receiving data when data connection is saturated
We tested with an increasing number of PV monitors and found that when the data rate exceeds the NIC's limit, the PV monitors stop receiving any data. With P2P and P4P 3.5, the gateway would still send data, but at a lower rate. The table below shows the sample and data rates received by the PV monitors.

Gateway | # monitors | avg sample rate (Hz) | avg data rate (MiB/s) | total data rate (MB/s)
-- | -- | -- | -- | --
p4p 4.0 | 2 | 10.00 | 2370 | 4740
p4p 4.0 | 3 | 10.00 | 2370 | 7110
p4p 4.0 | 4 | 9.74 | 2308 | 9232
p4p 4.0 | 5 | 9.70 | 2299 | 11495
p4p 4.0 | 6 | 0.03 | 8 | 48
p4p 4.0 | 7 | 0.00 | 0 | 0
p4p 4.0 | 8 | 0.00 | 0 | 0
p4p 4.0 | 9 | 0.00 | 0 | 0
p4p 4.0 | 10 | 0.11 | 26 | 263
p4p 4.0 | 11 | 0.07 | 15 | 169
p4p 4.0 | 12 | 0.00 | 0 | 0


2. PV Gateway crashes when running on a machine with higher load

We found that when we ran the PV Gateway with our regular load test, it would crash with either a "corrupted double-linked list" or "free(): corrupted unsorted chunks" error. In our test scenario we have 3 IOCs, the PV Gateway, and 3 other services running on the server. I noticed that the memory usage for the gateway goes up right before the crash:
(screenshot: gateway memory usage rising just before the crash)
mdavidsaver commented 2 years ago

I need more information as I'm unable to replicate either a stall or a crash.

echandler-anl commented 2 years ago

Here are the thread backtraces from the core file: callstacks.txt

mdavidsaver commented 2 years ago

Are you also able to reproduce using a debug build? (e.g. linux-x86_64-debug)

FYI, you could run something like the following, where python3 is whatever interpreter executable you use.

gdb --args python3 ./bin/linux-x86_64-debug/pvagw ...

Which is equivalent to, e.g.:


export PYTHONPATH=${TOP}/python3.9/linux-x86_64-debug
gdb --args python3 -m p4p.gw ...
echandler-anl commented 2 years ago

How do I create a debug build? I only see linux-x86_64



mdavidsaver commented 2 years ago
# append a -debug variant of the host arch to the EPICS Base build targets
cat <<EOF >> epics-base/configure/CONFIG_SITE.local
CROSS_COMPILER_TARGET_ARCHS+=\$(EPICS_HOST_ARCH)-debug
EOF
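With that appended, rebuilding should produce the debug binaries. Roughly (a sketch assuming epics-base, pvxs, and p4p checkouts side by side; adjust paths to your tree):

# rebuild Base with the new $(EPICS_HOST_ARCH)-debug target,
# then the modules so that bin/linux-x86_64-debug/ appears
make -C epics-base
make -C pvxs
make -C p4p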
echandler-anl commented 2 years ago

Yes I'm able to reproduce on a debug build. I attached the callstacks: callstack-debug.txt

mdavidsaver commented 2 years ago

Ok, there are two interesting threads.

Thread 4 (Thread 0x7fd089f2d700 (LWP 494001)):
#0  0x00007fd0a6183464 in __memchr_avx2 () from /lib64/libc.so.6
#1  0x00007fd09848c778 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::find_first_of (this=0x7fd02003cf30, __s=0x7fd0989b4089 "<[-", __pos=6, __n=3) at /home/builder/ktietz/cos6/ci_cos6/ctng-compilers_1622658800915/work/.build/x86_64-conda-linux-gnu/build/build-cc-gcc-final/x86_64-conda-linux-gnu/libstdc++-v3/include/bits/basic_string.h:186
#2  0x00007fd0988eb53a in pvxs::Value::traverse (this=0x7fd089f2b180, expr=..., modify=true, dothrow=false) at ../data.cpp:772
#3  0x00007fd0988ec656 in pvxs::Value::operator[] (this=0x7fd010001970, name=...) at ../data.cpp:905
#4  0x00007fd0988ea8bd in pvxs::Value::copyIn (this=0x7fd010001970, ptr=0x7fd089f2cc60, type=pvxs::StoreType::Compound) at ../data.cpp:693
#5  0x00007fd0988e78a9 in pvxs::Value::assign (this=0x7fd010001970, o=...) at ../data.cpp:160
#6  0x00007fd098953e0c in pvxs::impl::(anonymous namespace)::ServerMonitorControl::doPost (this=0x7fd01801f6d0, val=..., maybe=false, force=false) at ../servermon.cpp:239
#7  0x00007fd08c7c1d48 in pvxs::server::MonitorControlOp::post (this=0x7fd01801f6d0, val=...) at /home/phoebus/ECHANDLER/sandbox/epics/pvxs/include/pvxs/source.h:80
#8  0x00007fd08c7b931b in p4p::onSubEvent (sub=..., pv=...) at ../pvxs_gw.cpp:574
#9  0x00007fd08c7b9af9 in p4p::GWChan::<lambda(pvxs::client::Subscription&, const pvxs::Value&)>::<lambda(pvxs::client::Subscription&)>::<lambda()>::operator()(void) const (__closure=0x7fd0228bd690) at ../pvxs_gw.cpp:684
#10 0x00007fd08c7be9cb in std::_Function_handler<void(), p4p::GWChan::onSubscribe(const std::shared_ptr<p4p::GWChan>&, std::unique_ptr<pvxs::server::MonitorSetupOp>&&)::<lambda(pvxs::client::Subscription&, const pvxs::Value&)>::<lambda(pvxs::client::Subscription&)>::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/8/bits/std_function.h:297
#11 0x00007fd08c7c68f2 in std::function<void ()>::operator()() const (this=0x7fd089f2cda0) at /usr/include/c++/8/bits/std_function.h:687
#12 0x00007fd08c7bc1e3 in p4p::GWSource::run (this=0x56088437fd40) at ../pvxs_gw.cpp:912
#13 0x00007fd09814f0dc in epicsThreadCallEntryPoint (pPvt=0x56088437fe68) at ../osi/epicsThread.cpp:95
#14 0x00007fd09815606f in start_routine (arg=0x560884443020) at ../osi/os/posix/osdThread.c:418
#15 0x00007fd0a63f714a in start_thread () from /lib64/libpthread.so.0
#16 0x00007fd0a6126dc3 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fd087b2e700 (LWP 494002)):
#0  0x00007fd098cf73c2 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7fd0203b83b0) at /usr/include/c++/8/bits/shared_ptr_base.h:155
#1  0x00007fd098cf5509 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7fd087b2d188, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr_base.h:728
#2  0x00007fd0988fc8f0 in std::__shared_ptr<int const, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7fd087b2d180, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr_base.h:1167
#3  0x00007fd0988fc90c in std::shared_ptr<int const>::~shared_ptr (this=0x7fd087b2d180, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr.h:103
#4  0x00007fd0988fc928 in pvxs::detail::sa_base<int const>::~sa_base (this=0x7fd087b2d180, __in_chrg=<optimized out>) at ../pvxs/sharedArray.h:96
#5  0x00007fd0988fc944 in pvxs::shared_array<int const, void>::~shared_array (this=0x7fd087b2d180, __in_chrg=<optimized out>) at ../pvxs/sharedArray.h:238
#6  0x00007fd0988f89a4 in pvxs::impl::to_wire<int> (buf=..., varr=...) at ../pvaproto.h:416
#7  0x00007fd0988f5a41 in pvxs::impl::to_wire_field (buf=..., desc=0x7fd02003b230, store=...) at ../dataencode.cpp:343
#8  0x00007fd0988f5357 in pvxs::impl::to_wire_field (buf=..., desc=0x7fd0200322c0, store=...) at ../dataencode.cpp:235
#9  0x00007fd0988f63aa in pvxs::impl::to_wire_valid (buf=..., val=..., mask=0x7fd018035878) at ../dataencode.cpp:445
#10 0x00007fd098953358 in pvxs::impl::(anonymous namespace)::MonitorOp::doReply (this=0x7fd018035790) at ../servermon.cpp:138
#11 0x00007fd0989587e7 in std::__invoke_impl<void, void (pvxs::impl::(anonymous namespace)::MonitorOp::*&)(), std::shared_ptr<pvxs::impl::(anonymous namespace)::MonitorOp>&> (__f=@0x7fd022014dd0: (void (pvxs::impl::(anonymous namespace)::MonitorOp::*)(pvxs::impl::(anonymous namespace)::MonitorOp * const)) 0x7fd098952e62 <pvxs::impl::(anonymous namespace)::MonitorOp::doReply()>, __t=...) at /usr/include/c++/8/bits/invoke.h:73
#12 0x00007fd0989584d9 in std::__invoke<void (pvxs::impl::(anonymous namespace)::MonitorOp::*&)(), std::shared_ptr<pvxs::impl::(anonymous namespace)::MonitorOp>&> (__fn=@0x7fd022014dd0: (void (pvxs::impl::(anonymous namespace)::MonitorOp::*)(pvxs::impl::(anonymous namespace)::MonitorOp * const)) 0x7fd098952e62 <pvxs::impl::(anonymous namespace)::MonitorOp::doReply()>) at /usr/include/c++/8/bits/invoke.h:95
#13 0x00007fd098958272 in std::_Bind<void (pvxs::impl::(anonymous namespace)::MonitorOp::*(std::shared_ptr<pvxs::impl::(anonymous namespace)::MonitorOp>))()>::__call<void, 0>(std::tuple<> &&, std::_Index_tuple<0>) (this=0x7fd022014dd0, __args=...) at /usr/include/c++/8/functional:400
#14 0x00007fd098957e4a in std::_Bind<void (pvxs::impl::(anonymous namespace)::MonitorOp::*(std::shared_ptr<pvxs::impl::(anonymous namespace)::MonitorOp>))()>::operator()<>(void) (this=0x7fd022014dd0) at /usr/include/c++/8/functional:484
#15 0x00007fd098957428 in std::_Function_handler<void(), std::_Bind<void (pvxs::impl::(anonymous namespace)::MonitorOp::*(std::shared_ptr<pvxs::impl::(anonymous namespace)::MonitorOp>))()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/8/bits/std_function.h:297
#16 0x00007fd0988c439a in std::function<void ()>::operator()() const (this=0x7fd087b2db70) at /usr/include/c++/8/bits/std_function.h:687
#17 0x00007fd09893b76e in pvxs::impl::ServerConn::bevWrite (this=0x7fd018015490) at ../serverconn.cpp:370
#18 0x00007fd0989272eb in pvxs::impl::ConnBase::bevWriteS (bev=0x7fd01802cf70, ptr=0x7fd018015490) at ../conn.cpp:265
#19 0x00007fd097ee301a in bufferevent_run_deferred_callbacks_locked () from /lib64/libevent_core-2.1.so.6
#20 0x00007fd097eed788 in event_process_active_single_queue () from /lib64/libevent_core-2.1.so.6
#21 0x00007fd097eee3b7 in event_base_loop () from /lib64/libevent_core-2.1.so.6
#22 0x00007fd09890f8b9 in pvxs::impl::evbase::Pvt::run (this=0x5608844581e0) at ../evhelper.cpp:198
#23 0x00007fd09814f0dc in epicsThreadCallEntryPoint (pPvt=0x560884458278) at ../osi/epicsThread.cpp:95
#24 0x00007fd09815606f in start_routine (arg=0x5608840eb340) at ../osi/os/posix/osdThread.c:418
#25 0x00007fd0a63f714a in start_thread () from /lib64/libpthread.so.0
#26 0x00007fd0a6126dc3 in clone () from /lib64/libc.so.6
mdavidsaver commented 2 years ago

I can see two potential areas where bugs might be found: concurrency, and PVD type handling.

To help understand the first, I'd like to know whether both of these threads are operating on the same (downstream) Channel.

Compare the local `mon` in frame 6 of Thread 4 with `this` in frame 10 of Thread 1. If these point to the same `MonitorOp` then either the line numbers are wrong, or something strange has happened, as `MonitorOp::lock` should be held at both servermon.cpp:239 and servermon.cpp:138.

Also, in frame 10 of Thread 1 the expression `ch->name` will give a PV name. I'd like to have the output of pvinfo for this PV.

Thread 4 (Thread 0x7fd089f2d700 (LWP 494001)):
...
#6  0x00007fd098953e0c in pvxs::impl::(anonymous namespace)::ServerMonitorControl::doPost (this=0x7fd01801f6d0, val=..., maybe=false, force=false) at ../servermon.cpp:239
Thread 1 (Thread 0x7fd087b2e700 (LWP 494002)):
...
#10 0x00007fd098953358 in pvxs::impl::(anonymous namespace)::MonitorOp::doReply (this=0x7fd018035790) at ../servermon.cpp:138
echandler-anl commented 2 years ago

The pointer values are different.

(gdb) thread 4
[Switching to thread 4 (Thread 0x7fd089f2d700 (LWP 494001))]
#0  0x00007fd0a6183464 in __memchr_avx2 () from /lib64/libc.so.6
(gdb) frame 6
#6  0x00007fd098953e0c in pvxs::impl::(anonymous namespace)::ServerMonitorControl::doPost (this=0x7fd01801f6d0, val=..., maybe=false, force=false) at ../servermon.cpp:239
239                 mon->queue.back().assign(val);
(gdb) print mon
$1 = {<std::__shared_ptr<pvxs::impl::(anonymous namespace)::MonitorOp, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<pvxs::impl::(anonymous namespace)::MonitorOp, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x7fd018006610, _M_refcount = {
      _M_pi = 0x7fd018006600}}, <No data fields>}
(gdb) thread 1
[Switching to thread 1 (Thread 0x7fd087b2e700 (LWP 494002))]
#0  0x00007fd098cf73c2 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7fd0203b83b0) at /usr/include/c++/8/bits/shared_ptr_base.h:155
155         _M_dispose();
(gdb) frame 10
#10 0x00007fd098953358 in pvxs::impl::(anonymous namespace)::MonitorOp::doReply (this=0x7fd018035790) at ../servermon.cpp:138
138                     to_wire_valid(R, ent, &pvMask);
(gdb) print this
$2 = (pvxs::impl::(anonymous namespace)::MonitorOp * const) 0x7fd018035790

Here's the pvinfo output:

$ pvinfo S01-DAQTBT:Raw:Data
S01-DAQTBT:Raw:Data
Server: 164.54.48.193:45645
Type:
    structure
        uint objectId
        double objectTime
        time_t objectTimestamp
            long secondsPastEpoch
            int nanoseconds
            int userTag
        ushort doubleSector
        structure[] events
            structure
                structure data
                    ushort value
                time_t timestamp
                    long secondsPastEpoch
                    int nanoseconds
                    int userTag
        uint nTurnsInObject
        ulong firstTurnNumber
        time_t firstTurnTimestamp
            long secondsPastEpoch
            int nanoseconds
            int userTag
        ulong lastTurnNumber
        time_t turnReferenceTimestamp
            long secondsPastEpoch
            int nanoseconds
            int userTag
        structure[] invalidTurns
            structure
                structure data
                    uint value
                time_t timestamp
                    long secondsPastEpoch
                    int nanoseconds
                    int userTag
        structure[] missingTurns
            structure
                structure data
                    uint value
                time_t timestamp
                    long secondsPastEpoch
                    int nanoseconds
                    int userTag
        double[] time
        uint[] secondsPastEpoch
        uint[] nanoseconds
        uint[] relativeTurn
        structure s40ap5
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s40ap6
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s40bp6
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s40bp5
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s40bp4
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s40bp3
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s40bp2
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s40bp1
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s40bp0
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s01ap0
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s01ap1
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s01ap2
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s01ap3
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s01ap4
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s01ap5
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s01ap6
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s01bp6
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s01bp5
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s01bp4
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s01bp3
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s01bp2
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s01bp1
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s01bp0
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s02ap0
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s02ap1
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s02ap2
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s02ap3
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
        structure s02ap4
            int[] va
            int[] vb
            int[] vc
            int[] vd
            int[] sum
            int[] q
            int[] x
            int[] y
echandler-anl commented 2 years ago

I reran the scalability test on v4.0.0a3, and it performs better than the previous pre-release, but the throughput is still worse than v3.5.

Here are the received data rates (average samples per second) for 4.0.0a3 and 3.5:

(graph: average received samples per second vs. number of monitors, for 4.0.0a3 and 3.5)
mdavidsaver commented 2 years ago

... Here's the pvinfo ...

Wow! That is quite a large struct. I count 278 fields. It is possible that there is an issue with structures having more than 64 fields, possibly related to the changed bitmask handling.

Can you attach the output of the pvxmonitor CLI utility with this PV for a few updates?

The default `-F delta` will show which fields are/aren't changing. If this doesn't give any hints, I may need to ask for a Wireshark capture.
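For reference, a minimal invocation would look something like this (PV name taken from the pvinfo above; a few updates are enough before interrupting with Ctrl-C):

# print only the fields which changed in each update (the default format)
pvxmonitor -F delta S01-DAQTBT:Raw:Data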

mdavidsaver commented 2 years ago

... Here's the receiving data rates ...

While this is a nice graph, it doesn't give me any hints about why there is a difference. As long as I'm unable to replicate this scenario, I'll have to depend on those of you who can to investigate what is (or isn't) going on. E.g. do you notice significant CPU load in any cases? Do you have access to, or experience with, any profiling tools?

echandler-anl commented 2 years ago

Here's the output from pvxmonitor -F delta: pvxmon.out.zip

echandler-anl commented 2 years ago

So I do track the CPU and memory usage for each run. v4.0 does use more CPU and significantly more memory.
For example, in the 5-monitor run with p4p 4.0, the average stats were:

2021-10-25 13:19:50 INFO P4P PV Gateway: CPU percent AVG 104.118644 MAX 126.700000, MEM (MiB) AVG 7005.788758 MEM 12902.400000

The top CPU usage per thread:

2021-10-25 13:19:50 INFO   P4P PV Gateway | GWQ (829722): CPU percent AVG 17.000000 MAX 99.900000
2021-10-25 13:19:50 INFO   P4P PV Gateway | PVXTCP (829723): CPU percent AVG 86.235849 MAX 99.900000
2021-10-25 13:19:50 INFO   P4P PV Gateway | PVXCTCP (829720): CPU percent AVG 20.903704 MAX 37.500000
2021-10-25 13:19:50 INFO   P4P PV Gateway | pvagw (829475): CPU percent AVG 0.000000 MAX 0.000000
2021-10-25 13:19:50 INFO   P4P PV Gateway | pvagw (829675): CPU percent AVG 0.000000 MAX 0.000000
2021-10-25 13:19:50 INFO   P4P PV Gateway | pvagw (829676): CPU percent AVG 0.000000 MAX 0.000000
2021-10-25 13:19:50 INFO   P4P PV Gateway | pvagw (829677): CPU percent AVG 0.000000 MAX 0.000000
2021-10-25 13:19:50 INFO   P4P PV Gateway | pvagw (829678): CPU percent AVG 0.000000 MAX 0.000000
2021-10-25 13:19:50 INFO   P4P PV Gateway | pvagw (829679): CPU percent AVG 0.000000 MAX 0.000000

By comparison, here are the p4p 3.5 stats:

2021-10-25 14:13:15 INFO P4P PV Gateway: CPU percent AVG 65.112698 MAX 113.300000, MEM (MiB) AVG 251.745536 MEM 272.667969

2021-10-25 14:13:15 INFO   P4P PV Gateway | TCP-rx (986221): CPU percent AVG 20.975000 MAX 40.000000
2021-10-25 14:13:15 INFO   P4P PV Gateway | TCP-tx (986235): CPU percent AVG 11.451667 MAX 33.300000
2021-10-25 14:13:15 INFO   P4P PV Gateway | TCP-tx (986231): CPU percent AVG 10.478333 MAX 20.000000
2021-10-25 14:13:15 INFO   P4P PV Gateway | TCP-tx (986237): CPU percent AVG 9.280000 MAX 20.000000
2021-10-25 14:13:15 INFO   P4P PV Gateway | TCP-tx (986229): CPU percent AVG 9.603333 MAX 13.300000
2021-10-25 14:13:15 INFO   P4P PV Gateway | TCP-tx (986233): CPU percent AVG 9.713333 MAX 13.300000
2021-10-25 14:13:15 INFO   P4P PV Gateway | TCP-rx (986228): CPU percent AVG 6.533333 MAX 6.700000
2021-10-25 14:13:15 INFO   P4P PV Gateway | TCP-rx (986230): CPU percent AVG 6.533333 MAX 6.700000
2021-10-25 14:13:15 INFO   P4P PV Gateway | TCP-rx (986232): CPU percent AVG 6.700000 MAX 6.700000
2021-10-25 14:13:15 INFO   P4P PV Gateway | TCP-rx (986234): CPU percent AVG 6.450000 MAX 6.700000
2021-10-25 14:13:15 INFO   P4P PV Gateway | TCP-rx (986236): CPU percent AVG 6.533333 MAX 6.700000

Two things stand out. First, v4.0's memory spiked up to 12 GB, and on average it used over 20 times more memory. Second, the GWQ and PVXTCP threads can each max out a CPU core, with PVXTCP running at full capacity most of the time. This indicates there may be a bottleneck in those threads.

I attached the graphs from a few runs here: graphs.zip

I hope that helps.

mdavidsaver commented 2 years ago

... significantly more memory ...

How did you come by this number? Is this allocated virtual memory? Currently resident memory? Something else?
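For example, the kernel's own accounting distinguishes these; a quick check against the running gateway might look like the following (matching the process by name is an assumption about your setup):

# VmRSS = currently resident memory, VmSize = allocated virtual address space
grep -E 'VmRSS|VmSize' /proc/$(pgrep -f pvagw)/status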

... that the GWQ and PVXTCP can max out the CPU core, with PVXTCP running most of time at full capacity ...

Ok, now we are getting somewhere. Can you do some profiling to see if there are any obvious hot spots in the code? E.g. using linux-perf or a similar tool that you or someone at APS has experience with.
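As a rough sketch of what I have in mind (the sampling frequency, duration, and pgrep match are arbitrary choices, not a prescription):

# sample call stacks of the running gateway for 60s, then summarize
perf record -F 99 -g -p $(pgrep -f pvagw) -- sleep 60
perf report -n --stdio > perf-report.out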

echandler-anl commented 2 years ago

The memory is the resident memory from top.

I attached the profile from perf, running the test with 8 PV monitors.

p4pperf.zip

The .out file is from running perf report -n --stdio, and the .svg is a flamegraph from the profile. I noticed that one of the runs also crashed with the same "corrupted double-linked list" error, and the memory usage went up to 20 GB.

mdavidsaver commented 2 years ago

As a note, I've still been unable to reproduce a similar crash.

mdavidsaver commented 1 year ago

~1.5 years on, and I have not had any similar reports. Perhaps I have incidentally managed to fix whatever caused this crash?