aristanetworks / sonic

Open source drivers and initialization library for Arista platforms running SONiC
GNU General Public License v2.0

[chassis] memory leak in pcied #89

Closed arlakshm closed 1 year ago

arlakshm commented 1 year ago

On the production device, the memory utilization of pcied is high. Pasting the email thread:

From: Patrick MacArthur <pmacarthur@arista.com> 
Sent: Friday, May 19, 2023 3:20 PM
To: Prince George <Prince.George@microsoft.com>
Cc: Rita Hui <Rita.Hui@microsoft.com>; Arvindsrinivasan Lakshmi Narasimhan <Arvindsrinivasan.Lakshmi@microsoft.com>; kennethcheung <kennethcheung@arista.com>; aaronp <aaronp@arista.com>; Veronica Cojocaru <vcojocaru@arista.com>; sonic-ext-support <sonic-ext-support@arista.com>; Wenyi Zhang <wenyizhang@microsoft.com>; kartik <kartik@arista.com>
Subject: Re: [EXTERNAL] Re: pcied on the supervisor hogging memory

Hi, Prince,

It's a memory leak that causes the memory usage in pcied to grow over time. There are circular references that are not getting cleaned up by the Python garbage collector.

We are looking into this and trying to get a fix for this as soon as possible.

Thanks,
Patrick

On Fri, May 19, 2023 at 6:02 PM Prince George <Prince.George@microsoft.com> wrote:
Thanks for confirming. What are the steps to reproduce?

From: Patrick MacArthur <pmacarthur@arista.com> 
Sent: Friday, May 19, 2023 2:57 PM
To: Prince George <Prince.George@microsoft.com>
Cc: Rita Hui <Rita.Hui@microsoft.com>; Arvindsrinivasan Lakshmi Narasimhan <Arvindsrinivasan.Lakshmi@microsoft.com>; kennethcheung <kennethcheung@arista.com>; aaronp <aaronp@arista.com>; Veronica Cojocaru <vcojocaru@arista.com>; sonic-ext-support <sonic-ext-support@arista.com>; Wenyi Zhang <wenyizhang@microsoft.com>; kartik <kartik@arista.com>
Subject: Re: [EXTERNAL] Re: pcied on the supervisor hogging memory

Hi, Prince, 

We have reproduced the memory leak locally, so we can debug the issue on our local test setup.

Thanks,
Patrick

On Fri, May 19, 2023 at 5:25 PM Prince George <Prince.George@microsoft.com> wrote:
Before restarting, should we debug on the live system? We may not be able to recreate the issue later.

From: Patrick MacArthur <pmacarthur@arista.com> 
Sent: Friday, May 19, 2023 1:04 PM
To: Rita Hui <Rita.Hui@microsoft.com>; Arvindsrinivasan Lakshmi Narasimhan <Arvindsrinivasan.Lakshmi@microsoft.com>
Cc: kennethcheung <kennethcheung@arista.com>; aaronp <aaronp@arista.com>; Veronica Cojocaru <vcojocaru@arista.com>; sonic-ext-support <sonic-ext-support@arista.com>; Prince George <Prince.George@microsoft.com>; Wenyi Zhang <wenyizhang@microsoft.com>; kartik <kartik@arista.com>
Subject: Re: [EXTERNAL] Re: pcied on the supervisor hogging memory

Hi, Arvind, 

We are currently looking into what the source of this memory leak may be.

In the meantime, it would be useful to obtain `/var/log/syslog*` and `/var/log/arista*` on the affected switch.

You should be able to recover the leaked memory by restarting the pcied service.

Please let us know if you have any other questions or concerns.

Thanks,
Patrick

On Fri, May 19, 2023 at 3:11 PM Rita Hui <Rita.Hui@microsoft.com> wrote:
Adding Kartik as well. This is the pilot device.

From: Rita Hui 
Sent: Friday, May 19, 2023 12:08 PM
To: Arvindsrinivasan Lakshmi Narasimhan <Arvindsrinivasan.Lakshmi@microsoft.com>; kennethcheung <kennethcheung@arista.com>; aaronp <aaronp@arista.com>; Patrick MacArthur <pmacarthur@arista.com>
Cc: Veronica Cojocaru <vcojocaru@arista.com>; sonic-ext-support <sonic-ext-support@arista.com>; Prince George <Prince.George@microsoft.com>; Wenyi Zhang <wenyizhang@microsoft.com>
Subject: RE: [EXTERNAL] Re: pcied on the supervisor hogging memory

Adding Wenyi as well.

From: Arvindsrinivasan Lakshmi Narasimhan <Arvindsrinivasan.Lakshmi@microsoft.com> 
Sent: Friday, May 19, 2023 12:00 PM
To: kennethcheung <kennethcheung@arista.com>; aaronp <aaronp@arista.com>; Patrick MacArthur <pmacarthur@arista.com>
Cc: Veronica Cojocaru <vcojocaru@arista.com>; sonic-ext-support <sonic-ext-support@arista.com>; Rita Hui <Rita.Hui@microsoft.com>; Prince George <Prince.George@microsoft.com>
Subject: RE: [EXTERNAL] Re: pcied on the supervisor hogging memory

Hi @Kenneth, @Patrick,
Can you please advise what the next steps are for this production issue?

Thanks,
Arvind

From: Kenneth Cheung <kennethcheung@arista.com> 
Sent: Thursday, May 18, 2023 3:05 PM
To: aaronp <aaronp@arista.com>; Patrick MacArthur <pmacarthur@arista.com>
Cc: Arvindsrinivasan Lakshmi Narasimhan <Arvindsrinivasan.Lakshmi@microsoft.com>; Veronica Cojocaru <vcojocaru@arista.com>; sonic-ext-support <sonic-ext-support@arista.com>; Rita Hui <Rita.Hui@microsoft.com>; Prince George <Prince.George@microsoft.com>
Subject: [EXTERNAL] Re: pcied on the supervisor hogging memory

+Patrick MacArthur 

On Thu, May 18, 2023 at 9:06 AM Aaron Payment <aaronp@arista.com> wrote:
+Kenneth Cheung +Veronica Cojocaru 

On Wed, May 17, 2023 at 5:14 PM 'Arvindsrinivasan Lakshmi Narasimhan' via sonic-ext-support <sonic-ext-support@arista.com> wrote:
Hi,
On the Arista 7808 chassis running SONiC, we are seeing an issue where the pcied process is hogging a lot of memory on the supervisor module. Logs are pasted below.
I suspect there might be a memory leak here.
The device is still in this state, so please let us know what logs we can collect to triage this issue.
We can also have a debug session if needed.

Logs
-----
admin@STG01-0101-0400-01T2-sup00:~$ docker stats --no-stream
CONTAINER ID   NAME               CPU %     MEM USAGE / LIMIT     MEM %     NET I/O         BLOCK I/O         PIDS
e0987182ba6e   acms               0.02%     49.45MiB / 62.79GiB   0.08%     0B / 0B         324kB / 97.8MB    9
ce908a7bf4b1   syncd10            5.96%     244.6MiB / 62.79GiB   0.38%     374MB / 401MB   349kB / 102kB     33
69b06047b900   syncd8             6.87%     239.8MiB / 62.79GiB   0.37%     374MB / 401MB   329kB / 102kB     33
d1d39e133a4f   syncd5             5.86%     241.9MiB / 62.79GiB   0.38%     376MB / 400MB   321kB / 102kB     33
10669c176fb4   syncd11            6.97%     247.4MiB / 62.79GiB   0.38%     373MB / 401MB   480kB / 102kB     33
ec60c5f42942   syncd7             6.00%     246.5MiB / 62.79GiB   0.38%     375MB / 401MB   326kB / 102kB     33
553f2a663751   syncd4             5.81%     246.4MiB / 62.79GiB   0.38%     377MB / 400MB   47.3MB / 102kB    33
33514d9d9022   teamd11            0.06%     33.46MiB / 62.79GiB   0.05%     373MB / 401MB   16.4kB / 90.1kB   12
7f02984795a3   teamd4             0.05%     33.45MiB / 62.79GiB   0.05%     377MB / 400MB   16.4kB / 90.1kB   12
53b75877afe3   syncd6             5.85%     243.6MiB / 62.79GiB   0.38%     376MB / 401MB   345kB / 102kB     33
882a96f821cb   syncd3             5.78%     249.6MiB / 62.79GiB   0.39%     377MB / 400MB   323kB / 102kB     33
bdd60876c907   syncd2             8.17%     251.9MiB / 62.79GiB   0.39%     377MB / 401MB   327kB / 102kB     33
c965686ffb4d   teamd2             0.06%     33.43MiB / 62.79GiB   0.05%     377MB / 401MB   16.4kB / 90.1kB   12
548e67d66db6   teamd5             0.04%     33.48MiB / 62.79GiB   0.05%     376MB / 400MB   16.4kB / 90.1kB   12
bc0118138f21   teamd6             0.07%     33.43MiB / 62.79GiB   0.05%     376MB / 401MB   16.4kB / 90.1kB   12
a22201165207   teamd10            0.06%     31.47MiB / 62.79GiB   0.05%     374MB / 401MB   16.4kB / 90.1kB   12
310d0ed6d3af   syncd0             6.66%     244.4MiB / 62.79GiB   0.38%     377MB / 394MB   349kB / 102kB     33
2ae91af90371   swss6              0.05%     37.43MiB / 62.79GiB   0.06%     376MB / 401MB   53.2kB / 119kB    13
4b1586a1652a   syncd1             5.90%     246.9MiB / 62.79GiB   0.38%     377MB / 394MB   153MB / 102kB     33
04bcbbc53cbe   teamd9             0.05%     33.45MiB / 62.79GiB   0.05%     373MB / 401MB   16.4kB / 90.1kB   12
bf99a5fd0423   syncd9             5.98%     243.6MiB / 62.79GiB   0.38%     373MB / 401MB   484kB / 102kB     33
ba12638e9c0f   teamd0             0.07%     32.18MiB / 62.79GiB   0.05%     377MB / 394MB   2.68MB / 90.1kB   12
870fb63c3e3d   teamd8             0.05%     33.43MiB / 62.79GiB   0.05%     374MB / 401MB   16.4kB / 90.1kB   12
61db7c3ff6ee   swss10             0.06%     39.71MiB / 62.79GiB   0.06%     374MB / 401MB   53.2kB / 119kB    13
51d7767d7d3a   swss9              0.05%     39.83MiB / 62.79GiB   0.06%     373MB / 401MB   53.2kB / 119kB    13
1ccddaf8ba99   teamd1             0.04%     31.48MiB / 62.79GiB   0.05%     377MB / 394MB   16.4kB / 90.1kB   12
1dde84d82c6d   swss5              0.04%     39.67MiB / 62.79GiB   0.06%     376MB / 400MB   49.2kB / 119kB    13
9c14f07e762f   teamd7             0.06%     31.45MiB / 62.79GiB   0.05%     375MB / 401MB   16.4kB / 90.1kB   12
9e9265ace236   swss1              0.04%     42.75MiB / 62.79GiB   0.07%     377MB / 394MB   9.46MB / 119kB    13
81433040e297   swss11             0.04%     39.69MiB / 62.79GiB   0.06%     373MB / 401MB   53.2kB / 119kB    13
dca7a163f6e0   teamd3             0.07%     35.43MiB / 62.79GiB   0.06%     377MB / 400MB   16.4kB / 90.1kB   12
26b046322ebe   swss0              0.05%     39.42MiB / 62.79GiB   0.06%     377MB / 394MB   53.2kB / 119kB    13
83b6f5854258   snmp               9.60%     71.99MiB / 62.79GiB   0.11%     0B / 0B         13.6MB / 131kB    9
5fd77905db8e   telemetry          0.04%     44.88MiB / 62.79GiB   0.07%     0B / 0B         22.5kB / 111kB    7
d28befa16abf   swss8              0.05%     37.72MiB / 62.79GiB   0.06%     374MB / 401MB   53.2kB / 119kB    13
024ea0ef8945   swss4              0.06%     41.36MiB / 62.79GiB   0.06%     377MB / 400MB   53.2kB / 119kB    13
b459c63518d8   swss2              0.04%     39.48MiB / 62.79GiB   0.06%     377MB / 401MB   49.2kB / 119kB    13
f7fd7cad2177   radv               0.03%     31.48MiB / 62.79GiB   0.05%     0B / 0B         1.08MB / 77.8kB   6
7cc5ca2825ec   swss7              0.05%     47.64MiB / 62.79GiB   0.07%     375MB / 401MB   53.2kB / 119kB    13
2877cd4eb6da   swss3              0.05%     39.92MiB / 62.79GiB   0.06%     377MB / 400MB   53.2kB / 119kB    13
8912c4393851   lldp               0.03%     56.88MiB / 62.79GiB   0.09%     0B / 0B         8.19MB / 115kB    11
96d5c9b4c939   pmon               8.33%     33.75GiB / 62.79GiB   53.74%    0B / 0B         6.05MB / 115kB    13
e10867fb5900   database8          0.33%     53.91MiB / 62.79GiB   0.08%     374MB / 401MB   418kB / 73.7kB    11
19e0c856b342   database9          0.37%     53.77MiB / 62.79GiB   0.08%     373MB / 401MB   312kB / 73.7kB    11
37366c96535b   database6          0.36%     51.58MiB / 62.79GiB   0.08%     376MB / 401MB   307kB / 73.7kB    11
2b23bfe39338   database7          0.41%     53.75MiB / 62.79GiB   0.08%     375MB / 401MB   307kB / 73.7kB    11
7924bf46c2a5   database5          0.39%     53.83MiB / 62.79GiB   0.08%     376MB / 400MB   312kB / 73.7kB    11
d4c64da79d55   database4          0.41%     51.92MiB / 62.79GiB   0.08%     377MB / 400MB   307kB / 73.7kB    11
a5d303364163   database1          0.36%     52.03MiB / 62.79GiB   0.08%     377MB / 394MB   488kB / 73.7kB    11
a0e7fb7792de   database11         0.43%     53.67MiB / 62.79GiB   0.08%     373MB / 401MB   312kB / 73.7kB    11
a5859dd2d3ae   database0          0.40%     51.77MiB / 62.79GiB   0.08%     377MB / 394MB   307kB / 73.7kB    11
c2e99b5e48f9   database2          0.32%     51.62MiB / 62.79GiB   0.08%     377MB / 401MB   307kB / 73.7kB    11
4a446ff886f5   database3          0.35%     51.59MiB / 62.79GiB   0.08%     377MB / 400MB   307kB / 73.7kB    11
8a662b5cd524   database10         0.32%     49.9MiB / 62.79GiB    0.08%     374MB / 401MB   307kB / 73.7kB    11
af73a7e06deb   database           1.19%     47.62MiB / 62.79GiB   0.07%     0B / 0B         308kB / 69.6kB    11
ee4f7377ac01   database-chassis   0.39%     76.94MiB / 62.79GiB   0.12%     0B / 0B         44.3MB / 69.6kB   11
admin@STG01-0101-0400-01T2-sup00:~$ top -o %MEM -d 10 -n 1
top - 23:59:28 up 16 days, 13:53,  4 users,  load average: 6.86, 5.98, 4.23
Tasks: 723 total,   1 running, 719 sleeping,   0 stopped,   3 zombie
%Cpu(s): 38.5 us,  1.3 sy,  0.0 ni, 60.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  64299.1 total,  12560.6 free,  42861.0 used,   8877.4 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  20660.8 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                           
  11700 root      20   0   33.5g  33.5g  13504 S   0.0  53.3 154:29.86 pcied                                                                             
  18224 root      20   0 1753908 344456 119820 S  11.8   0.5   4352:00 syncd                                                                             
  18402 root      20   0 1753788 341324 120900 S   5.9   0.5   4333:11 syncd                                                                             
  17986 root      20   0 1753784 339404 120180 S 100.0   0.5   4327:56 syncd                                                                             
  17583 root      20   0 1753784 337272 119556 S 100.0   0.5   4320:16 syncd                                                                             
  17858 root      20   0 1753764 337204 119892 S  11.8   0.5   4344:43 syncd                                                                             
  17936 root      20   0 1753780 336584 120504 S  11.8   0.5   4338:41 syncd                                                                             
  18200 root      20   0 1753780 336392 120500 S 100.0   0.5   4323:44 syncd                                                                             
  17974 root      20   0 1753784 335716 119536 S 100.0   0.5   4323:24 syncd                                                                             
  17846 root      20   0 1753784 335060 119948 S   0.0   0.5   4336:11 syncd                                                                             
  18278 root      20   0 1753780 334928 120416 S   0.0   0.5   4339:34 syncd                                                                             
  17683 root      20   0 1753788 333260 120524 S   0.0   0.5   4334:37 syncd                                                                             
  18342 root      20   0 1753780 329876 119424 S   0.0   0.5   4344:30 syncd                                                                             
    911 root      20   0 7063192 151280  50156 S   0.0   0.2 520:10.70 dockerd                                                                           
    834 root      20   0 2161332  74320  31720 S   0.0   0.1 355:23.01 containerd                                                                        
   6722 root      20   0 1868012  64512  32220 S   0.0   0.1   2:40.66 docker                                                                            
   6886 root      20   0 1868012  64320  31912 S   0.0   0.1   2:38.15 docker                                                                            
   2646 root      20   0 1867500  62488  31744 S   0.0   0.1   2:34.48 docker                                                                            
   6941 root      20   0 1867244  62304  32220 S   0.0   0.1   2:36.03 docker                                                                            
   6823 root      20   0 1868012  61132  31764 S   0.0   0.1   2:34.51 docker                                                                            
   6832 root      20   0 1867756  61028  32156 S   0.0   0.1   2:38.09 docker                                                                            
   6859 root      20   0 1868012  60636  32400 S   0.0   0.1   2:37.40 docker                                                                            
   7395 root      20   0 1867500  60584  32456 S   0.0   0.1   2:38.54 docker                                                                            
   7346 root      20   0 1867756  60492  31772 S   0.0   0.1   2:40.57 docker                                                                            
   6780 root      20   0 1867244  60048  32224 S   0.0   0.1   2:37.78 docker                                                                            
   6792 root      20   0 1867756  59664  32484 S   0.0   0.1   2:36.04 docker                                                                            
   2198 root      20   0 1867756  58900  32344 S   0.0   0.1   2:36.70 docker                                                                            
   7177 root      20   0 1867756  58812  31976 S   0.0   0.1   2:34.87 docker                                                                            
   7029 root      20   0 1866092  58556  31984 S   0.0   0.1   2:31.98 docker                                                                            
   7461 root      20   0   62964  47856  14716 S   0.0   0.1 118:18.82 healthd                                                                           
   7534 root      20   0   71240  42136  10508 S   0.0   0.1   0:03.74 healthd                                                                           
   7524 root      20   0   64076  40844  10084 S   0.0   0.1   0:39.74 healthd                                                                           
admin@STG01-0101-0400-01T2-sup00:~$

Staphylo commented 1 year ago

@arlakshm the issue has been root caused and we have a fix for it. It should make it to this repo soon.

Staphylo commented 1 year ago

Should be fixed by the following PRs:
master: https://github.com/sonic-net/sonic-buildimage/pull/15405
202205: https://github.com/sonic-net/sonic-buildimage/pull/15406
202211: https://github.com/sonic-net/sonic-buildimage/pull/15407
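
For readers unfamiliar with the failure mode mentioned in the email thread (circular references that are not cleaned up), here is a minimal, generic sketch of how a reference cycle keeps Python objects alive until the cycle collector runs. This is not the pcied code and not the actual fix; the `Node` class and the `make_cycle` and `demo` helpers are hypothetical names made up for illustration. The thread does not say why the cycles were not being collected in pcied, so the sketch simply disables automatic collection to make the effect visible.

```python
import gc
import weakref


class Node:
    """Hypothetical object that holds a reference to a peer."""

    def __init__(self, name):
        self.name = name
        self.peer = None


def make_cycle():
    """Create two objects that reference each other and return a weak
    reference to one of them so we can observe when it is freed."""
    a = Node("a")
    b = Node("b")
    a.peer = b
    b.peer = a  # a -> b -> a: a reference cycle
    return weakref.ref(a)


def demo():
    """Return (alive_before, unreachable_count, alive_after)."""
    gc.disable()  # assumption for the demo: no automatic cycle collection
    try:
        gc.collect()  # start from a clean state
        ref = make_cycle()
        # Reference counting alone cannot free the pair: each object is
        # still referenced by the other, even though no name points at them.
        alive_before = ref() is not None
        # An explicit pass of the cycle detector finds and frees both.
        unreachable = gc.collect()
        alive_after = ref() is not None
        return alive_before, unreachable, alive_after
    finally:
        gc.enable()


if __name__ == "__main__":
    before, freed, after = demo()
    print("alive before gc.collect():", before)
    print("objects found unreachable:", freed)
    print("alive after gc.collect():", after)
```

If cycles like this are created repeatedly in a long-running daemon and never collected, resident memory grows steadily, which matches the kind of growth seen in the `top` output above. Breaking the cycle explicitly (for example, clearing `peer` when an object is no longer needed) or using `weakref` for back-references are common ways to avoid depending on the cycle collector.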