atsign-foundation / at_server

The software implementation of Atsign's core technology
https://docs.atsign.com
BSD 3-Clause "New" or "Revised" License

Memory leak when receiving notifications from another atSign #856

Closed: cconstab closed this issue 2 years ago

cconstab commented 2 years ago

Describe the bug When receiving a notification from another atSign, the memory footprint of the secondary increases and never goes down. Eventually the docker container's memory constraints are hit, the secondary is restarted, and the process repeats.

To Reproduce Steps to reproduce the behavior:

  1. First, constantly send notifications from one atSign to another (in this case, one update per second; a rough sketch of such a sender loop is at the end of this comment)
  2. Then use docker stats to watch the memory footprint of the docker container
  3. Then wait for it to crash and report out of memory

Expected behavior Memory to be used and then returned to the heap.

Screenshots

screen capture of leak on receiving atSign @ visual61

screen capture of sending atSign @ kryz_9850


Additional context CPU also seems much higher than expected and is pinned at 25% (the max for a container). I increased the memory limit to 1000M for @ visual61 to prove the point.
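
For reference, a rough sketch of the kind of sender loop described in step 1 (illustrative only, not the actual test code; it assumes a NotificationService obtained from an already-onboarded AtClient):

import 'dart:async';

import 'package:at_client/at_client.dart';

// Illustrative sketch: send one text notification per second to another atSign.
// `notificationService` is assumed to come from an already-authenticated AtClient.
void sendUpdatesForever(NotificationService notificationService, String toAtsign) {
  var counter = 0;
  Timer.periodic(Duration(seconds: 1), (_) async {
    final result = await notificationService
        .notify(NotificationParams.forText('update ${counter++}', toAtsign));
    print(result);
  });
}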

cconstab commented 2 years ago

@gkc I will defer to you on the P1-ness of this issue.

sitaram-kalluri commented 2 years ago

Tested with trunk branch code (ran secondaries locally from source) and observed that garbage collection is being triggered and memory is released.

Further, I have looked at some articles which say there may be a possibility of memory leaks when running processes in Docker containers (which may or may not be the case for us). Links to the articles:

  1. https://codefresh.io/blog/docker-memory-usage/
  2. https://stackoverflow.com/questions/27628276/docker-daemon-memory-leak-due-to-logs-from-long-running-process

Test 1: Ran a simple server socket with a hive box initialized:

memory_of_hive_with_simple_socket

Test 2: Memory footprint of a new secondary server (only signing private keys and CRAM secret key): memory_of_new_secondary

Test 3: Pumping notifications; the receiving atSign's memory is being garbage collected - video recording of the secondary (the atSign receiving the notifications) running locally with GC triggered:

https://user-images.githubusercontent.com/16645698/185161628-f49dfa5a-c433-4ca5-8c81-ababbbcbd792.mp4

VJag commented 2 years ago

Document describing ways to delete notifications: https://docs.google.com/document/d/1I0kWMWWTfTS89f9I2fCZeu42JgH2wtngyd-L7gQJMEA/edit?usp=sharing

sitaram-kalluri commented 2 years ago

Identified areas in the code where memory consumption is high. Optimised the code to reduce memory usage and pushed the changes to the fix_memory_issues branch.

More on our investigation is documented in the following doc: https://docs.google.com/document/d/1JiGDECUmIwrPb5wO3L-3UY2ccQuis-NVIx_RbviB33Q/edit?usp=sharing

sitaram-kalluri commented 2 years ago

@cconstab: The following docker image, built from the fix_memory_issues branch, is available for testing: atsigncompany/secondary:memtest

cconstab commented 2 years ago

Wow, this looks very good already. The test has been running for 30 mins so far; I will leave it running overnight.

CONTAINER ID   NAME                                                   CPU %     MEM USAGE / LIMIT    MEM %     NET I/O           BLOCK I/O         PIDS
2af7f28f90b3   atgps-receiver_secondary.1.r7oeuaaxmxwiul13am4v5rd8m   0.88%     58.4MiB / 3.84GiB    1.49%     8.53MB / 4.92MB   8.19kB / 1.54MB   9
047f8c38399e   atgps01_secondary.1.k18tepubin767he45rlnehyvi          1.31%     59.26MiB / 3.84GiB   1.51%     2.3MB / 2.15MB    28.7kB / 7.96MB   7
d83b136e8afd   atgps02_secondary.1.jt6tupqf19eskd590dzm97elk          0.94%     58.54MiB / 3.84GiB   1.49%     2.23MB / 2.08MB   28.7kB / 7.97MB   7

And after a few hours the peak memory usage I have seen is 150M; that was also when CPU peaked (tidy-up job?)

CONTAINER ID   NAME                                                   CPU %     MEM USAGE / LIMIT    MEM %     NET I/O          BLOCK I/O        PIDS
2af7f28f90b3   atgps-receiver_secondary.1.r7oeuaaxmxwiul13am4v5rd8m   20.58%    90.57MiB / 3.84GiB   2.30%     120MB / 66.5MB   254kB / 21.4MB   12
047f8c38399e   atgps01_secondary.1.k18tepubin767he45rlnehyvi          1.19%     66.38MiB / 3.84GiB   1.69%     31MB / 29MB      1.44MB / 119MB   10
d83b136e8afd   atgps02_secondary.1.jt6tupqf19eskd590dzm97elk          1.15%     87.02MiB / 3.84GiB   2.21%     31MB / 28.9MB    1.35MB / 110MB   9

CONTAINER ID   NAME                                                   CPU %     MEM USAGE / LIMIT    MEM %     NET I/O          BLOCK I/O        PIDS
2af7f28f90b3   atgps-receiver_secondary.1.r7oeuaaxmxwiul13am4v5rd8m   1.64%     90.99MiB / 3.84GiB   2.31%     120MB / 66.5MB   254kB / 21.4MB   12
047f8c38399e   atgps01_secondary.1.k18tepubin767he45rlnehyvi          1.26%     66.63MiB / 3.84GiB   1.69%     31MB / 29MB      1.44MB / 119MB   10
d83b136e8afd   atgps02_secondary.1.jt6tupqf19eskd590dzm97elk          1.32%     87.02MiB / 3.84GiB   2.21%     31MB / 28.9MB    1.35MB / 110MB   9

cconstab commented 2 years ago

Using the latest branch, a few observations so far:

At startup of the secondary, huge amounts of RAM and CPU usage (up to 500MB and 127% CPU for maybe a minute or two)

After things settle down I still see spikes of memory to about 300MB, and over time that trims back down to around 100MB

More details as the test runs

Huge improvements, but more to learn no doubt

cconstab commented 2 years ago

Still seeing significant spikes, but not significant overall memory usage. This is what I see after 12 hours of 2 Hz data on gps01 and gps02 and 4 Hz on the receiver.

CONTAINER ID   NAME                                                   CPU %     MEM USAGE / LIMIT    MEM %     NET I/O           BLOCK I/O        PIDS
973515ac1428   atgps01_secondary.1.q6fjd6as39qu03j511d19a653          1.82%     133.9MiB / 3.84GiB   3.40%     70.8MB / 65.9MB   9.56MB / 765MB   11
8115efd2a1f0   atgps02_secondary.1.8g7art9jebxnns640mv65k7jq          1.08%     142.5MiB / 3.84GiB   3.62%     70.8MB / 66MB     10.3MB / 765MB   11
60573ec81975   atgps-receiver_secondary.1.3076e62yowebe8d8kb5nepj8h   1.21%     217.9MiB / 3.84GiB   5.54%     282MB / 155MB     8.84MB / 497MB   13

If we can do more, great, but this is excellent progress in just this week. The sender, receiver, and secondaries have now been up for 12-plus hours.

cconstab commented 2 years ago

Another important point is to make sure the housekeeping jobs are NOT using a cron mechanism, as that would mean they all hit the underlying hardware at the same time. Instead I think we should use a delta from startup; this way the load is spread across time more randomly. @VJag
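
Something along these lines (a rough sketch with a hypothetical helper, not the actual server scheduling code):

import 'dart:async';
import 'dart:math';

// Rough sketch: instead of a fixed cron-style schedule, start each housekeeping
// job after a random offset measured from process startup, then repeat on its
// normal interval, so secondaries sharing a host don't all run housekeeping at
// the same moment.
void scheduleHousekeeping(Duration interval, void Function() job) {
  final startupDelay = Duration(seconds: Random().nextInt(interval.inSeconds));
  Timer(startupDelay, () {
    job();
    Timer.periodic(interval, (_) => job());
  });
}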

I am also documenting my GPS test rig and capturing the memory/CPU logs of each secondary so I can graph it out for you all.

VJag commented 2 years ago

@cconstab I agree. Will try adding the delta-on-startup approach to the branch you are testing.

cconstab commented 2 years ago

After 21 hours of uptime:

60573ec81975   atgps-receiver_secondary.1.3076e62yowebe8d8kb5nepj8h   1.23%     137.1MiB / 3.84GiB   3.49%     485MB / 268MB   8.91MB / 1.05GB   13
973515ac1428   atgps01_secondary.1.q6fjd6as39qu03j511d19a653          0.85%     79.14MiB / 3.84GiB   2.01%     122MB / 114MB   10.3MB / 1.42GB   11
8115efd2a1f0   atgps02_secondary.1.8g7art9jebxnns640mv65k7jq          1.36%     104.3MiB / 3.84GiB   2.65%     122MB / 114MB   10.8MB / 1.44GB   11

cconstab commented 2 years ago

Testing the latest version over the weekend. New stats available by Monday, India time.

cconstab commented 2 years ago

After running all weekend on the latest build: wow!

CONTAINER ID   NAME                                                   CPU %     MEM USAGE / LIMIT    MEM %     NET I/O          BLOCK I/O         PIDS
d3559a0f815e   atgps-receiver_secondary.1.24afejlvqhz2rhszueol34jm3   0.56%     83.97MiB / 3.84GiB   2.14%     1.09GB / 650MB   1.08MB / 1.46GB   9
d9c941a91402   atgps01_secondary.1.zszxgut0dc4p3o25ef7h3k8d9          1.16%     87.61MiB / 3.84GiB   2.23%     318MB / 296MB    2.59MB / 2.86GB   7
4834b40fe5e2   atgps02_secondary.1.znmh6noiintd09oq1r1xsl53q          1.22%     81.84MiB / 3.84GiB   2.08%     178MB / 167MB    1.7MB / 1.58GB    7

VJag commented 2 years ago

@cconstab I expect peak CPU and memory usage to be less than the previous run. Do you have any observations on the peak usage?

cconstab commented 2 years ago

Tried the new branch that aims to drop out-of-date notifications, using:


dependency_overrides:
  at_client:
    git:
      url: https://github.com/atsign-foundation/at_client_sdk.git
      ref: enhance_notify_text
      path: at_client

Unfortunately this results in onboarding failing (I tested with a new atSign as well).

PS C:\Users\colin\GitHub\@company\at_gps_demo>  dart bin/at_gps_receiver.dart -r "@atgps_receiver" -n "car1" -m "192.168.1.149" -t "v1/devices/me/telemetry" -u "TJw0tp2F97D0adWlEEQC" -f "@atgps01" -v
INFO|2022-08-30 14:39:06.800777|HiveBase|commit_log_68263172de4d62cf28782b8705e155afb7e34f7512c2729ede0a8b9873d4b614 initialized successfully 

INFO|2022-08-30 14:39:06.822776|HiveBase|68263172de4d62cf28782b8705e155afb7e34f7512c2729ede0a8b9873d4b614 initialized successfully 

INFO|2022-08-30 14:39:06.871780|Monitor|starting monitor for @atgps_receiver with lastNotificationTime: 1661874419405 

INFO|2022-08-30 14:39:07.547421|Monitor|monitor started for @atgps_receiver with last notification time: 1661874419405 

INFO|2022-08-30 14:39:07.555941|AtLookup|Creating new connection

INFO|2022-08-30 14:39:07.603941|AtLookup|Creating new connection 

SEVERE|2022-08-30 14:39:07.687941|GPS reciever |Uncaught error: Exception: Failed connecting to @atgps_receiver. error:AT0015-key not found : private:_bc574e10-c9db-49c6-95fe-217d75d394c8@atgps_receiver does not exist in keystore 

SEVERE|2022-08-30 14:39:07.688941|GPS reciever |#0      AtLookupImpl.authenticate (package:at_lookup/src/at_lookup_impl.dart:405:11)
<asynchronous suspension>
#1      AtOnboardingServiceImpl.authenticate (package:at_onboarding_cli/src/at_onboarding_service_impl.dart:204:11)
<asynchronous suspension>
#2      gpsMqtt (file:///C:/Users/colin/GitHub/@company/at_gps_demo/bin/at_gps_receiver.dart:114:3)
<asynchronous suspension>
#3      main.<anonymous closure> (file:///C:/Users/colin/GitHub/@company/at_gps_demo/bin/at_gps_receiver.dart:26:5)
<asynchronous suspension>

@kalluriramkumar ideas welcome!

My code is here

https://github.com/atsign-foundation/at_gps_demo

cconstab commented 2 years ago

The smoking gun?

FINER|2022-08-30 16:13:03.230429|AtLookup|SENDING: from:@atgps_receiver:clientConfig:{"version":"3.0.36"}

Seems I need a secondary that can cope with the new from: verb constructs?

As the current secondary under test is throwing

FINER|2022-08-30 16:13:13.534696|OutboundMessageListener|RECEIVED error:AT0011-Internal server exception : invalid @sign: Cannot contain !*'``();:&=+$,/?#[]{} characters

@kalluriramkumar @VJag

cconstab commented 2 years ago

Temp workaround: created a branch and removed the client config string

dependency_overrides:
  at_client:
    git:
      url: https://github.com/atsign-foundation/at_client_sdk.git
      ref: enhance_notify_text_colin
      path: at_client

sitaram-kalluri commented 2 years ago

> The smoking gun?
>
> FINER|2022-08-30 16:13:03.230429|AtLookup|SENDING: from:@atgps_receiver:clientConfig:{"version":"3.0.36"}
>
> Seems I need a secondary that can cope with the new from: verb constructs?
>
> As the current secondary under test is throwing
>
> FINER|2022-08-30 16:13:13.534696|OutboundMessageListener|RECEIVED error:AT0011-Internal server exception : invalid @sign: Cannot contain !*'``();:&=+$,/?#[]{} characters
>
> @kalluriramkumar @VJag

@cconstab: This is because I checked out a branch which has the changes for sending client config to the server, and since the corresponding server-side changes are not yet deployed to the prod secondaries, you are seeing the above error. I removed the line which sends the client config to the server (the same fix you made in your local branch) and pushed it to enhance_notify_text.

cconstab commented 2 years ago

Wow... Having messages sent with `strategyEnum: StrategyEnum.latest` set on the notification params... E.g.

void sendGps(String fromAtsign, String toAtsign, String nameSpace, String deviceName,
    NotificationService notificationService, AtSignLogger logger, String input) async {
  if (input != "") {
    try {
      await notificationService.notify(
          NotificationParams.forText(input, toAtsign,
              shouldEncrypt: true,
              notifier: deviceName,
              strategyEnum: StrategyEnum.latest),
          checkForFinalDeliveryStatus: false,
          waitForFinalDeliveryStatus: false,
          onSuccess: (notification) {
            logger.info('SUCCESS:$notification');
          },
          onError: (notification) {
            logger.info('ERROR:$notification');
          },
          onSentToSecondary: (notification) {
            logger.info('SENT:$notification');
          });
    } catch (e) {
      logger.severe(e.toString());
    }
  }
}

Results in a very low memory footprint on the secondary so far (4 hours of uptime)

CONTAINER ID   NAME                                                   CPU %     MEM USAGE / LIMIT    MEM %     NET I/O           BLOCK I/O        PIDS
4ebe0e6d4cd0   atgps-receiver_secondary.1.akd5d6iwubicgea3x28l9ccpr   0.91%     65.54MiB / 3.84GiB   1.67%     67.2MB / 40.3MB   0B / 92MB        10
51e3de29cecb   atgps01_secondary.1.931t6k6j49q3kdj8ftoxjrs63          0.54%     33.1MiB / 3.84GiB    0.84%     14.1MB / 14.3MB   16.4kB / 144MB   8
ae1b1bf2b14e   atgps02_secondary.1.ozj1kzbmv4gir80jueywz31ub          0.64%     34.02MiB / 3.84GiB   0.87%     13.7MB / 13.9MB   4.1kB / 141MB    8

Will monitor for a few more days

gkc commented 2 years ago

@cconstab is this particular ticket ready to be closed?

gkc commented 2 years ago

Will close once this has been verified in Canary and released to production.

gkc commented 2 years ago

This has been stress-tested pretty comprehensively.