@gkc I will defer to you on the P1-ness of this issue..
Tested with the trunk branch code (ran secondaries locally from source) and observed that garbage collection is being triggered and memory is released.
Further, I have looked at some articles which say there may be a possibility of memory leaks when running processes in Docker containers (which may or may not apply in our case). Below are links to the articles:
Test 1: Ran a simple server socket with a Hive box initialized (a sketch of this kind of harness follows this list):
Test 2: Memory footprint of a new secondary server (only the signing private keys and CRAM secret key):
Test 3: Pumped notifications and confirmed the receiver atSign's memory is being garbage collected. Video recording of the secondary (the atSign receiving the notifications) running locally with GC triggered:
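For reference, a minimal sketch of the kind of harness Test 1 describes, assuming the `hive` package and `dart:io`; the port, box name, and echo behaviour are illustrative choices, not the actual test code:

```dart
// Illustrative only: a bare ServerSocket plus one Hive box, so memory
// behaviour can be watched with nothing else running in the isolate.
import 'dart:io';

import 'package:hive/hive.dart';

Future<void> main() async {
  Hive.init('storage'); // directory where the box files are written
  final box = await Hive.openBox('test');

  final server = await ServerSocket.bind(InternetAddress.anyIPv4, 6464);
  server.listen((Socket socket) {
    socket.listen((data) async {
      // Store the payload to exercise the box, then echo it back.
      await box.add(String.fromCharCodes(data));
      socket.add(data);
    });
  });
}
```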
Document describing ways to delete notifications: https://docs.google.com/document/d/1I0kWMWWTfTS89f9I2fCZeu42JgH2wtngyd-L7gQJMEA/edit?usp=sharing
Identified areas in the code where memory consumption is high. Optimised the code to reduce memory usage and pushed the changes to the fix_memory_issues branch.
More on our investigation is documented in the following docs: https://docs.google.com/document/d/1JiGDECUmIwrPb5wO3L-3UY2ccQuis-NVIx_RbviB33Q/edit?usp=sharing
@cconstab : Following is the Docker image built from the fix_memory_issues branch for testing: atsigncompany/secondary:memtest
Wow, this looks very good already.. The test has been running for 30 mins so far.. I will leave it running overnight..
```
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
2af7f28f90b3 atgps-receiver_secondary.1.r7oeuaaxmxwiul13am4v5rd8m 0.88% 58.4MiB / 3.84GiB 1.49% 8.53MB / 4.92MB 8.19kB / 1.54MB 9
047f8c38399e atgps01_secondary.1.k18tepubin767he45rlnehyvi 1.31% 59.26MiB / 3.84GiB 1.51% 2.3MB / 2.15MB 28.7kB / 7.96MB 7
d83b136e8afd atgps02_secondary.1.jt6tupqf19eskd590dzm97elk 0.94% 58.54MiB / 3.84GiB 1.49% 2.23MB / 2.08MB 28.7kB / 7.97MB 7
```
And after a few hours the peak mem usage I have seen is 150M, which was also when CPU peaked (tidy-up job??)
```
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
2af7f28f90b3 atgps-receiver_secondary.1.r7oeuaaxmxwiul13am4v5rd8m 20.58% 90.57MiB / 3.84GiB 2.30% 120MB / 66.5MB 254kB / 21.4MB 12
047f8c38399e atgps01_secondary.1.k18tepubin767he45rlnehyvi 1.19% 66.38MiB / 3.84GiB 1.69% 31MB / 29MB 1.44MB / 119MB 10
d83b136e8afd atgps02_secondary.1.jt6tupqf19eskd590dzm97elk 1.15% 87.02MiB / 3.84GiB 2.21% 31MB / 28.9MB 1.35MB / 110MB 9
```
```
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
2af7f28f90b3 atgps-receiver_secondary.1.r7oeuaaxmxwiul13am4v5rd8m 1.64% 90.99MiB / 3.84GiB 2.31% 120MB / 66.5MB 254kB / 21.4MB 12
047f8c38399e atgps01_secondary.1.k18tepubin767he45rlnehyvi 1.26% 66.63MiB / 3.84GiB 1.69% 31MB / 29MB 1.44MB / 119MB 10
d83b136e8afd atgps02_secondary.1.jt6tupqf19eskd590dzm97elk 1.32% 87.02MiB / 3.84GiB 2.21% 31MB / 28.9MB 1.35MB / 110MB 9
```
Using the latest branch, a few observations so far:
- At startup of the secondary there is huge RAM and CPU usage (up to 500MB and 127% CPU for maybe a minute or two)
- After things settle down I still see spikes of memory to about 300MB, and over time that trims back down to around 100MB
- More details as the test runs
- Huge improvements, but more to learn no doubt
Still seeing significant spikes but not significant overall memory usage.. This is what I see after 12 hours of 2 hertz data on gps01 and gps02 and 4 hertz on the receiver.
```
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
973515ac1428 atgps01_secondary.1.q6fjd6as39qu03j511d19a653 1.82% 133.9MiB / 3.84GiB 3.40% 70.8MB / 65.9MB 9.56MB / 765MB 11
8115efd2a1f0 atgps02_secondary.1.8g7art9jebxnns640mv65k7jq 1.08% 142.5MiB / 3.84GiB 3.62% 70.8MB / 66MB 10.3MB / 765MB 11
60573ec81975 atgps-receiver_secondary.1.3076e62yowebe8d8kb5nepj8h 1.21% 217.9MiB / 3.84GiB 5.54% 282MB / 155MB 8.84MB / 497MB 13
```
If we can do more, great, but this is excellent progress in just this week.. The sender, receiver, and secondaries have been up now for 12-plus hours..
Another important point is to make sure the housekeeping jobs are NOT using a cron mechanism, as that would mean they all hit the underlying hardware at the same time. Instead I think we should use a delta from startup; this way the load is spread across time more randomly. @VJag
I am also documenting my GPS test rig and capturing the memory/CPU logs of each secondary so I can graph it out for you all.
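In case it is useful, here is a small sketch of how that capture could be scripted (an assumption about the approach, not the actual rig code); it shells out to `docker stats --no-stream` with a Go-template format and appends CSV rows:

```dart
// Sketch only: poll `docker stats` once a minute and append CSV rows so
// the numbers can be graphed later. File name and interval are arbitrary.
import 'dart:async';
import 'dart:io';

Future<void> main() async {
  final out = File('secondary_stats.csv');
  Timer.periodic(const Duration(minutes: 1), (_) async {
    final result = await Process.run('docker', [
      'stats',
      '--no-stream',
      '--format',
      '{{.Name}},{{.CPUPerc}},{{.MemUsage}}',
    ]);
    final timestamp = DateTime.now().toIso8601String();
    final rows = (result.stdout as String)
        .trim()
        .split('\n')
        .where((line) => line.isNotEmpty)
        .map((line) => '$timestamp,$line');
    if (rows.isNotEmpty) {
      await out.writeAsString('${rows.join('\n')}\n', mode: FileMode.append);
    }
  });
}
```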
@cconstab I agree. Will try adding the delta on startup thing to the branch you are testing.
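For reference, a minimal sketch of the "delta from startup" idea in Dart (illustrative only; the function name, interval, and placeholder job are assumptions, not the actual server code). Each process picks a random offset from its own startup time before the periodic housekeeping begins, so jobs drift apart across containers instead of all firing on the same cron schedule:

```dart
import 'dart:async';
import 'dart:math';

/// Illustrative only: delay the first run of a periodic housekeeping job
/// by a random offset from process startup, then repeat at the normal
/// interval, so co-located containers do not all run it at the same time.
void scheduleWithStartupDelta(Duration interval, void Function() job) {
  final initialDelay = Duration(seconds: Random().nextInt(interval.inSeconds));
  Timer(initialDelay, () {
    job(); // first run, offset from startup
    Timer.periodic(interval, (_) => job());
  });
}

void main() {
  scheduleWithStartupDelta(const Duration(minutes: 30), () {
    // Placeholder for the real housekeeping work (e.g. commit log compaction).
    print('housekeeping run at ${DateTime.now()}');
  });
}
```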
After 21 hours of uptime:
```
60573ec81975 atgps-receiver_secondary.1.3076e62yowebe8d8kb5nepj8h 1.23% 137.1MiB / 3.84GiB 3.49% 485MB / 268MB 8.91MB / 1.05GB 13
973515ac1428 atgps01_secondary.1.q6fjd6as39qu03j511d19a653 0.85% 79.14MiB / 3.84GiB 2.01% 122MB / 114MB 10.3MB / 1.42GB 11
8115efd2a1f0 atgps02_secondary.1.8g7art9jebxnns640mv65k7jq 1.36% 104.3MiB / 3.84GiB 2.65% 122MB / 114MB 10.8MB / 1.44GB 11
```
Testing the latest version over the weekend.. New stats available by Monday, India time.
After running all weekend on the latest build, wow!
```
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
d3559a0f815e atgps-receiver_secondary.1.24afejlvqhz2rhszueol34jm3 0.56% 83.97MiB / 3.84GiB 2.14% 1.09GB / 650MB 1.08MB / 1.46GB 9
d9c941a91402 atgps01_secondary.1.zszxgut0dc4p3o25ef7h3k8d9 1.16% 87.61MiB / 3.84GiB 2.23% 318MB / 296MB 2.59MB / 2.86GB 7
4834b40fe5e2 atgps02_secondary.1.znmh6noiintd09oq1r1xsl53q 1.22% 81.84MiB / 3.84GiB 2.08% 178MB / 167MB 1.7MB / 1.58GB 7
```
@cconstab I expect peak CPU and memory usage to be less than the previous run. Do you have any observations on the peak usage?
Tried the new branch that aims to drop out-of-date notifications, using:
```yaml
dependency_overrides:
  at_client:
    git:
      url: https://github.com/atsign-foundation/at_client_sdk.git
      ref: enhance_notify_text
      path: at_client
```
Unfortunately this results in the onboarding failing (I tested with a new atSign as well):
```
PS C:\Users\colin\GitHub\@company\at_gps_demo> dart bin/at_gps_receiver.dart -r "@atgps_receiver" -n "car1" -m "192.168.1.149" -t "v1/devices/me/telemetry" -u "TJw0tp2F97D0adWlEEQC" -f "@atgps01" -v
INFO|2022-08-30 14:39:06.800777|HiveBase|commit_log_68263172de4d62cf28782b8705e155afb7e34f7512c2729ede0a8b9873d4b614 initialized successfully
INFO|2022-08-30 14:39:06.822776|HiveBase|68263172de4d62cf28782b8705e155afb7e34f7512c2729ede0a8b9873d4b614 initialized successfully
INFO|2022-08-30 14:39:06.871780|Monitor|starting monitor for @atgps_receiver with lastNotificationTime: 1661874419405
INFO|2022-08-30 14:39:07.547421|Monitor|monitor started for @atgps_receiver with last notification time: 1661874419405
INFO|2022-08-30 14:39:07.555941|AtLookup|Creating new connection
INFO|2022-08-30 14:39:07.603941|AtLookup|Creating new connection
SEVERE|2022-08-30 14:39:07.687941|GPS reciever |Uncaught error: Exception: Failed connecting to @atgps_receiver. error:AT0015-key not found : private:_bc574e10-c9db-49c6-95fe-217d75d394c8@atgps_receiver does not exist in keystore
SEVERE|2022-08-30 14:39:07.688941|GPS reciever |#0 AtLookupImpl.authenticate (package:at_lookup/src/at_lookup_impl.dart:405:11)
<asynchronous suspension>
#1 AtOnboardingServiceImpl.authenticate (package:at_onboarding_cli/src/at_onboarding_service_impl.dart:204:11)
<asynchronous suspension>
#2 gpsMqtt (file:///C:/Users/colin/GitHub/@company/at_gps_demo/bin/at_gps_receiver.dart:114:3)
<asynchronous suspension>
#3 main.<anonymous closure> (file:///C:/Users/colin/GitHub/@company/at_gps_demo/bin/at_gps_receiver.dart:26:5)
<asynchronous suspension>
```
@kalluriramkumar ideas welcome!
My code is here
The smoking gun?
```
FINER|2022-08-30 16:13:03.230429|AtLookup|SENDING: from:@atgps_receiver:clientConfig:{"version":"3.0.36"}
```
It seems I need a secondary that can cope with the new from: verb construct, as the current secondary under test is throwing:
```
FINER|2022-08-30 16:13:13.534696|OutboundMessageListener|RECEIVED error:AT0011-Internal server exception : invalid @sign: Cannot contain !*'`();:&=+$,/?#[]{} characters
```
@kalluriramkumar @VJag
Temp workaround: created a branch and removed the client config string.
```yaml
dependency_overrides:
  at_client:
    git:
      url: https://github.com/atsign-foundation/at_client_sdk.git
      ref: enhance_notify_text_colin
      path: at_client
```
@cconstab : This is because I checked out a branch which has the changes related to sending client config to the server, and since the corresponding changes on the server are not yet deployed to the prod secondaries, you are experiencing the above-mentioned issue. I removed the line which sends the client config to the server (the same fix that you made in your local branch) and pushed it to enhance_notify_text.
Wow... Having messages that set `strategyEnum: StrategyEnum.latest` on the notification params, e.g.:
```dart
void sendGps(String fromAtsign, String toAtsign, String nameSpace,
    String deviceName, NotificationService notificationService,
    AtSignLogger logger, String input) async {
  if (!(input == "")) {
    try {
      await notificationService.notify(
          NotificationParams.forText(input, toAtsign,
              shouldEncrypt: true,
              notifier: deviceName,
              strategyEnum: StrategyEnum.latest),
          checkForFinalDeliveryStatus: false,
          onSuccess: (notification) {
        logger.info('SUCCESS:$notification');
      }, onError: (notification) {
        logger.info('ERROR:$notification');
      }, onSentToSecondary: (notification) {
        logger.info('SENT:$notification');
      }, waitForFinalDeliveryStatus: false);
    } catch (e) {
      logger.severe(e.toString());
    }
  }
}
```
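A note on why this helps (my reading of the branch's intent as described above, not verified against the SDK source): with `strategyEnum: StrategyEnum.latest` and a `notifier` set, only the most recent notification for that notifier should be retained, so out-of-date GPS fixes get replaced rather than queued, which fits the lower memory footprint reported below.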
Results in a very low memory footprint on the secondary so far (4 hours up so far):
```
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
4ebe0e6d4cd0 atgps-receiver_secondary.1.akd5d6iwubicgea3x28l9ccpr 0.91% 65.54MiB / 3.84GiB 1.67% 67.2MB / 40.3MB 0B / 92MB 10
51e3de29cecb atgps01_secondary.1.931t6k6j49q3kdj8ftoxjrs63 0.54% 33.1MiB / 3.84GiB 0.84% 14.1MB / 14.3MB 16.4kB / 144MB 8
ae1b1bf2b14e atgps02_secondary.1.ozj1kzbmv4gir80jueywz31ub 0.64% 34.02MiB / 3.84GiB 0.87% 13.7MB / 13.9MB 4.1kB / 141MB 8
```
Will monitor for a few more days.
@cconstab is this particular ticket ready to be closed?
Will close once has been verified in Canary and released to production.
This has been stress-tested pretty comprehensively.
**Describe the bug**
When receiving a notification from another atSign, the memory footprint of the secondary increases and never goes down. Eventually the docker container constraints are hit, the secondary is restarted, and the process continues.
**To Reproduce**
Steps to reproduce the behavior:
**Expected behavior**
Memory to be used and then be returned to the heap.
**Screenshots**
Screen capture of the leak on the receiving atSign @visual61:
Screen capture of the sending atSign @kryz_9850:
Were you using an @application when the bug was found?
**Additional context**
CPU also seems much higher than expected and is pinned at 25% (the max for a container). I increased the memory limit to 1000M for @visual61 to prove the point.