ctm / mb2-doc

Mb2, poker software
https://devctm.com

Docker memory creeping up #1358

Closed ctm closed 5 months ago

ctm commented 6 months ago

Figure out why the MEM USAGE column is much higher on craftpoker.com than it used to be.

I've noticed this for a while and absolutely should have written it down when it started happening, but the value fluctuates quite a bit, and by the time I was really certain something had changed, it had been going on too long for me to pinpoint when.

Every time I deploy, I run a sudo docker stats --no-stream, something I've done ever since we had trouble running out of memory (#1059) back in September of 2022. That issue has some of the exploration of leaks I did back then; I never did find the cause of the memory growth. Here are all the recent values:

Wed Mar 20 19:01:06 UTC 2024
CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT     MEM %     NET I/O          BLOCK I/O     PIDS
0b2395535f25   mb2       0.07%     38.23MiB / 928.5MiB   4.12%     913kB / 3.88MB   29.8MB / 0B   8

Thu Mar 21 01:42:31 UTC 2024
CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT    MEM %     NET I/O         BLOCK I/O     PIDS
48fb2bf6e543   mb2       0.13%     35.2MiB / 928.5MiB   3.79%     354kB / 5.8MB   28.4MB / 0B   8

Thu Mar 21 11:34:39 UTC 2024
CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O     PIDS
48fb2bf6e543   mb2       0.11%     75.58MiB / 928.5MiB   8.14%     17.9MB / 63.3MB   38.6MB / 0B   8

Fri Mar 22 11:16:37 UTC 2024
CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O     PIDS
52650ece9396   mb2       0.18%     93.73MiB / 928.5MiB   10.09%    13.5MB / 58.2MB   38.7MB / 0B   10

Sat Mar 23 11:47:44 UTC 2024
CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O     PIDS
b8e87623a1c7   mb2       0.04%     57.46MiB / 928.5MiB   6.19%     1.01MB / 7.99MB   34.2MB / 0B   8

Currently, the stats output just goes to an emacs buffer, so whenever I restart emacs, I lose that data. Before I started using various rust-analyzer-based emacs packages, emacs would stay up for weeks; now it stays up for days :-(.

I don't normally do a docker ps, so the uptime isn't in my logs, but I can change that. In the above, however, container 48fb2bf6e543 appears twice: first at 7:42 pm Mountain on Wednesday the 20th, right after a fresh deploy, and again at 5:34 am the following day. Over that span the growth is from 35 MiB to 75 MiB, and there were no non-demo events during that time (I haven't looked for demo ones), so presumably the growth is associated with connections probing the site for vulnerabilities.

FTR, I do have other logs that show a fair amount of info, including both which paths have been queried and which connections have failed due to strangeness. I don't think there has been any uptick in the latter, but an uptick in the former is fairly likely. If it comes to that, I can hack up websocket_tool and see if I can reproduce anything.

For now, this is a concern, but I haven't seen any signs of unlimited growth, just growth that differs from what we've seen in the past. I'm leaving this at high priority and will probably change some of what I log in the near future, but I probably won't do any in-depth analysis until I can devote a block of time to dig deep.

ctm commented 6 months ago

It probably makes sense for me to add peak_alloc and have it dump stats every five or ten minutes. That would be sufficient to see whether the docker memory being consumed corresponds to heap allocation and, if it does, whether the growth is slow or comes in bursts. After that, it might make sense to read How to investigate memory usage of your rust program and perhaps Measuring Memory Usage in Rust, but my guess is that peak_alloc, combined with hacking up websocket_tool, will be sufficient.

ctm commented 6 months ago

What follows shows substantial growth between yesterday's late-afternoon deployment (before the evening tournament had run) and this morning's deployment. Other than the evening game, no demo tournaments were run (I checked).

Sun Mar 24 22:54:08 UTC 2024
CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT     MEM %     NET I/O          BLOCK I/O     PIDS
d9cbc1a2235c   mb2       0.12%     33.32MiB / 928.5MiB   3.59%     384kB / 4.48MB   28.3MB / 0B   8
CONTAINER ID   IMAGE              COMMAND   CREATED              STATUS              PORTS                                                                                                                                     NAMES
d9cbc1a2235c   mb2:202403241648   "/mb2"    About a minute ago   Up About a minute

Mon Mar 25 11:31:58 UTC 2024
CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O     PIDS
d9cbc1a2235c   mb2       0.15%     104.3MiB / 928.5MiB   11.23%    20.5MB / 65.7MB   38.1MB / 0B   9
CONTAINER ID   IMAGE              COMMAND   CREATED        STATUS        PORTS                                                                                                                                     NAMES
d9cbc1a2235c   mb2:202403241648   "/mb2"    13 hours ago   Up 13 hours

So, in thirteen hours, mb2's docker memory grew from 33.32 MiB to 104.3 MiB. It doesn't appear that much happened during that time, as the following summary stats show. However, I didn't do the obvious thing of checking memory right before the tournament started and right after it ended.

2024-03-25T11:37:36.498Z INFO  [mb2] status_codes:
  200: 373
  304: 217
  101: 29
  405: 2
  400: 8
  308: 49
warnings:
  Error encountered while processing the incoming HTTP request: BadStart('.'): 8
errors:
  stream error: request parse error: invalid HTTP version specified: 1
  stream error: request parse error: invalid Header provided: 6
exceptions:
  segment started with invalid character: ('.'): 8

If the logs had peak_alloc stats interspersed, I'd have a much better idea of where to look next, especially since checking memory before and after the tournament would happen automatically. So, I'll probably add that today (perhaps as soon as I've looked at the db logs for hints as to what interfered with ODB's discard (#1362)).

ctm commented 6 months ago

Here's a rise from 55.74MiB to 66.86MiB in about two and a half hours:

Mon Mar 25 11:40:14 UTC 2024
CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT     MEM %     NET I/O          BLOCK I/O   PIDS
b1145a47fe46   mb2       0.14%     55.74MiB / 928.5MiB   6.00%     406kB / 7.23MB   28MB / 0B   8
CONTAINER ID   IMAGE              COMMAND   CREATED         STATUS         PORTS                                                                                                                                     NAMES
b1145a47fe46   mb2:202403250531   "/mb2"    2 minutes ago   Up 2 minutes

Mon Mar 25 14:09:03 UTC 2024
CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O   PIDS
b1145a47fe46   mb2       0.16%     66.86MiB / 928.5MiB   7.20%     2.84MB / 15.4MB   28MB / 0B   8
CONTAINER ID   IMAGE              COMMAND   CREATED       STATUS       PORTS                                                                                                                                     NAMES
b1145a47fe46   mb2:202403250531   "/mb2"    3 hours ago   Up 3 hours

And here is the log summary:

  405: 1
  101: 3
  308: 1
  400: 2
  200: 74
  304: 15
warnings:
  Error encountered while processing the incoming HTTP request: BadStart('.'): 2
errors:
  stream error: request parse error: invalid Header provided: 2
exceptions:
  segment started with invalid character: ('.'): 2

And I've checked the log; no tournaments were played during that time.

BTW, the amount of docker memory taken up on the initial launch appears to vary greatly, even with the same container. I just installed a new kernel (a regular security update) and so had to reboot. Here we were taking up very little memory (17.27 MiB) 26 minutes after the container was created, right before the reboot:

Mon Mar 25 14:40:27 UTC 2024
CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT     MEM %     NET I/O          BLOCK I/O     PIDS
3c51784cb39e   mb2       0.04%     17.27MiB / 928.5MiB   1.86%     876kB / 19.3MB   29.5MB / 0B   8
CONTAINER ID   IMAGE              COMMAND   CREATED          STATUS          PORTS                                                                                                                                     NAMES
3c51784cb39e   mb2:202403250810   "/mb2"    26 minutes ago   Up 26 minutes

but here, cranking up that same image takes 47.79 MiB after being up only 39 seconds:

Mon Mar 25 14:43:39 UTC 2024
CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT     MEM %     NET I/O         BLOCK I/O   PIDS
3c51784cb39e   mb2       0.13%     47.79MiB / 928.5MiB   5.15%     343kB / 4.4MB   40MB / 0B   8
CONTAINER ID   IMAGE              COMMAND   CREATED          STATUS          PORTS                                                                                                                                     NAMES
3c51784cb39e   mb2:202403250810   "/mb2"    29 minutes ago   Up 39 seconds

Once again, some peak_alloc log lines would give me useful information, so I'm going to add that now.

ctm commented 6 months ago

Thanks to actix-cron-example, adding the instrumentation was trivial:

diff --git a/mb2/src/main.rs b/mb2/src/main.rs
index 210f125f..c4be17b3 100644
--- a/mb2/src/main.rs
+++ b/mb2/src/main.rs
@@ -549,6 +549,35 @@ fn header_replace(
     }
 }

+#[cfg_attr(feature = "monitor-memory", global_allocator)]
+#[cfg(feature = "monitor-memory")]
+static PEAK_ALLOC: peak_alloc::PeakAlloc = peak_alloc::PeakAlloc;
+
+#[cfg(feature = "monitor-memory")]
+fn log_memory_use() {
+    log::info!(
+        "Using {} MB, Peak {} MB",
+        PEAK_ALLOC.current_usage_as_mb(),
+        PEAK_ALLOC.peak_usage_as_mb()
+    );
+}
+
+#[cfg(feature = "monitor-memory")]
+pub async fn monitor_memory() {
+    use {
+        chrono::Utc,
+        tokio_schedule::{every, Job},
+    };
+
+    every(5)
+        .minutes()
+        .in_timezone(&Utc)
+        .perform(|| async {
+            log_memory_use();
+        })
+        .await;
+}
+
 #[actix_web::main]
 async fn main() -> std::io::Result<()> {
     // Needed when doing benchmarking, ugh!
@@ -565,6 +594,14 @@ async fn main() -> std::io::Result<()> {
         .init()
         .unwrap();

+    #[cfg(feature = "monitor-memory")]
+    {
+        log_memory_use();
+        actix_rt::spawn(async move {
+            monitor_memory().await;
+        });
+    }
+
     let summary = HttpSummary::default();

     let layer = HttpSummarizeLayer::new(summary.clone());

It's been running for a while, but there's been no growth to speak of either in heap allocations or in the Docker memory usage:


ubuntu@ip-172-30-2-88:~$ sudo docker logs mb2 | grep Peak
messages: []
2024-03-25T15:40:11.762Z INFO  [mb2] Using 0.046097755 MB, Peak 0.05503559 MB
2024-03-25T15:45:00.001Z INFO  [mb2] Using 2.216959 MB, Peak 10.774069 MB
...
2024-03-25T18:20:00.000Z INFO  [mb2] Using 1.8624697 MB, Peak 10.774069 MB

ubuntu@ip-172-30-2-88:~$ date; sudo docker stats --no-stream ; sudo docker ps -a
Mon Mar 25 18:20:51 UTC 2024
CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O    PIDS
80d006f7f887   mb2       0.05%     16.68MiB / 928.5MiB   1.80%     3.49MB / 4.81MB   397kB / 0B   8
CONTAINER ID   IMAGE              COMMAND   CREATED       STATUS       PORTS                                                                                                                                     NAMES
80d006f7f887   mb2:202403250935   "/mb2"    3 hours ago   Up 3 hours

ctm commented 6 months ago

So, we've now gone from 16.68 MiB to 52.33 MiB, but our peak heap alloc is only 19 MB.

Mon Mar 25 19:14:57 UTC 2024
CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O    PIDS
80d006f7f887   mb2       0.09%     52.33MiB / 928.5MiB   5.64%     4.54MB / 11.6MB   401kB / 0B   8
CONTAINER ID   IMAGE              COMMAND   CREATED       STATUS       PORTS                                                                                                                                     NAMES
80d006f7f887   mb2:202403250935   "/mb2"    4 hours ago   Up 4 hours

...
2024-03-25T18:50:00.000Z INFO  [mb2] Using 1.8858166 MB, Peak 11.194394 MB
2024-03-25T18:53:08.100Z INFO  [actix_web::middleware::logger] C3D2AA5 0.187453 200 1342746 "GET /mb2-web.js HTTP/2.0" INM:- E:"4E6C0D5B"
2024-03-25T18:53:08.384Z INFO  [actix_web::middleware::logger] 1DF7F60 0.000394 304 0 "GET /spinner.css HTTP/2.0" INM:"835A7FF8" E:"835A7FF8"
2024-03-25T18:53:08.920Z INFO  [actix_web::middleware::logger] C3D2AA5 0.009824 200 12190 "GET /300.mb2-web.js HTTP/2.0" INM:"B7192F78" E:"D4669122"
2024-03-25T18:53:08.965Z INFO  [actix_web::middleware::logger] 1DF7F60 0.164745 200 1378567 "GET /mb2-web.wasm HTTP/2.0" INM:- E:"2EF9EF05"
2024-03-25T18:53:19.158Z INFO  [actix_web::middleware::logger] 20BFA4B 0.000186 200 0 "HEAD / HTTP/2.0" INM:- E:"8C7ACBDB"
2024-03-25T18:53:37.636Z INFO  [actix_web::middleware::logger] 1DF7F60 0.000725 200 968 "GET /favicon.ico HTTP/2.0" INM:- E:"8C7ACBDB"
2024-03-25T18:54:59.177Z INFO  [actix_web::middleware::logger] 85F6095 0.009848 200 12190 "GET /300.mb2-web.js HTTP/2.0" INM:"AE3AEFF4" E:"D4669122"
2024-03-25T18:54:59.344Z INFO  [actix_web::middleware::logger] DEB0EFA 0.177370 200 1378567 "GET /mb2-web.wasm HTTP/2.0" INM:- E:"2EF9EF05"
2024-03-25T18:55:00.001Z INFO  [mb2] Using 2.1165848 MB, Peak 19.494936 MB
...
I believe I've seen Docker MEM USAGE decrease, so I don't think it's a high-water mark, per se. My guess is that we're seeing fragmentation, so I'll look for a cheap way to get the process's resident size and add that to what I'm logging every five minutes.

I'll also want to look into how Cloudflare caches. It could be that I don't have it configured to use If-None-Match.

ctm commented 6 months ago

I've added memory-stats and am deploying now.
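
The actual patch isn't reproduced in this issue, but here's a minimal sketch of what the extended logger might look like, assuming the memory-stats crate's memory_stats() accessor (it reports physical and virtual memory in bytes) and building on the log_memory_use() from the earlier diff:

#[cfg(feature = "monitor-memory")]
fn log_memory_use() {
    // memory_stats() returns None on unsupported platforms; fall back to zeros.
    let (phys, virt) = memory_stats::memory_stats()
        .map(|s| (s.physical_mem, s.virtual_mem))
        .unwrap_or((0, 0));
    log::info!(
        "Using {} MB, Peak {} MB, phys {} MB, virt {} MB",
        PEAK_ALLOC.current_usage_as_mb(),
        PEAK_ALLOC.peak_usage_as_mb(),
        phys as f32 / 1_048_576.0,
        virt as f32 / 1_048_576.0,
    );
}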

ctm commented 6 months ago

Initially, the "Phys" value corresponded with Docker's "MEM USAGE":

 Mon Mar 25 20:53:08 UTC 2024
CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT     MEM %     NET I/O          BLOCK I/O     PIDS
6de5dbe04e73   mb2       0.04%     15.82MiB / 928.5MiB   1.70%     298kB / 1.55MB   8.19kB / 0B   8
CONTAINER ID   IMAGE              COMMAND   CREATED              STATUS          PORTS                                                                                                                                     NAMES
6de5dbe04e73   mb2:202403251446   "/mb2"    About a minute ago   Up 59 seconds 
ubuntu@ip-172-30-2-88:~$ sudo docker logs -f mb2
[]
2024-03-25T20:52:10.563Z INFO  [mb2] Using 0.046135902 MB, Peak 0.42109776 MB, phys 15.917969 MB, virt 34.46875 MB

But then Phys got much larger while MEM USAGE stayed the same:

2024-03-25T20:55:00.004Z INFO  [mb2] Using 1.9349155 MB, Peak 11.136893 MB, phys 36.566406 MB, virt 965.625 MB
  C-c C-c^C
ubuntu@ip-172-30-2-88:~$ rm *.tar ; sudo docker system prune -af && df . ; date; sudo docker stats --no-stream ; sudo docker ps -a
rm: cannot remove '*.tar': No such file or directory
Total reclaimed space: 0B
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/root       16069568 4224680  11828504  27% /
Mon Mar 25 20:55:06 UTC 2024
CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O     PIDS
6de5dbe04e73   mb2       0.10%     15.87MiB / 928.5MiB   1.71%     1.03MB / 1.98MB   8.19kB / 0B   8
CONTAINER ID   IMAGE              COMMAND   CREATED         STATUS         PORTS                                                                                                                                     NAMES
6de5dbe04e73   mb2:202403251446   "/mb2"    2 minutes ago   Up 2 minutes

So, I'll continue watching the numbers, but it's time for me to start reading up on Docker memory.

ctm commented 6 months ago

It appears that Docker is including filesystem caching in its MEM USAGE stat, but I'm a bit surprised that there is even that much to cache from the filesystem. Perhaps some aspect of our use of web sockets contributes to this stat. Regardless, since we haven't actually seen an OOM and we're still a couple of months away from WSOPS 2024, I'm not going to get too worried about what I'm seeing, now that I know that, in theory, the vast majority of the usage is reclaimable memory.
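
One way to check that theory from inside the container is to compare the cgroup's total memory charge against its page-cache counters. Here's a sketch, assuming cgroup v2 mounted at the usual /sys/fs/cgroup path (the file names differ under cgroup v1):

use std::fs;

fn main() -> std::io::Result<()> {
    // memory.current is the total charge the kernel attributes to the container,
    // which is what docker stats' MEM USAGE is derived from.
    let current: f64 = fs::read_to_string("/sys/fs/cgroup/memory.current")?
        .trim()
        .parse()
        .unwrap_or(0.0);
    println!("memory.current: {:.2} MiB", current / 1_048_576.0);

    // anon is heap/stack memory; file, inactive_file, and slab_reclaimable are
    // cache that the kernel can give back under pressure.
    let stat = fs::read_to_string("/sys/fs/cgroup/memory.stat")?;
    for line in stat.lines() {
        if let Some((key, value)) = line.split_once(' ') {
            if matches!(key, "anon" | "file" | "inactive_file" | "slab_reclaimable") {
                let bytes: f64 = value.parse().unwrap_or(0.0);
                println!("{key}: {:.2} MiB", bytes / 1_048_576.0);
            }
        }
    }
    Ok(())
}

If most of the gap between MEM USAGE and phys shows up under file/inactive_file, it's just page cache and nothing to lose sleep over.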

So, most likely, what changed is either the kernel or Docker or something in between.

IIRC, I haven't seen the value get much above 100 MiB and certainly not 150 MiB, which is roughly one sixth of our limit.

I should probably fire up websocket_tool sometime before stripping this issue of high priority (or even closing it), just to get several thousand connections running.

ctm commented 6 months ago

Oh, and next time the MEM USAGE gets high, I should try writing 1 to /proc/sys/vm/drop_caches and then try writing 2 to the same. Here's the relevant text from the kernel documentation:

drop_caches

Writing to this will cause the kernel to drop clean caches, as well as reclaimable slab objects like dentries and inodes. Once dropped, their memory becomes free.

To free pagecache:
    echo 1 > /proc/sys/vm/drop_caches
To free reclaimable slab objects (includes dentries and inodes):
    echo 2 > /proc/sys/vm/drop_caches
To free slab objects and pagecache:
    echo 3 > /proc/sys/vm/drop_caches

This is a non-destructive operation and will not free any dirty objects. To increase the number of objects freed by this operation, the user may run `sync' prior to writing to /proc/sys/vm/drop_caches. This will minimize the number of dirty objects on the system and create more candidates to be dropped.

ctm commented 6 months ago

Looks like fragmentation is contributing to the increase in MEM USAGE:

2024-03-25T21:55:00.004Z INFO  [mb2] Using 2.1441507 MB, Peak 11.136893 MB, phys 37.804688 MB, virt 965.625 MB
2024-03-25T21:56:27.260Z INFO  [actix_web::middleware::logger] 6A33B8D 0.000487 200 2705 "GET /favicon.ico HTTP/2.0" INM:- E:"8C7ACBDB"
2024-03-25T21:57:39.269Z INFO  [actix_web::middleware::logger] 7467921 0.000682 200 968 "GET / HTTP/2.0" INM:- E:"8C7ACBDB"
2024-03-25T21:57:39.664Z INFO  [actix_web::middleware::logger] 7467921 0.000178 304 0 "GET /spinner.css HTTP/2.0" INM:"835A7FF8" E:"835A7FF8"
2024-03-25T21:57:39.698Z INFO  [actix_web::middleware::logger] 7467921 0.000182 304 0 "GET /mb2-web.js HTTP/2.0" INM:"4E6C0D5B" E:"4E6C0D5B"
2024-03-25T21:57:39.742Z INFO  [actix_web::middleware::logger] 7467921 0.000180 304 0 "GET /300.mb2-web.js HTTP/2.0" INM:"D4669122" E:"D4669122"
2024-03-25T21:57:40.617Z INFO  [actix_web::middleware::logger] 7467921 0.000179 304 0 "GET /favicon.ico HTTP/2.0" INM:"8C7ACBDB" E:"8C7ACBDB"
2024-03-25T21:57:40.823Z INFO  [actix_web::middleware::logger] 7467921 0.381125 200 1378567 "GET /mb2-web.wasm HTTP/2.0" INM:- E:"2EF9EF05"
2024-03-25T21:57:41.640Z INFO  [actix_web::middleware::logger] 7467921 0.000182 304 0 "GET /gong.mp3 HTTP/2.0" INM:"E37DB13A" E:"E37DB13A"
2024-03-25T21:58:19.169Z INFO  [actix_web::middleware::logger] 74E6149 0.000163 200 0 "HEAD / HTTP/2.0" INM:- E:"8C7ACBDB"
2024-03-25T21:58:37.617Z INFO  [actix_web::middleware::logger] 7467921 56.102561 101 1704 "GET /ws/ HTTP/1.1" INM:- E:-
2024-03-25T22:00:00.004Z INFO  [mb2] Using 2.1866264 MB, Peak 11.533245 MB, phys 47.335938 MB, virt 965.625 MB

We go from 2.14 MB to 2.18 MB of live heap, and our peak goes from 11.13 MB to 11.53 MB, which isn't much either. In that same window, though, phys goes from 37 MB to 47 MB, and that difference is about what our peak was, which suggests that the memory we used in our first peak (or previous peaks) didn't get reused.
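
To make the fragmentation hypothesis concrete: if a burst of small allocations happens and only some of them are freed, the freed holes sit between live blocks and the allocator generally can't hand those pages back to the OS, so phys stays high even though live heap usage drops. Here's a standalone sketch of that effect (not mb2 code), reusing the same memory-stats crate; the exact numbers will depend on the allocator:

use memory_stats::memory_stats;

fn phys_mib() -> f64 {
    memory_stats().map_or(0.0, |s| s.physical_mem as f64 / 1_048_576.0)
}

fn main() {
    println!("start: {:.1} MiB", phys_mib());

    // Roughly 40 MB of 4 KiB allocations, written to so they count toward phys.
    let mut chunks: Vec<Vec<u8>> = (0..10_000).map(|_| vec![1u8; 4096]).collect();
    println!("after allocating: {:.1} MiB", phys_mib());

    // Free every other chunk. Live heap drops by half, but the holes are
    // interleaved with live allocations, so phys typically barely moves.
    let mut keep = false;
    chunks.retain(|_| {
        keep = !keep;
        keep
    });
    println!("after freeing half: {:.1} MiB ({} chunks still live)", phys_mib(), chunks.len());
}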

ctm commented 6 months ago

Here's another 10MB jump:

2024-03-25T22:10:00.048Z INFO  [mb2] Using 1.9900789 MB, Peak 11.533245 MB, phys 47.460938 MB, virt 965.625 MB
2024-03-25T22:10:39.716Z INFO  [actix_web::middleware::logger] 5119AFB 0.000413 200 2705 "GET / HTTP/1.0" INM:- E:"8C7ACBDB"
2024-03-25T22:13:19.212Z INFO  [actix_web::middleware::logger] 74E6149 0.000158 200 0 "HEAD / HTTP/2.0" INM:- E:"8C7ACBDB"
2024-03-25T22:14:33.439Z INFO  [actix_web::middleware::logger] 62F8CE7 0.000257 304 0 "GET /robots.txt HTTP/2.0" INM:"8C7ACBDB" E:"8C7ACBDB"
2024-03-25T22:14:33.580Z INFO  [actix_web::middleware::logger] BB0A35F 0.000637 200 968 "GET / HTTP/2.0" INM:- E:"8C7ACBDB"
2024-03-25T22:14:35.028Z INFO  [actix_web::middleware::logger] BB0A35F 0.171257 200 1342746 "GET /mb2-web.js HTTP/2.0" INM:- E:"4E6C0D5B"
2024-03-25T22:14:35.687Z INFO  [actix_web::middleware::logger] BB0A35F 0.151492 200 1378567 "GET /mb2-web.wasm HTTP/2.0" INM:- E:"2EF9EF05"
2024-03-25T22:15:00.004Z INFO  [mb2] Using 1.8618994 MB, Peak 11.533245 MB, phys 56.48047 MB, virt 965.625 MB

FWIW, at this point, virt has remained untouched since the first sampling on the five-minute mark. Eventually virt did creep up ever so slightly:

2024-03-25T23:05:00.005Z INFO  [mb2] Using 2.1657162 MB, Peak 11.666552 MB, phys 57.039063 MB, virt 965.6367 MB

Between now and this evening's tournament I may play a bit with mb2 running under docker locally and see if I can find out what causes the big jump during initialization:

2024-03-25T20:52:10.563Z INFO  [mb2] Using 0.046135902 MB, Peak 0.42109776 MB, phys 15.917969 MB, virt 34.46875 MB
2024-03-25T20:55:00.004Z INFO  [mb2] Using 1.9349155 MB, Peak 11.136893 MB, phys 36.566406 MB, virt 965.625 MB

ctm commented 6 months ago

FWIW, I can't write to /proc/sys/vm/drop_caches because it's mounted read-only. I can, however, use the -v option on the docker run command to mount a read-write version, so I've added that to bin/deploy and am deploying now.

ctm commented 6 months ago

/proc/sys/vm/drop_caches appears to be a dead end:

ubuntu@ip-172-30-2-88:~$ date; sudo docker stats --no-stream ; sudo docker ps -a
Tue Mar 26 11:46:30 UTC 2024
CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O     PIDS
2ce5325ea146   mb2       0.05%     77.55MiB / 928.5MiB   8.35%     10.8MB / 50.1MB   37.6MB / 0B   8
CONTAINER ID   IMAGE              COMMAND   CREATED        STATUS        PORTS                                                                                                                                     NAMES
2ce5325ea146   mb2:202403252002   "/mb2"    10 hours ago   Up 10 hours

ubuntu@ip-172-30-2-88:~$ sudo docker exec -it --privileged mb2 sh
# echo 1 >  /writable_proc/sys/vm/drop_caches

ubuntu@ip-172-30-2-88:~$ date; sudo docker stats --no-stream ; sudo docker ps -a
Tue Mar 26 11:46:48 UTC 2024
CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O     PIDS
2ce5325ea146   mb2       0.12%     77.47MiB / 928.5MiB   8.34%     10.8MB / 50.1MB   37.6MB / 0B   8
CONTAINER ID   IMAGE              COMMAND   CREATED        STATUS        PORTS                                                                                                                                     NAMES
2ce5325ea146   mb2:202403252002   "/mb2"    10 hours ago   Up 10 hours

ubuntu@ip-172-30-2-88:~$ sudo docker exec -it --privileged mb2 sh
# echo 2 >  /writable_proc/sys/vm/drop_caches

ubuntu@ip-172-30-2-88:~$ date; sudo docker stats --no-stream ; sudo docker ps -a
Tue Mar 26 11:47:02 UTC 2024
CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O     PIDS
2ce5325ea146   mb2       0.07%     77.49MiB / 928.5MiB   8.35%     10.8MB / 50.1MB   37.7MB / 0B   8
CONTAINER ID   IMAGE              COMMAND   CREATED        STATUS        PORTS                                                                                                                                     NAMES
2ce5325ea146   mb2:202403252002   "/mb2"    10 hours ago   Up 10 hours

I'll leave the writable_proc mount for a while, but will probably pull it before too long.

I'll continue looking at the stats we're gathering, but I'm not going to futz with websocket_tool in the near future; I'll do so before closing out this issue, however.

ctm commented 5 months ago

I played around with websocket_tool and didn't see any problems with > 1,000 observers. I've been watching my stats and I don't see anything particularly concerning, so I'm closing this.