Closed: WadeBarnes closed this issue 1 year ago.
At first, the high CPU usage seemed to be related to the times the BC Registries agent processes new credentials (BC Reg 0 and 30, LEAR 15 and 45); however, that does not seem to be the case.
Start parameters for Solr (to look into the other options):
/usr/local/openjdk-11/bin/java -server -Xms512m -Xmx2048m -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250 -XX:+UseLargePages -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent -Xlog:gc*:file=/var/solr/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M -Dsolr.jetty.inetaccess.includes= -Dsolr.jetty.inetaccess.excludes= -Dsolr.log.level=WARN -Dsolr.log.dir=/var/solr/logs -Djetty.port=8983 -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -Duser.timezone=UTC -XX:-OmitStackTraceInFastThrow -Djetty.home=/opt/solr/server -Dsolr.solr.home=/var/solr/data -Dsolr.data.home= -Dsolr.install.dir=/opt/solr -Dsolr.default.confdir=/opt/solr/server/solr/configsets/_default/conf -Dlog4j.configurationFile=/var/solr/log4j2.xml -Xss256k -jar start.jar --module=http
Stopped the load test and will allow the current pod to run overnight; I'll review the metrics and logs in the morning.
-XX:+UseLargePages might be something to look into. There are warnings related to it on pod startup: OpenJDK 64-Bit Server VM warning: Failed to reserve shared memory. (error = 12)
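A quick way to check whether huge pages are even available before digging further (a hedged sketch; run on the node or inside the pod, where /proc/meminfo reflects the host):

grep -i huge /proc/meminfo

If HugePages_Total is 0, the -XX:+UseLargePages flag has nothing to reserve and can likely just be dropped.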
Also need to review API logs. Seems like some sync/scanning is going on at times throughout the day.
Increased CPU allocation to see if it would peak just above the 4 CPU mark. CPU was pinned shortly after the pod was started:
Compare the previous image to the current image. We did not have this many issues with the previous image; on the surface they look the same (same Solr version), but the config may be slightly different.
Info for the last 14 hours. The search-engine became completely unresponsive at 4:23 AM; the pod was running and passing health checks, but was not responding to any search queries.
Rolled back to the previous search-engine image.
Image comparison:
search-engine@sha256:e6abd08ea2079377a6afc5b2754c2d2ad6129e0672078758a868b957f7a870d4
search-engine@sha256:c092c6b9a5a004e173b97af9e026c62557b121434cd2bf8459fcdba40f925cff
There are no changes to any of the Solr config between the Aries-VCR commits, and both images contain the Log4j mitigations, indicating the base images were built off the same version of the openshift-solr source. This means the two images should behave the same.
Upgraded the base image to Solr 8.11.2 (built off the solr-8.11.2 branch for testing); https://github.com/bcgov/openshift-solr/pull/15
cmd line for Solr 8.11.2 container:
/opt/java/openjdk/bin/java -server -Xms512m -Xmx2048m -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250 -XX:+UseLargePages -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent -Xlog:gc*:file=/var/solr/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M -Dsolr.jetty.inetaccess.includes= -Dsolr.jetty.inetaccess.excludes= -Dsolr.log.level=WARN -Dsolr.log.dir=/var/solr/logs -Djetty.port=8983 -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -Duser.timezone=UTC -XX:-OmitStackTraceInFastThrow -XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs -Djetty.home=/opt/solr/server -Dsolr.solr.home=/var/solr/data -Dsolr.data.home= -Dsolr.install.dir=/opt/solr -Dsolr.default.confdir=/opt/solr/server/solr/configsets/_default/conf -Dlog4j.configurationFile=/var/solr/log4j2.xml -Xss256k -jar start.jar --module=http --module=gzip
The new version still encounters OOM issues, but it now includes a script that kills the process right away when an OOM error occurs, which means the pod gets restarted whenever it hits one. It's the same problem, just with slightly different behavior.
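A hedged way to confirm a given restart was triggered by the OOM script rather than something else (pod name is a placeholder; assumes oc access to the namespace):

oc logs <pod-name> --previous | grep -i outofmemory
oc exec <pod-name> -- ls -lt /var/solr/logs | head

The first command checks the previous container's output for the OutOfMemoryError; the second lists the most recent Solr log files, which is where any output from the OOM script would be expected to land.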
Manually adjusted memory allocation and settings for testing:
- resources:
    limits:
      cpu: '5'
      memory: 8Gi
    requests:
      cpu: 500m
      memory: 3Gi
SOLR_JAVA_MEM=-Xms1024m -Xmx4096m
GC_TUNE=-XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250 -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent -XX:InitiatingHeapOccupancyPercent=75
GC_TUNE: Removed -XX:+UseLargePages, which eliminates the OpenJDK 64-Bit Server VM warning: Failed to reserve shared memory. (error = 12) warning messages, and added -XX:InitiatingHeapOccupancyPercent=75 to control the frequency of garbage collection.
After updating all the settings, I scanned through the logs to identify queries associated with the high-load conditions that were causing service interruptions and incorporated them into the loadTest scripts. The search-engine has been stable under load from 4 instances of the script running concurrently for the last hour.
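For reference, a minimal sketch of how the SOLR_JAVA_MEM and GC_TUNE values above can be applied without editing the deployment by hand (assuming a DeploymentConfig named search-engine; the object type and name are assumptions):

oc set env dc/search-engine \
  SOLR_JAVA_MEM='-Xms1024m -Xmx4096m' \
  GC_TUNE='-XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250 -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent -XX:InitiatingHeapOccupancyPercent=75'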
The search-engine pod fared much better over the past 19 hours than it did the previous few days.
The pod was restarted a total of 4 times between ~3:45pm and 5pm. This was due to activity after the load testing completed (~1:30 - 2:30). I'll be digging into the logs for details; some fine-tuning is needed.
search-engine restart times (due to OOM issues, via OOM killer script):
Recovery time (restart time) is typically under a minute.
Bytespider traffic - not huge:
Amazon Singapore: 47.128.0.0/14 (47.128.0.0 - 47.131.255.255)
Other traffic:
Other top IPs:
We need a better way to parse the IP addresses out of the logs in Kibana.
@i5okie, @esune mentioned you might be able to help with the Kibana log parsing. What I'd like to do is parse the X-Forwarded-For
field of the log message from logs like this:
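In the meantime, a rough interim sketch for pulling the client IPs out of exported raw log lines outside Kibana (assumes the X-Forwarded-For value appears literally in the message text; the export file name is a placeholder):

grep -Eio 'X-Forwarded-For[": ]*[0-9.]+' exported-logs.txt \
  | grep -Eo '[0-9.]+$' \
  | sort | uniq -c | sort -rn | head -20

This only grabs the first IP in the header and is no substitute for proper field parsing in Kibana.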
Increased the logging level on the search-engine to see if we can capture the events leading up to the OOM issues. The issue could be related to credential indexing as opposed to search queries. Or, perhaps a combination of the two at times.
search-engine pod metrics since Sep 14, 2023, 9:50 AM:
The pod has restarted 8 times due to OOM errors:
Captured a log containing an OOM-triggered restart: Solr-Error.log
Summary over the weekend:
Resource settings:
- resources:
    limits:
      cpu: '5'
      memory: 10Gi
    requests:
      cpu: 500m
      memory: 6Gi
- name: SOLR_JAVA_MEM
  value: '-Xms2048m -Xmx5120m'
- name: GC_TUNE
  value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem
    -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250
    -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75
Restarts (19):
Finally found some good articles on troubleshooting issues with Solr configurations. The main takeaway is to test various heap sizes. The articles also cover common problems with queries and caching configurations that can cause memory issues, and the Solr docs go into how to collect metrics and visualize the JVM heap. Good resources if we continue to run into issues.
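As a starting point, heap usage can be sampled from the Solr Metrics API mentioned in the docs (pod name is a placeholder; port per the start parameters above):

oc exec <pod-name> -- curl -s 'http://localhost:8983/solr/admin/metrics?group=jvm&prefix=memory.heap&wt=json'

Polling this periodically during a load test run should show how close the heap gets to -Xmx before a full GC or OOM.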
I've refactored the load tests to, hopefully, exercise the search-engine more. I updated the search-engine configuration (increased memory and heap) and ran the load tests for a couple hours without issue. I'll let it run overnight at least to see how it handles the regular traffic loads.
Resource Settings:
- resources:
    limits:
      cpu: '5'
      memory: 16Gi
    requests:
      cpu: 500m
      memory: 8Gi
- name: SOLR_JAVA_MEM
  value: '-Xms4g -Xmx8g'
- name: GC_TUNE
  value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem
    -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250
    -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75
The period between 6am and 9am was when I was performing load testing on the updated settings. There was one restart during that time, when I was pushing things a bit too hard. Load testing was performed with rate limiting adjusted to allow higher query volumes; otherwise most of the queries would have been blocked. Rate limiting was reapplied following the load testing.
Restarts since yesterday:
Resource Settings:
- resources:
    limits:
      cpu: '5'
      memory: 20Gi
    requests:
      cpu: 500m
      memory: 10Gi
- name: SOLR_JAVA_MEM
  value: '-Xms8g -Xmx10g'
- name: GC_TUNE
  value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled
    -XX:MaxGCPauseMillis=250 -XX:+AlwaysPreTouch
    -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75
Restarts since yesterday:
Resource Settings:
- resources:
    limits:
      cpu: '5'
      memory: 20Gi
    requests:
      cpu: 500m
      memory: 10Gi
- name: SOLR_JAVA_MEM
  value: '-Xms8g -Xmx10g'
- name: GC_TUNE
  value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled
    -XX:MaxGCPauseMillis=250 -XX:+AlwaysPreTouch
    -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75
Restarts since yesterday:
Resource Settings:
- resources:
    limits:
      cpu: '5'
      memory: 20Gi
    requests:
      cpu: 500m
      memory: 10Gi
- name: SOLR_JAVA_MEM
  value: '-Xms8g -Xmx10g'
- name: GC_TUNE
  value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled
    -XX:MaxGCPauseMillis=250 -XX:+AlwaysPreTouch
    -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75
Load testing between 8-10am using some queries added to the load test scripts. I was able to force the search-engine to restart once in that period, near the beginning of the run.
Note:
Restarts since yesterday:
Resource Settings:
- resources:
    limits:
      cpu: '5'
      memory: 20Gi
    requests:
      cpu: 500m
      memory: 10Gi
- name: SOLR_JAVA_MEM
  value: '-Xms8g -Xmx10g'
- name: GC_TUNE
  value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem
    -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250
    -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75
Restarts since last report:
Clearly there is no correlation between query volume and the OOM restarts.
All environments have been updated with the latest configurations.
Disabled the Solr cache on the search-engine; https://github.com/bcgov/aries-vcr/issues/758
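A hedged way to double-check which cache settings are actually in effect on the running core (pod and core names are placeholders; this assumes the Config API's query section exposes the cache configuration):

oc exec <pod-name> -- curl -s 'http://localhost:8983/solr/<core-name>/config/query'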
Resource Settings:
2023-10-06 to 2023-10-08
- resources:
    limits:
      cpu: '5'
      memory: 20Gi
    requests:
      cpu: 500m
      memory: 10Gi
- name: SOLR_JAVA_MEM
  value: '-Xms8g -Xmx10g'
- name: GC_TUNE
  value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem
    -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250
    -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75
2023-10-08 onward
- resources:
    limits:
      cpu: '6'
      memory: 30Gi
    requests:
      cpu: '6'
      memory: 20Gi
- name: SOLR_JAVA_MEM
  value: '-Xms16g -Xmx24g'
- name: GC_TUNE
  value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled
    -XX:MaxGCPauseMillis=250 -XX:+AlwaysPreTouch
    -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75
Search-engine container throttling:
Restarts since last report:
Note: resources were significantly increased in response to a heavy query load on 2023-10-08. Resources on other services were further adjusted on 2023-10-13 in response to the significant spike in query volume; the number of frontend and API instances was decreased, and the query rate limit was also lowered.
One thing to note is that the search-engine experiences a very high percentage of throttling during the periods when it's restarting the most, even when the container has been given a significant CPU allocation. CPU usage also spikes during these periods, to the point where the container exceeds its allocation for long enough to be taken offline by k8s.
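Throttling can also be confirmed from inside the container via the cgroup stats (a hedged sketch; the exact path depends on whether the node uses cgroup v1 or v2):

cat /sys/fs/cgroup/cpu/cpu.stat   # cgroup v1: nr_periods, nr_throttled, throttled_time
cat /sys/fs/cgroup/cpu.stat       # cgroup v2: nr_throttled, throttled_usec

A rapidly growing nr_throttled relative to nr_periods would line up with the throttling percentages seen in the dashboards.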
Reminder to monitor the active threads (top -H) during heavy load. I suspect the GC threads will be the most active.
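A hedged one-liner for capturing this non-interactively inside the pod (pod name is a placeholder):

oc exec <pod-name> -- top -b -H -n 1 | grep -E '%CPU|GC Thread|G1 Conc'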
top - 16:00:14 up 130 days, 22:34, 0 users, load average: 9.78, 8.58, 9.21
Threads: 56 total, 3 running, 53 sleeping, 0 stopped, 0 zombie
%Cpu(s): 24.9/9.4 34[ ]
MiB Mem : 38.0/257605.4 [ ]
MiB Swap: 0.0/0.0 [ ]
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 1002370+ 20 0 2620 600 528 S 0.0 0.0 0:00.08 sh
7 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:00.02 `- java
58 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:02.06 `- java
59 1002370+ 20 0 35.5g 26.1g 1.2g S 11.6 10.4 37:13.99 `- GC Thread#0
60 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:01.46 `- G1 Main Marker
61 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 12:20.16 `- G1 Conc#0
62 1002370+ 20 0 35.5g 26.1g 1.2g S 11.3 10.4 37:13.85 `- GC Thread#1
63 1002370+ 20 0 35.5g 26.1g 1.2g S 11.3 10.4 37:13.04 `- GC Thread#2
64 1002370+ 20 0 35.5g 26.1g 1.2g S 11.6 10.4 37:13.75 `- GC Thread#3
65 1002370+ 20 0 35.5g 26.1g 1.2g S 11.6 10.4 37:13.75 `- GC Thread#4
66 1002370+ 20 0 35.5g 26.1g 1.2g S 12.0 10.4 37:13.48 `- GC Thread#5
67 1002370+ 20 0 35.5g 26.1g 1.2g S 0.3 10.4 13:42.93 `- G1 Refine#0
68 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:34.28 `- G1 Young RemSet
69 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:19.12 `- VM Thread
70 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:00.02 `- Reference Handl
71 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:00.00 `- Finalizer
72 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:00.00 `- Signal Dispatch
73 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:00.00 `- Service Thread
74 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 1:15.37 `- C2 CompilerThre
75 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:07.41 `- C1 CompilerThre
76 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:00.22 `- Sweeper thread
77 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:45.94 `- VM Periodic Tas
78 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:00.08 `- Common-Cleaner
81 1002370+ 20 0 35.5g 26.1g 1.2g S 0.3 10.4 3:20.85 `- Log4j2-TF-3-Asy
82 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:00.00 `- ShutdownMonitor
86 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:03.22 `- qtp1548081008-1
93 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:00.04 `- Session-HouseKe
94 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 12:20.68 `- G1 Conc#1
95 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:00.00 `- h2sc-6-thread-1
96 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:00.14 `- Connection evic
97 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:00.13 `- Connection evic
search-engine right before being OOM killed:
top - 15:18:30 up 19:56, 0 users, load average: 16.74, 30.24, 26.22
Threads: 257 total, 7 running, 250 sleeping, 0 stopped, 0 zombie
%Cpu(s): 26.8 us, 4.1 sy, 0.0 ni, 68.5 id, 0.0 wa, 0.3 hi, 0.4 si, 0.0 st
MiB Mem : 257605.4 total, 121852.3 free, 64512.1 used, 71241.0 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 190113.6 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
63 1002370+ 20 0 46.8g 26.6g 301972 R 99.3 10.6 2:32.57 GC Thread#2
59 1002370+ 20 0 46.8g 26.6g 301972 R 99.0 10.6 2:32.81 GC Thread#0
65 1002370+ 20 0 46.8g 26.6g 301972 R 99.0 10.6 2:32.30 GC Thread#4
66 1002370+ 20 0 46.8g 26.6g 301972 R 99.0 10.6 2:32.71 GC Thread#5
62 1002370+ 20 0 46.8g 26.6g 301972 R 98.7 10.6 2:32.95 GC Thread#1
64 1002370+ 20 0 46.8g 26.6g 301972 R 98.7 10.6 2:32.60 GC Thread#3
69 1002370+ 20 0 46.8g 26.6g 301972 S 0.7 10.6 0:01.38 VM Thread
407 1002370+ 20 0 9356 3972 3188 R 0.3 0.0 0:00.66 top
1 1002370+ 20 0 2620 604 528 S 0.0 0.0 0:00.06 sh
7 1002370+ 20 0 46.8g 26.6g 301972 S 0.0 10.6 0:00.01 java
58 1002370+ 20 0 46.8g 26.6g 301972 S 0.0 10.6 0:01.95 java
60 1002370+ 20 0 46.8g 26.6g 301972 S 0.0 10.6 0:00.00 G1 Main Marker
61 1002370+ 20 0 46.8g 26.6g 301972 S 0.0 10.6 0:18.66 G1 Conc#0
67 1002370+ 20 0 46.8g 26.6g 301972 S 0.0 10.6 0:34.62 G1 Refine#0
68 1002370+ 20 0 46.8g 26.6g 301972 S 0.0 10.6 0:00.16 G1 Young RemSet
70 1002370+ 20 0 46.8g 26.6g 301972 S 0.0 10.6 0:00.00 Reference Handl
71 1002370+ 20 0 46.8g 26.6g 301972 S 0.0 10.6 0:00.00 Finalizer
72 1002370+ 20 0 46.8g 26.6g 301972 S 0.0 10.6 0:00.00 Signal Dispatch
73 1002370+ 20 0 46.8g 26.6g 301972 S 0.0 10.6 0:00.00 Service Thread
74 1002370+ 20 0 46.8g 26.6g 301972 S 0.0 10.6 0:16.11 C2 CompilerThre
The implementation of the blocklist, https://github.com/bcgov/orgbook-configurations/pull/138, has allowed us to limit the unbounded queries (https://github.com/bcgov/aries-vcr/issues/762) that were putting pressure on the search-engine resources. Therefore we're going to try a resource reduction.
Since the introduction of the blocklist, both the api and search-engine pods have seen far less activity and have been maintaining more sane resource levels. There have been no pod restarts, though the memory usage on the api pods is growing.
For example:
Resource usage has been steady and under control for both the search-engine and api pods. Memory use for the api pods has been stable. No pod restarts.
Commit history has been cleaned up, and the configurations have been deployed and tested in dev, test, and prod.
Search-engine resource use:
I've been testing this configuration under load since 6:30am this morning. A few OOM errors have popped up; these typically occur during times of high CPU usage. The CPU usage bumps into the pod's limit and causes the pod to be taken offline, in turn causing a service interruption.
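A hedged way to check whether those interruptions line up with failed liveness probes rather than OOM kills (pod name is a placeholder; assumes oc access to the namespace):

oc describe pod <pod-name> | grep -i -A 3 liveness
oc get events --sort-by=.lastTimestamp | grep search-engine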