Closed: WadeBarnes closed this issue 1 year ago.
At first, the high CPU usage seemed to be related to the times the BC Registries agent processes new credentials (BC Reg 0 and 30, LEAR 15 and 45); however, that does not seem to be the case.
Start parameters for Solr (to look into the other options):
/usr/local/openjdk-11/bin/java -server -Xms512m -Xmx2048m -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250 -XX:+UseLargePages -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent -Xlog:gc*:file=/var/solr/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M -Dsolr.jetty.inetaccess.includes= -Dsolr.jetty.inetaccess.excludes= -Dsolr.log.level=WARN -Dsolr.log.dir=/var/solr/logs -Djetty.port=8983 -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -Duser.timezone=UTC -XX:-OmitStackTraceInFastThrow -Djetty.home=/opt/solr/server -Dsolr.solr.home=/var/solr/data -Dsolr.data.home= -Dsolr.install.dir=/opt/solr -Dsolr.default.confdir=/opt/solr/server/solr/configsets/_default/conf -Dlog4j.configurationFile=/var/solr/log4j2.xml -Xss256k -jar start.jar --module=http
Stopped the load test and will allow the current pod to run overnight; I'll review the metrics and logs in the morning.
-XX:+UseLargePages might be something to look into. There are warnings related to it on pod startup: OpenJDK 64-Bit Server VM warning: Failed to reserve shared memory. (error = 12)
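A quick way to check whether huge pages are even available before digging further (a hedged sketch; run on the node or inside the pod, where /proc/meminfo reflects the host):

grep -i huge /proc/meminfo

If HugePages_Total is 0, the -XX:+UseLargePages flag has nothing to reserve and can likely just be dropped.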
Also need to review API logs. Seems like some sync/scanning is going on at times throughout the day.
Increased CPU allocation to see if it would peak just above the 4 CPU mark. CPU was pinned shortly after the pod was started:
Compare the previous image to the current image. We did not have this many issues with the previous image; on the surface they look the same (same Solr version), but the config may be slightly different.
Info for the last 14 hours. The search-engine became completely unresponsive at 4:23 AM; the pod was running and passing health checks, but was not responding to any search queries.
Rolled back to the previous search-engine image.
Image comparison:
search-engine@sha256:e6abd08ea2079377a6afc5b2754c2d2ad6129e0672078758a868b957f7a870d4
search-engine@sha256:c092c6b9a5a004e173b97af9e026c62557b121434cd2bf8459fcdba40f925cff
There are no changes to any of the Solr config between the Aries-VCR commits, and both images contain the Log4j mitigations, indicating the base images were built off the same version of the openshift-solr source. This means the two images should behave the same.
Upgraded the base image to Solr 8.11.2 (built off the solr-8.11.2 branch for testing); https://github.com/bcgov/openshift-solr/pull/15
cmd line for Solr 8.11.2 container:
/opt/java/openjdk/bin/java -server -Xms512m -Xmx2048m -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250 -XX:+UseLargePages -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent -Xlog:gc*:file=/var/solr/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M -Dsolr.jetty.inetaccess.includes= -Dsolr.jetty.inetaccess.excludes= -Dsolr.log.level=WARN -Dsolr.log.dir=/var/solr/logs -Djetty.port=8983 -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -Duser.timezone=UTC -XX:-OmitStackTraceInFastThrow -XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs -Djetty.home=/opt/solr/server -Dsolr.solr.home=/var/solr/data -Dsolr.data.home= -Dsolr.install.dir=/opt/solr -Dsolr.default.confdir=/opt/solr/server/solr/configsets/_default/conf -Dlog4j.configurationFile=/var/solr/log4j2.xml -Xss256k -jar start.jar --module=http --module=gzip
The new version still encounters OOM issues, but it now includes a script that kills the process right away when an OOM error occurs, which means the pod gets restarted whenever it hits one. It's the same problem, just with slightly different behavior.
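A hedged way to confirm a given restart was triggered by the OOM script rather than something else (pod name is a placeholder; assumes oc access to the namespace):

oc logs <pod-name> --previous | grep -i outofmemory
oc exec <pod-name> -- ls -lt /var/solr/logs | head

The first command checks the previous container's output for the OutOfMemoryError; the second lists the most recent Solr log files, which is where any output from the OOM script would be expected to land.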
Manually adjusted memory allocation and settings for testing:
- resources:
    limits:
      cpu: '5'
      memory: 8Gi
    requests:
      cpu: 500m
      memory: 3Gi
SOLR_JAVA_MEM=-Xms1024m -Xmx4096m
GC_TUNE=-XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250 -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent -XX:InitiatingHeapOccupancyPercent=75
GC_TUNE: Removed -XX:+UseLargePages, which eliminates the OpenJDK 64-Bit Server VM warning: Failed to reserve shared memory. (error = 12) warning messages, and added -XX:InitiatingHeapOccupancyPercent=75 to control the frequency of garbage collection.
After updating all the settings, I scanned through the logs to identify queries associated with the high-load conditions that were causing service interruptions and incorporated them into the loadTest scripts. The search-engine has been stable under load from 4 instances of the script running concurrently for the last hour.
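For reference, a minimal sketch of how the SOLR_JAVA_MEM and GC_TUNE values above can be applied without editing the deployment by hand (assuming a DeploymentConfig named search-engine; the object type and name are assumptions):

oc set env dc/search-engine \
  SOLR_JAVA_MEM='-Xms1024m -Xmx4096m' \
  GC_TUNE='-XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250 -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent -XX:InitiatingHeapOccupancyPercent=75'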
The search-engine pod fared much better over the past 19 hours than it did the previous few days.
The pod was restarted a total of 4 times between ~3:45pm and 5pm. This was due to activity after the load testing completed (~1:30 - 2:30). I'll be digging into the logs for details; some fine-tuning is needed.
search-engine restart times (due to OOM issues, via OOM killer script):
Recovery time (restart time) is typically under a minute.
Bytespider traffic - not huge:
Amazon Singapore: 47.128.0.0/14 (47.128.0.0 - 47.131.255.255)
Other traffic:
Other top IPs:
We need a better way to parse the IP addresses out of the logs in Kibana.
@i5okie, @esune mentioned you might be able to help with the Kibana log parsing. What I'd like to do is parse the X-Forwarded-For
field of the log message from logs like this:
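In the meantime, a rough interim sketch for pulling the client IPs out of exported raw log lines outside Kibana (assumes the X-Forwarded-For value appears literally in the message text; the export file name is a placeholder):

grep -Eio 'X-Forwarded-For[": ]*[0-9.]+' exported-logs.txt \
  | grep -Eo '[0-9.]+$' \
  | sort | uniq -c | sort -rn | head -20

This only grabs the first IP in the header and is no substitute for proper field parsing in Kibana.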
Increased the logging level on the search-engine to see if we can capture the events leading up to the OOM issues. The issue could be related to credential indexing as opposed to search queries. Or, perhaps a combination of the two at times.
search-engine pod metrics since Sep 14, 2023, 9:50 AM:
The pod has restarted 8 times due to OOM errors:
Captured a log containing an OOM-triggered restart: Solr-Error.log
Summary over the weekend:
Resource settings:
- resources:
    limits:
      cpu: '5'
      memory: 10Gi
    requests:
      cpu: 500m
      memory: 6Gi
- name: SOLR_JAVA_MEM
  value: '-Xms2048m -Xmx5120m'
- name: GC_TUNE
  value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem
    -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250
    -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75
Restarts (19):
Finally found some good articles on troubleshooting issues with Solr configurations. The main takeaway is to test various heap sizes. The articles also cover common problems with queries and caching configurations that can cause memory issues, and the Solr docs go into how to collect metrics and visualize the JVM heap. Good resources if we continue to run into issues.
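As a starting point, heap usage can be sampled from the Solr Metrics API mentioned in the docs (pod name is a placeholder; port per the start parameters above):

oc exec <pod-name> -- curl -s 'http://localhost:8983/solr/admin/metrics?group=jvm&prefix=memory.heap&wt=json'

Polling this periodically during a load test run should show how close the heap gets to -Xmx before a full GC or OOM.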
I've refactored the load tests to, hopefully, exercise the search-engine more. I updated the search-engine configuration (increased memory and heap) and ran the load tests for a couple hours without issue. I'll let it run overnight at least to see how it handles the regular traffic loads.
Resource Settings:
- resources:
    limits:
      cpu: '5'
      memory: 16Gi
    requests:
      cpu: 500m
      memory: 8Gi
- name: SOLR_JAVA_MEM
  value: '-Xms4g -Xmx8g'
- name: GC_TUNE
  value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem
    -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250
    -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75
The period between 6am and 9am was when I was performing load testing on the updated settings. There was one restart during that time, when I was pushing things a bit too hard. Load testing was performed with rate limiting adjusted to allow higher query volumes; otherwise most of the queries would have been blocked. Rate limiting was reapplied following the load testing.
Restarts since yesterday:
Resource Settings:
- resources:
    limits:
      cpu: '5'
      memory: 20Gi
    requests:
      cpu: 500m
      memory: 10Gi
- name: SOLR_JAVA_MEM
  value: '-Xms8g -Xmx10g'
- name: GC_TUNE
  value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled
    -XX:MaxGCPauseMillis=250 -XX:+AlwaysPreTouch
    -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75
Restarts since yesterday:
Resource Settings:
- resources:
    limits:
      cpu: '5'
      memory: 20Gi
    requests:
      cpu: 500m
      memory: 10Gi
- name: SOLR_JAVA_MEM
  value: '-Xms8g -Xmx10g'
- name: GC_TUNE
  value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled
    -XX:MaxGCPauseMillis=250 -XX:+AlwaysPreTouch
    -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75
Restarts since yesterday:
Resource Settings:
- resources:
    limits:
      cpu: '5'
      memory: 20Gi
    requests:
      cpu: 500m
      memory: 10Gi
- name: SOLR_JAVA_MEM
  value: '-Xms8g -Xmx10g'
- name: GC_TUNE
  value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled
    -XX:MaxGCPauseMillis=250 -XX:+AlwaysPreTouch
    -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75
Load testing between 8-10am using some queries added to the load test scripts. I was able to force the search-engine to restart once in that period, near the beginning of the run.
Note:
Restarts since yesterday:
Resource Settings:
- resources:
    limits:
      cpu: '5'
      memory: 20Gi
    requests:
      cpu: 500m
      memory: 10Gi
- name: SOLR_JAVA_MEM
  value: '-Xms8g -Xmx10g'
- name: GC_TUNE
  value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem
    -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250
    -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75
Restarts since last report:
Clearly there is no correlation between query volume and the OOM restarts.
All environments have been updated with the latest configurations.
Disabled the Solr cache on the search-engine; https://github.com/bcgov/aries-vcr/issues/758
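A hedged way to double-check which cache settings are actually in effect on the running core (pod and core names are placeholders; this assumes the Config API's query section exposes the cache configuration):

oc exec <pod-name> -- curl -s 'http://localhost:8983/solr/<core-name>/config/query'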
Resource Settings:
2023-10-06 to 2023-10-08
- resources:
    limits:
      cpu: '5'
      memory: 20Gi
    requests:
      cpu: 500m
      memory: 10Gi
- name: SOLR_JAVA_MEM
  value: '-Xms8g -Xmx10g'
- name: GC_TUNE
  value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem
    -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250
    -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75
2023-10-08 onward
- resources:
    limits:
      cpu: '6'
      memory: 30Gi
    requests:
      cpu: '6'
      memory: 20Gi
- name: SOLR_JAVA_MEM
  value: '-Xms16g -Xmx24g'
- name: GC_TUNE
  value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled
    -XX:MaxGCPauseMillis=250 -XX:+AlwaysPreTouch
    -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75
Search-engine container throttling:
Restarts since last report:
Note: resources were significantly increased in response to a heavy query load on 2023-10-08. Resources on other services were further adjusted on 2023-10-13 in response to the significant spike in query volume; the number of frontend and API instances was decreased, and the query rate limit was also lowered.
One thing to note is that the search-engine experiences a very high percentage of throttling during the periods when it's restarting the most, even when the container has been given a significant CPU allocation. CPU usage also spikes during these periods, to the point where the container exceeds its allocation for long enough to be taken offline by k8s.
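Throttling can also be confirmed from inside the container via the cgroup stats (a hedged sketch; the exact path depends on whether the node uses cgroup v1 or v2):

cat /sys/fs/cgroup/cpu/cpu.stat   # cgroup v1: nr_periods, nr_throttled, throttled_time
cat /sys/fs/cgroup/cpu.stat       # cgroup v2: nr_throttled, throttled_usec

A rapidly growing nr_throttled relative to nr_periods would line up with the throttling percentages seen in the dashboards.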
Reminder to monitor the active threads (top -H) during heavy load. I suspect the GC threads will be the most active.
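A hedged one-liner for capturing this non-interactively inside the pod (pod name is a placeholder):

oc exec <pod-name> -- top -b -H -n 1 | grep -E '%CPU|GC Thread|G1 Conc'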
top - 16:00:14 up 130 days, 22:34, 0 users, load average: 9.78, 8.58, 9.21
Threads: 56 total, 3 running, 53 sleeping, 0 stopped, 0 zombie
%Cpu(s): 24.9/9.4 34[ ]
MiB Mem : 38.0/257605.4 [ ]
MiB Swap: 0.0/0.0 [ ]
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 1002370+ 20 0 2620 600 528 S 0.0 0.0 0:00.08 sh
7 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:00.02 `- java
58 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:02.06 `- java
59 1002370+ 20 0 35.5g 26.1g 1.2g S 11.6 10.4 37:13.99 `- GC Thread#0
60 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:01.46 `- G1 Main Marker
61 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 12:20.16 `- G1 Conc#0
62 1002370+ 20 0 35.5g 26.1g 1.2g S 11.3 10.4 37:13.85 `- GC Thread#1
63 1002370+ 20 0 35.5g 26.1g 1.2g S 11.3 10.4 37:13.04 `- GC Thread#2
64 1002370+ 20 0 35.5g 26.1g 1.2g S 11.6 10.4 37:13.75 `- GC Thread#3
65 1002370+ 20 0 35.5g 26.1g 1.2g S 11.6 10.4 37:13.75 `- GC Thread#4
66 1002370+ 20 0 35.5g 26.1g 1.2g S 12.0 10.4 37:13.48 `- GC Thread#5
67 1002370+ 20 0 35.5g 26.1g 1.2g S 0.3 10.4 13:42.93 `- G1 Refine#0
68 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:34.28 `- G1 Young RemSet
69 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:19.12 `- VM Thread
70 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:00.02 `- Reference Handl
71 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:00.00 `- Finalizer
72 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:00.00 `- Signal Dispatch
73 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:00.00 `- Service Thread
74 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 1:15.37 `- C2 CompilerThre
75 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:07.41 `- C1 CompilerThre
76 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:00.22 `- Sweeper thread
77 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:45.94 `- VM Periodic Tas
78 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:00.08 `- Common-Cleaner
81 1002370+ 20 0 35.5g 26.1g 1.2g S 0.3 10.4 3:20.85 `- Log4j2-TF-3-Asy
82 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:00.00 `- ShutdownMonitor
86 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:03.22 `- qtp1548081008-1
93 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:00.04 `- Session-HouseKe
94 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 12:20.68 `- G1 Conc#1
95 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:00.00 `- h2sc-6-thread-1
96 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:00.14 `- Connection evic
97 1002370+ 20 0 35.5g 26.1g 1.2g S 0.0 10.4 0:00.13 `- Connection evic
search-engine right before being OOM killed:
top - 15:18:30 up 19:56, 0 users, load average: 16.74, 30.24, 26.22
Threads: 257 total, 7 running, 250 sleeping, 0 stopped, 0 zombie
%Cpu(s): 26.8 us, 4.1 sy, 0.0 ni, 68.5 id, 0.0 wa, 0.3 hi, 0.4 si, 0.0 st
MiB Mem : 257605.4 total, 121852.3 free, 64512.1 used, 71241.0 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 190113.6 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
63 1002370+ 20 0 46.8g 26.6g 301972 R 99.3 10.6 2:32.57 GC Thread#2
59 1002370+ 20 0 46.8g 26.6g 301972 R 99.0 10.6 2:32.81 GC Thread#0
65 1002370+ 20 0 46.8g 26.6g 301972 R 99.0 10.6 2:32.30 GC Thread#4
66 1002370+ 20 0 46.8g 26.6g 301972 R 99.0 10.6 2:32.71 GC Thread#5
62 1002370+ 20 0 46.8g 26.6g 301972 R 98.7 10.6 2:32.95 GC Thread#1
64 1002370+ 20 0 46.8g 26.6g 301972 R 98.7 10.6 2:32.60 GC Thread#3
69 1002370+ 20 0 46.8g 26.6g 301972 S 0.7 10.6 0:01.38 VM Thread
407 1002370+ 20 0 9356 3972 3188 R 0.3 0.0 0:00.66 top
1 1002370+ 20 0 2620 604 528 S 0.0 0.0 0:00.06 sh
7 1002370+ 20 0 46.8g 26.6g 301972 S 0.0 10.6 0:00.01 java
58 1002370+ 20 0 46.8g 26.6g 301972 S 0.0 10.6 0:01.95 java
60 1002370+ 20 0 46.8g 26.6g 301972 S 0.0 10.6 0:00.00 G1 Main Marker
61 1002370+ 20 0 46.8g 26.6g 301972 S 0.0 10.6 0:18.66 G1 Conc#0
67 1002370+ 20 0 46.8g 26.6g 301972 S 0.0 10.6 0:34.62 G1 Refine#0
68 1002370+ 20 0 46.8g 26.6g 301972 S 0.0 10.6 0:00.16 G1 Young RemSet
70 1002370+ 20 0 46.8g 26.6g 301972 S 0.0 10.6 0:00.00 Reference Handl
71 1002370+ 20 0 46.8g 26.6g 301972 S 0.0 10.6 0:00.00 Finalizer
72 1002370+ 20 0 46.8g 26.6g 301972 S 0.0 10.6 0:00.00 Signal Dispatch
73 1002370+ 20 0 46.8g 26.6g 301972 S 0.0 10.6 0:00.00 Service Thread
74 1002370+ 20 0 46.8g 26.6g 301972 S 0.0 10.6 0:16.11 C2 CompilerThre
The implementation of the blocklist, https://github.com/bcgov/orgbook-configurations/pull/138, has allowed us to limit the unbounded queries (https://github.com/bcgov/aries-vcr/issues/762) that were putting pressure on the search-engine resources. Therefore we're going to try a resource reduction.
Since the introduction of the blocklist, both the api and search-engine pods have seen far less activity and have been maintaining more sane resource levels. There have been no pod restarts, though the memory usage on the api pods is growing.
For example:
Resource usage has been steady and under control for both the search-engine and api pods. Memory use for the api pods has been stable. No pod restarts.
Commit history has been cleaned up, and the configurations have been deployed and tested in dev, test, and prod.
Search-engine resource use:
I've been testing this configuration under load since 6:30am this morning. A few OOM errors have popped up; these typically occur during times of high CPU usage. The CPU usage bumps into the pod's limit and causes the pod to be taken offline, in turn causing a service interruption.
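A hedged way to check whether those interruptions line up with failed liveness probes rather than OOM kills (pod name is a placeholder; assumes oc access to the namespace):

oc describe pod <pod-name> | grep -i -A 3 liveness
oc get events --sort-by=.lastTimestamp | grep search-engine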