bcgov / orgbook-configurations

Build and Deployment Configurations for the Indy-Catalyst version of the OrgBook
Apache License 2.0
3 stars 11 forks source link

Update search-engine resource allocations #137

Closed WadeBarnes closed 10 months ago

WadeBarnes commented 11 months ago

I've been testing this configuration under load since 6:30am this morning. There have been a few instances of the OOM errors pop up. These typically occur during the time of high CPU usage. The CPU usage bumps into the pod's limit and causes the pod to be taken offline, in turn causing a service interruption.

image

image

image

WadeBarnes commented 11 months ago

At first the high CPU usage seemed to be related to the times the BC Registries agent would process new credentials (BC Reg 0 and 30, LEAR 15 and 45), however that does not seem to be the case>

image

WadeBarnes commented 11 months ago

Start parameters for Solr (to look into the other options):

/usr/local/openjdk-11/bin/java-server-Xms512m-Xmx2048m-XX:+UseG1GC-XX:+PerfDisableSharedMem-XX:+ParallelRefProcEnabled-XX:MaxGCPauseMillis=250-XX:+UseLargePages-XX:+AlwaysPreTouch-XX:+ExplicitGCInvokesConcurrent-Xlog:gc*:file=/var/solr/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M-Dsolr.jetty.inetaccess.includes=-Dsolr.jetty.inetaccess.excludes=-Dsolr.log.level=WARN-Dsolr.log.dir=/var/solr/logs-Djetty.port=8983-DSTOP.PORT=7983-DSTOP.KEY=solrrocks-Duser.timezone=UTC-XX:-OmitStackTraceInFastThrow-Djetty.home=/opt/solr/server-Dsolr.solr.home=/var/solr/data-Dsolr.data.home=-Dsolr.install.dir=/opt/solr-Dsolr.default.confdir=/opt/solr/server/solr/configsets/_default/conf-Dlog4j.configurationFile=/var/solr/log4j2.xml-Xss256k-jarstart.jar--module=http1002370000@search-engine-bc-59-498gs
WadeBarnes commented 11 months ago

Stopped load test and will allow the current pod to run over night and review the metrics and logs in the morning.

WadeBarnes commented 11 months ago

-XX:+UseLargePages might be something to look into. There are warnings related to this on pod startup; OpenJDK 64-Bit Server VM warning: Failed to reserve shared memory. (error = 12)

WadeBarnes commented 11 months ago

Also need to review API logs. Seems like some sync/scanning is going on at times throughout the day.

WadeBarnes commented 11 months ago

Increased CPU allocation to see if it would peek just about the 4CPU mark. CPU was pinned shortly after the pod was started:

image

image

image

WadeBarnes commented 11 months ago

Compare previous image to the current image. We did not have this many issues with the previous image. On the surface they look the same (same Solr version), but the config may be slightly different.

WadeBarnes commented 11 months ago

Info for the last 14hours. Became completely unresponsive at 4:23 AM. The pod was running and passing health checks, but was not responding to any search queries.

image

image

image

image

WadeBarnes commented 11 months ago

Rolled back to the previous search-engine image.

WadeBarnes commented 11 months ago

Image comparison:

There are no changes to any of the Solr config between the Aries-VCR commits, and both images contain the Log4j mitigations, indicating the base images were built off the same version of the openshift-solr source. This means the two images should behave the same.

WadeBarnes commented 11 months ago

Upgraded the base image to Solr 8.11.2 (built off the solr-8.11.2 branch for testing); https://github.com/bcgov/openshift-solr/pull/15

cmd line for Solr 8.11.2 container:

/opt/java/openjdk/bin/java-server-Xms512m-Xmx2048m-XX:+UseG1GC-XX:+PerfDisableSharedMem-XX:+ParallelRefProcEnabled-XX:MaxGCPauseMillis=250-XX:+UseLargePages-XX:+AlwaysPreTouch-XX:+ExplicitGCInvokesConcurrent-Xlog:gc*:file=/var/solr/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M-Dsolr.jetty.inetaccess.includes=-Dsolr.jetty.inetaccess.excludes=-Dsolr.log.level=WARN-Dsolr.log.dir=/var/solr/logs-Djetty.port=8983-DSTOP.PORT=7983-DSTOP.KEY=solrrocks-Duser.timezone=UTC-XX:-OmitStackTraceInFastThrow-XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs-Djetty.home=/opt/solr/server-Dsolr.solr.home=/var/solr/data-Dsolr.data.home=-Dsolr.install.dir=/opt/solr-Dsolr.default.confdir=/opt/solr/server/solr/configsets/_default/conf-Dlog4j.configurationFile=/var/solr/log4j2.xml-Xss256k-jarstart.jar--module=http--module=gzip10
WadeBarnes commented 11 months ago

New version still encounters OOM issues, but now has a script that runs to kill the process right away when OOM errors occur. Which means the pod gets restarted whenever it encounters an OOM error. So same problem, just a slightly different behavior.

WadeBarnes commented 11 months ago

Manually adjusted memory allocation and settings for testing:

- resources:
    limits:
      cpu: '5'
      memory: 8Gi
    requests:
      cpu: 500m
      memory: 3Gi
SOLR_JAVA_MEM=-Xms1024m -Xmx4096m
GC_TUNE=-XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250 -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent -XX:InitiatingHeapOccupancyPercent=75
WadeBarnes commented 11 months ago

After updating all the settings I scanned through the logs to identify queries associated with high load conditions that were causing service interruptions and incorporated them into the loadTest scripts. The search-engine as been stable under load from 4 instances of the script running concurrently for the last hour.

image

WadeBarnes commented 11 months ago

The search-engine pod faired much better over the past 19 hours, than it did the previous few days.

image

image

The pod was restarted a total of 4 times between ~3:45pm and 5pm. This was due to activity after the load testing was completed (~1:30 - 2:30). I'll be digging into the logs for details. There is some fine tuning needed.

search-engine restart times (due to OOM issues, via OOM killer script):

Recovery time (restart time) is typically under a minute.

WadeBarnes commented 11 months ago

Bytespider Traffic - Not huge Amazon Singapore 47.128.0.0/14 (47.128.0.0 - 47.131.255.255)

image

Other traffic: image

Other top IPs:

WadeBarnes commented 11 months ago

We need a better way to parse the IP addresses out of the logs in Kibana

WadeBarnes commented 11 months ago

@i5okie, @esune mentioned you might be able to help with the Kibana log parsing. What I'd like to do is parse the X-Forwarded-For field of the log message from logs like this:

image

WadeBarnes commented 11 months ago

Increased the logging level on the search-engine to see if we can capture the events leading up to the OOM issues. The issue could be related to credential indexing as opposed to search queries. Or, perhaps a combination of the two at times.

WadeBarnes commented 11 months ago

search-engine pod metrics since Sep 14, 2023, 9:50 AM:

image

image

The pod has restarted 8 times due to OOM errors:

WadeBarnes commented 11 months ago

Captured a log containing an OOM triggered restart. Solr-Error.log

WadeBarnes commented 11 months ago

Summary over the weekend:

Resource settings:

- resources:
    limits:
      cpu: '5'
      memory: 10Gi
    requests:
      cpu: 500m
      memory: 6Gi

  - name: SOLR_JAVA_MEM
    value: '-Xms2048m -Xmx5120m'
  - name: GC_TUNE
    value: >-
      -XX:+UseG1GC -XX:+PerfDisableSharedMem
      -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250
      -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent
      -XX:InitiatingHeapOccupancyPercent=75

image

image

Restarts (19):

WadeBarnes commented 11 months ago

Finally found some good articles on troubleshooting issues with Solr configurations. The main takeaway is testing various heap sizes. The articles also cover more information on common problems with queries and caching configurations that can cause memory issues. Solr docs go into how to collect metrics and visualize the JVM heap. Good resources if we continue to run into issues.

WadeBarnes commented 11 months ago

I've refactored the load tests to, hopefully, exercise the search-engine more. I updated the search-engine configuration (increased memory and heap) and ran the load tests for a couple hours without issue. I'll let it run overnight at least to see how it handles the regular traffic loads.

WadeBarnes commented 11 months ago

Resource Settings:

- resources:
    limits:
      cpu: '5'
      memory: 16Gi
    requests:
      cpu: 500m
      memory: 8Gi

- name: SOLR_JAVA_MEM
  value: '-Xms4g -Xmx8g'
- name: GC_TUNE
  value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem
    -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250
    -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75

The period between 6am and 9am was were I was performing load testing on the updated settings. There was the one restart during that time were I was pushing things a bit too hard. Load testing was performed with rate limiting adjusted to allow higher query volumes, otherwise most of the queries were getting blocked. Rate limiting was reapplied following the load testing.

image

image

Restarts since yesterday:

WadeBarnes commented 11 months ago

Resource Settings:

- resources:
    limits:
        cpu: '5'
        memory: 20Gi
    requests:
        cpu: 500m
        memory: 10Gi

- name: SOLR_JAVA_MEM
    value: '-Xms8g -Xmx10g'
- name: GC_TUNE
    value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled
    -XX:MaxGCPauseMillis=250 -XX:+AlwaysPreTouch
    -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75

image

image

image

image

image

image

image

Restarts since yesterday:

WadeBarnes commented 11 months ago

Resource Settings:

- resources:
    limits:
        cpu: '5'
        memory: 20Gi
    requests:
        cpu: 500m
        memory: 10Gi

- name: SOLR_JAVA_MEM
    value: '-Xms8g -Xmx10g'
- name: GC_TUNE
    value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled
    -XX:MaxGCPauseMillis=250 -XX:+AlwaysPreTouch
    -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75

image

image

image

image

Restarts since yesterday:

WadeBarnes commented 11 months ago

Resource Settings:

- resources:
    limits:
        cpu: '5'
        memory: 20Gi
    requests:
        cpu: 500m
        memory: 10Gi

- name: SOLR_JAVA_MEM
    value: '-Xms8g -Xmx10g'
- name: GC_TUNE
    value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled
    -XX:MaxGCPauseMillis=250 -XX:+AlwaysPreTouch
    -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75

Load testing between 8-10am using some queries added to the load test scripts. I was able to force the search-engine to restart once in that period, near the beginning of the run.

Note:

image

image

image

image

Restarts since yesterday:

WadeBarnes commented 11 months ago

Resource Settings:

- resources:
    limits:
      cpu: '5'
      memory: 20Gi
    requests:
      cpu: 500m
      memory: 10Gi

- name: SOLR_JAVA_MEM
  value: '-Xms8g -Xmx10g'
- name: GC_TUNE
  value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem
    -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250
    -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75

image

image

image

Restarts since last report:

WadeBarnes commented 11 months ago

Clearly there is no correlation between query volume and the OOM restarts.

WadeBarnes commented 11 months ago

All environments have been updated with the latest configurations.

WadeBarnes commented 11 months ago

Disabled the Solr cache on the search-engine; https://github.com/bcgov/aries-vcr/issues/758

WadeBarnes commented 10 months ago

Resource Settings:

2023-10-06 to 2023-10-08

- resources:
    limits:
      cpu: '5'
      memory: 20Gi
    requests:
      cpu: 500m
      memory: 10Gi

- name: SOLR_JAVA_MEM
  value: '-Xms8g -Xmx10g'
- name: GC_TUNE
  value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem
    -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250
    -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75

2023-10-08 onward

- resources:
    limits:
      cpu: '6'
      memory: 30Gi
    requests:
      cpu: '6'
      memory: 20Gi

 - name: SOLR_JAVA_MEM
    value: '-Xms16g -Xmx24g'
  - name: GC_TUNE
    value: >-
      -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled
      -XX:MaxGCPauseMillis=250 -XX:+AlwaysPreTouch
      -XX:+ExplicitGCInvokesConcurrent
      -XX:InitiatingHeapOccupancyPercent=75

image

image

image

Search-engine container throttling: image image image image

Restarts since last report:

Note resources were significantly increased in response to a heavy query load on 2023-10-08. Resources on other services were further adjusted on 2023-10-13 in response to the significant spike in query volume. The number of frontend and API instances were decreased and the query rate limit was also decreased.

One thing to note is that the search-engine experiences a very high percentage of throttling during the periods when it's restarting the most, even when the container has been given a significant allocation of CPU. CPU usage also spikes during these periods to the point where the container goes beyond it's allocation for long enough to be taken offline by k8s.

WadeBarnes commented 10 months ago

Reminder to monitor the active threads (top -H) during heavy load. I suspect the GC threads will be the most active.

top - 16:00:14 up 130 days, 22:34,  0 users,  load average: 9.78, 8.58, 9.21
Threads:  56 total,   3 running,  53 sleeping,   0 stopped,   0 zombie
%Cpu(s):  24.9/9.4    34[                                                                                                    ]
MiB Mem : 38.0/257605.4 [                                                                                                    ]
MiB Swap:  0.0/0.0      [                                                                                                    ]

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                 
      1 1002370+  20   0    2620    600    528 S   0.0   0.0   0:00.08 sh                                                                                                                                                      
      7 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:00.02  `- java                                                                                                                                                
     58 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:02.06      `- java                                                                                                                                            
     59 1002370+  20   0   35.5g  26.1g   1.2g S  11.6  10.4  37:13.99      `- GC Thread#0                                                                                                                                     
     60 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:01.46      `- G1 Main Marker                                                                                                                                  
     61 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4  12:20.16      `- G1 Conc#0                                                                                                                                       
     62 1002370+  20   0   35.5g  26.1g   1.2g S  11.3  10.4  37:13.85      `- GC Thread#1                                                                                                                                     
     63 1002370+  20   0   35.5g  26.1g   1.2g S  11.3  10.4  37:13.04      `- GC Thread#2                                                                                                                                     
     64 1002370+  20   0   35.5g  26.1g   1.2g S  11.6  10.4  37:13.75      `- GC Thread#3                                                                                                                                     
     65 1002370+  20   0   35.5g  26.1g   1.2g S  11.6  10.4  37:13.75      `- GC Thread#4                                                                                                                                     
     66 1002370+  20   0   35.5g  26.1g   1.2g S  12.0  10.4  37:13.48      `- GC Thread#5                                                                                                                                     
     67 1002370+  20   0   35.5g  26.1g   1.2g S   0.3  10.4  13:42.93      `- G1 Refine#0                                                                                                                                     
     68 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:34.28      `- G1 Young RemSet                                                                                                                                 
     69 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:19.12      `- VM Thread                                                                                                                                       
     70 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:00.02      `- Reference Handl                                                                                                                                 
     71 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:00.00      `- Finalizer                                                                                                                                       
     72 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:00.00      `- Signal Dispatch                                                                                                                                 
     73 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:00.00      `- Service Thread                                                                                                                                  
     74 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   1:15.37      `- C2 CompilerThre                                                                                                                                 
     75 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:07.41      `- C1 CompilerThre                                                                                                                                 
     76 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:00.22      `- Sweeper thread                                                                                                                                  
     77 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:45.94      `- VM Periodic Tas                                                                                                                                 
     78 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:00.08      `- Common-Cleaner                                                                                                                                  
     81 1002370+  20   0   35.5g  26.1g   1.2g S   0.3  10.4   3:20.85      `- Log4j2-TF-3-Asy                                                                                                                                 
     82 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:00.00      `- ShutdownMonitor                                                                                                                                 
     86 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:03.22      `- qtp1548081008-1                                                                                                                                 
     93 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:00.04      `- Session-HouseKe                                                                                                                                 
     94 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4  12:20.68      `- G1 Conc#1                                                                                                                                       
     95 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:00.00      `- h2sc-6-thread-1                                                                                                                                 
     96 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:00.14      `- Connection evic                                                                                                                                 
     97 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:00.13      `- Connection evic                                                                                                                                 
WadeBarnes commented 10 months ago

search-engine right before being OOM killed:

top - 15:18:30 up 19:56,  0 users,  load average: 16.74, 30.24, 26.22
Threads: 257 total,   7 running, 250 sleeping,   0 stopped,   0 zombie
%Cpu(s): 26.8 us,  4.1 sy,  0.0 ni, 68.5 id,  0.0 wa,  0.3 hi,  0.4 si,  0.0 st
MiB Mem : 257605.4 total, 121852.3 free,  64512.1 used,  71241.0 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 190113.6 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                 
     63 1002370+  20   0   46.8g  26.6g 301972 R  99.3  10.6   2:32.57 GC Thread#2                                                                                             
     59 1002370+  20   0   46.8g  26.6g 301972 R  99.0  10.6   2:32.81 GC Thread#0                                                                                             
     65 1002370+  20   0   46.8g  26.6g 301972 R  99.0  10.6   2:32.30 GC Thread#4                                                                                             
     66 1002370+  20   0   46.8g  26.6g 301972 R  99.0  10.6   2:32.71 GC Thread#5                                                                                             
     62 1002370+  20   0   46.8g  26.6g 301972 R  98.7  10.6   2:32.95 GC Thread#1                                                                                             
     64 1002370+  20   0   46.8g  26.6g 301972 R  98.7  10.6   2:32.60 GC Thread#3                                                                                             
     69 1002370+  20   0   46.8g  26.6g 301972 S   0.7  10.6   0:01.38 VM Thread                                                                                               
    407 1002370+  20   0    9356   3972   3188 R   0.3   0.0   0:00.66 top                                                                                                     
      1 1002370+  20   0    2620    604    528 S   0.0   0.0   0:00.06 sh                                                                                                      
      7 1002370+  20   0   46.8g  26.6g 301972 S   0.0  10.6   0:00.01 java                                                                                                    
     58 1002370+  20   0   46.8g  26.6g 301972 S   0.0  10.6   0:01.95 java                                                                                                    
     60 1002370+  20   0   46.8g  26.6g 301972 S   0.0  10.6   0:00.00 G1 Main Marker                                                                                          
     61 1002370+  20   0   46.8g  26.6g 301972 S   0.0  10.6   0:18.66 G1 Conc#0                                                                                               
     67 1002370+  20   0   46.8g  26.6g 301972 S   0.0  10.6   0:34.62 G1 Refine#0                                                                                             
     68 1002370+  20   0   46.8g  26.6g 301972 S   0.0  10.6   0:00.16 G1 Young RemSet                                                                                         
     70 1002370+  20   0   46.8g  26.6g 301972 S   0.0  10.6   0:00.00 Reference Handl                                                                                         
     71 1002370+  20   0   46.8g  26.6g 301972 S   0.0  10.6   0:00.00 Finalizer                                                                                               
     72 1002370+  20   0   46.8g  26.6g 301972 S   0.0  10.6   0:00.00 Signal Dispatch                                                                                         
     73 1002370+  20   0   46.8g  26.6g 301972 S   0.0  10.6   0:00.00 Service Thread                                                                                          
     74 1002370+  20   0   46.8g  26.6g 301972 S   0.0  10.6   0:16.11 C2 CompilerThre 
WadeBarnes commented 10 months ago

The implementation of the blocklist, https://github.com/bcgov/orgbook-configurations/pull/138, as allowed us to limit the unbounded queries (https://github.com/bcgov/aries-vcr/issues/762) causing the pressure on the search-engine resources. Therefore we're going to try out a resource reduction.

WadeBarnes commented 10 months ago

Since the introduction of the blocklist both the api and search-engine pods have seen far less activity and have been maintaining more sane resource levels. There have been no pod restarts, though the memory usage on the api pods is growing.

For example: image

WadeBarnes commented 10 months ago

Resource usage has been steady and under control for both the serach-engine and api pods. Memory use for the api pods has been stable. No pod restarts.

WadeBarnes commented 10 months ago

Commit history has been cleaned up, and the configurations have been deployed and tested in dev, test, and prod.

WadeBarnes commented 10 months ago

Search-engine resource use: image image image