mthrilok opened this issue 1 year ago
Environment: Local dev instance testing
Infrastructure used:
- 8 C5.xlarge instances, each with 4 vCPU and 8 GiB memory
- 6 Docker containers, each with 1.8 vCPU and 3.5 GiB memory
- 1 db.r5.2xlarge RDS instance in Multi-AZ mode
DB Scheduler config:
db-scheduler.enabled=true
db-scheduler.heartbeat-interval=1m
db-scheduler.polling-interval=10s
db-scheduler.polling-limit=
db-scheduler.table-name=scheduled_tasks
db-scheduler.immediate-execution-enabled=false
db-scheduler.scheduler-name=eCRNow-Instance-1
db-scheduler.threads=15
db-scheduler.polling-strategy=lock-and-fetch
db-scheduler.polling-strategy-lower-limit-fraction-of-threads=1.0
db-scheduler.polling-strategy-upper-limit-fraction-of-threads=3.0
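For reference, the properties above roughly map to the following programmatic setup with the kagkarlsson/db-scheduler builder API. This is a minimal sketch, assuming a plain `Scheduler` built outside the Spring Boot starter; the `sample-task` name, its schedule, and the `DataSource` are placeholders, not part of the eCRNow code:

```java
import com.github.kagkarlsson.scheduler.Scheduler;
import com.github.kagkarlsson.scheduler.SchedulerName;
import com.github.kagkarlsson.scheduler.task.helper.RecurringTask;
import com.github.kagkarlsson.scheduler.task.helper.Tasks;
import com.github.kagkarlsson.scheduler.task.schedule.FixedDelay;

import javax.sql.DataSource;
import java.time.Duration;

public class SchedulerConfigSketch {

    public static Scheduler buildScheduler(DataSource dataSource) {
        // Placeholder recurring task, only to illustrate how tasks are wired in.
        RecurringTask<Void> sampleTask =
                Tasks.recurring("sample-task", FixedDelay.ofSeconds(30))
                        .execute((taskInstance, executionContext) -> {
                            // task body goes here
                        });

        return Scheduler.create(dataSource)
                .startTasks(sampleTask)
                .schedulerName(new SchedulerName.Fixed("eCRNow-Instance-1")) // scheduler-name
                .tableName("scheduled_tasks")                                // table-name
                .threads(15)                                                 // threads=15
                .pollingInterval(Duration.ofSeconds(10))                     // polling-interval=10s
                .heartbeatInterval(Duration.ofMinutes(1))                    // heartbeat-interval=1m
                .pollUsingLockAndFetch(1.0, 3.0)                             // lock-and-fetch lower/upper fractions
                .build();
    }
}
```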
List of Suspects:
The app team will analyze the issues and discuss them as part of the regular calls.
We used 30k requests to perform the above test with the infrastructure described above.
In further investigation, we found that long GC pauses are causing task execution times to increase. During peak load we see GC pauses of around 15-20 seconds, sometimes even 30 seconds, occurring roughly every minute. GC logs can be found here: https://gceasy.io/my-gc-report.jsp?p=YXJjaGl2ZWQvMjAyMy8wNC8xMi81MmZjOGFhZi0yMjIwLTQ3MGUtYjI1Yi0xOWVhNjI2NDJkNDUudHh0LS0xNC05LTU1&channel=API
The graph below shows the GC pauses over time:
Some possible suggestions to reduce long GC pauses: https://blog.gceasy.io/2016/11/22/reduce-long-gc-pauses/
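As a starting point for experimentation (not a recommendation taken from the article above), these are the kind of JVM options we could try in the container. The heap size and pause target below are illustrative placeholders and would need to be sized to the ~3.5 GiB container memory limit:

```
# Illustrative JVM options only -- heap size and pause target are placeholders,
# not tested settings for eCRNow.
JAVA_OPTS="-XX:+UseG1GC \
  -XX:MaxGCPauseMillis=200 \
  -Xms2g -Xmx2g \
  -Xlog:gc*:file=/var/log/gc.log:time,uptime,level,tags"
```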
We ran one more round of testing with 1k requests; the observations follow.
Steps:
Reproduce with the following configuration:

Test server - SimulateEHR: C5.xlarge instance (4 vCPU, 8 GiB memory)
eCRNow: C5.xlarge instance (4 vCPU, 8 GiB memory)
DB: r5.xlarge RDS instance in Multi-AZ mode

1. Run 1 instance of SimulateEHR (can be run as Spring Boot or Tomcat), run 1 Docker container of eCRNow on one eCRNow server, and run the DB on the DB server.
2. Run the load test and examine the results.
3. If we can reproduce the issues, we can stop and use that configuration to address / understand the performance issues. Otherwise, add a second Docker container for eCRNow on a different server.
4. Run the load test and examine the results.
5. If we can reproduce the issues, we can stop and use that configuration to address / understand the performance issues.
6. If we cannot reproduce, add additional Docker containers on each server until we can reproduce the issue. Add up to 3 Docker containers on each server, limited to 2 servers. A sketch of running the containers with explicit resource limits follows these steps.
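To keep the reproduction containers consistent with the resource profile from the environment section (1.8 vCPU and 3.5 GiB per container), they could be started with explicit limits. The image name, tag, and port below are placeholders:

```
# Illustrative only -- image name, tag, and port are placeholders
docker run -d \
  --name ecrnow-1 \
  --cpus="1.8" \
  --memory="3.5g" \
  -p 8080:8080 \
  ecrnow-app:latest
```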
Analysis from our thread dumps taken during testing:
We observed a blocked thread in one of the thread dumps. Thread "db-scheduler-pool-1-thread-10" has acquired the lock on the "ca.uhn.fhir.rest.client.apache.ApacheRestfulClientFactory" instance, whereas the other thread, "db-scheduler-pool-1-thread-34", is waiting to acquire the lock on the same "ca.uhn.fhir.rest.client.apache.ApacheRestfulClientFactory" instance. Thread stacks for reference:
db-scheduler-pool-1-thread-10: getResourceById has acquired the lock on ca.uhn.fhir.rest.client.apache.ApacheRestfulClientFactory

"db-scheduler-pool-1-thread-10" #56 prio=5 os_prio=0 cpu=103327.32ms elapsed=16944.60s tid=0x00007f3b9002c000 nid=0x46 waiting for monitor entry [0x00007f3b94fdb000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at ca.uhn.fhir.rest.client.impl.RestfulClientFactory.getServerValidationMode(RestfulClientFactory.java:105)

db-scheduler-pool-1-thread-34: waiting to acquire the lock on ca.uhn.fhir.rest.client.apache.ApacheRestfulClientFactory

"db-scheduler-pool-1-thread-34" #80 prio=5 os_prio=0 cpu=98595.76ms elapsed=16944.59s tid=0x00007f3b9005e800 nid=0x5e waiting for monitor entry [0x00007f3b776f4000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at ca.uhn.fhir.rest.client.impl.RestfulClientFactory.newGenericClient(RestfulClientFactory.java:174)
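For context on where the contention comes from: the dump above shows both threads blocked on the same ApacheRestfulClientFactory monitor, entered via getServerValidationMode() and newGenericClient(). One possible mitigation (a sketch only, not a confirmed fix for eCRNow, and the imports assume a recent HAPI FHIR version) is to share a single FhirContext, turn off server metadata validation, and reuse one client per server base URL instead of creating a client on every task execution:

```java
import java.util.concurrent.ConcurrentHashMap;

import ca.uhn.fhir.context.FhirContext;
import ca.uhn.fhir.rest.client.api.IGenericClient;
import ca.uhn.fhir.rest.client.api.ServerValidationModeEnum;

public class FhirClientHolder {

    // FhirContext is expensive to create; build it once and share it.
    private static final FhirContext FHIR_CONTEXT = FhirContext.forR4();

    // One client per FHIR server base URL, so the synchronized newGenericClient()
    // on the factory is hit once per server instead of once per task execution.
    private static final ConcurrentHashMap<String, IGenericClient> CLIENTS = new ConcurrentHashMap<>();

    static {
        // Skip the CapabilityStatement fetch/validation on first client use,
        // another code path that goes through the contended factory methods.
        FHIR_CONTEXT.getRestfulClientFactory()
                .setServerValidationMode(ServerValidationModeEnum.NEVER);
    }

    public static IGenericClient clientFor(String serverBase) {
        return CLIENTS.computeIfAbsent(serverBase, FHIR_CONTEXT::newRestfulGenericClient);
    }
}
```

Whether sharing one client per server is appropriate depends on how interceptors (auth headers, logging) are registered; an alternative is to keep the shared FhirContext but create clients per thread, which still avoids most of the validation-mode contention.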
Hi @nbashyam -- please review the thread dump analysis above. We will explain in our sync-up call as well. Thanks.
Hi Dragon,
Check Reportable jobs executed in 30 seconds on average, which is twice the expected time. This adds to the overall processing time of tasks in the app. Could you please review this request on priority, as discussed in today's call?
We will share our findings over email.