mthrilok opened this issue 1 year ago
Environment: Local dev instance testing
Infrastructure used:
- 8 C5.xlarge instances, each with 4 vCPU and 8 GiB memory
- 6 Docker containers, each with 1.8 vCPU and 3.5 GiB memory
- 1 db.r5.2xlarge RDS instance in Multi-AZ mode
DB Scheduler config:
db-scheduler.enabled=true
db-scheduler.heartbeat-interval=1m
db-scheduler.polling-interval=10s
db-scheduler.polling-limit=
db-scheduler.table-name=scheduled_tasks
db-scheduler.immediate-execution-enabled=false
db-scheduler.scheduler-name=eCRNow-Instance-1
db-scheduler.threads=15
db-scheduler.polling-strategy=lock-and-fetch
db-scheduler.polling-strategy-lower-limit-fraction-of-threads=1.0
db-scheduler.polling-strategy-upper-limit-fraction-of-threads=3.0
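For reference, the properties above roughly map to the following programmatic setup with the kagkarlsson/db-scheduler builder API. This is a minimal sketch, assuming a plain `Scheduler` built outside the Spring Boot starter; the `sample-task` name, its schedule, and the `DataSource` are placeholders, not part of the eCRNow code:

```java
import com.github.kagkarlsson.scheduler.Scheduler;
import com.github.kagkarlsson.scheduler.SchedulerName;
import com.github.kagkarlsson.scheduler.task.helper.RecurringTask;
import com.github.kagkarlsson.scheduler.task.helper.Tasks;
import com.github.kagkarlsson.scheduler.task.schedule.FixedDelay;

import javax.sql.DataSource;
import java.time.Duration;

public class SchedulerConfigSketch {

    public static Scheduler buildScheduler(DataSource dataSource) {
        // Placeholder recurring task, only to illustrate how tasks are wired in.
        RecurringTask<Void> sampleTask =
                Tasks.recurring("sample-task", FixedDelay.ofSeconds(30))
                        .execute((taskInstance, executionContext) -> {
                            // task body goes here
                        });

        return Scheduler.create(dataSource)
                .startTasks(sampleTask)
                .schedulerName(new SchedulerName.Fixed("eCRNow-Instance-1")) // scheduler-name
                .tableName("scheduled_tasks")                                // table-name
                .threads(15)                                                 // threads=15
                .pollingInterval(Duration.ofSeconds(10))                     // polling-interval=10s
                .heartbeatInterval(Duration.ofMinutes(1))                    // heartbeat-interval=1m
                .pollUsingLockAndFetch(1.0, 3.0)                             // lock-and-fetch lower/upper fractions
                .build();
    }
}
```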
List of Suspects:
The app team will analyze the issues and discuss them as part of the regular calls.
We used 30k requests to perform the above test with the infrastructure described above.
In further investigation, we found that long GC pauses are causing task execution times to increase. During peak load we see GC pauses of around 15-20 seconds, sometimes even 30 seconds, occurring roughly every minute. GC logs can be found here: https://gceasy.io/my-gc-report.jsp?p=YXJjaGl2ZWQvMjAyMy8wNC8xMi81MmZjOGFhZi0yMjIwLTQ3MGUtYjI1Yi0xOWVhNjI2NDJkNDUudHh0LS0xNC05LTU1&channel=API
The graph below shows the GC pauses over time:
Some possible suggestions to reduce long GC pauses: https://blog.gceasy.io/2016/11/22/reduce-long-gc-pauses/
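As a starting point for experimentation (not a recommendation taken from the article above), these are the kind of JVM options we could try in the container. The heap size and pause target below are illustrative placeholders and would need to be sized to the ~3.5 GiB container memory limit:

```
# Illustrative JVM options only -- heap size and pause target are placeholders,
# not tested settings for eCRNow.
JAVA_OPTS="-XX:+UseG1GC \
  -XX:MaxGCPauseMillis=200 \
  -Xms2g -Xmx2g \
  -Xlog:gc*:file=/var/log/gc.log:time,uptime,level,tags"
```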
We ran one more round of testing with 1k requests; the observations follow.
Steps:
Reproduce with the following configuration:

Test server - SimulateEHR: C5.xlarge instance (4 vCPU, 8 GiB memory)
eCRNow: C5.xlarge instance (4 vCPU, 8 GiB memory)
DB: r5.xlarge RDS instance in Multi-AZ mode

1. Run 1 instance of SimulateEHR (can be run as Spring Boot or Tomcat), run 1 Docker container of eCRNow on one eCRNow server, and run the DB on the DB server.
2. Run the load test and examine the results.
3. If we can reproduce the issues, we can stop and use that configuration to address / understand the performance issues. Otherwise, add a second Docker container for eCRNow on a different server.
4. Run the load test and examine the results.
5. If we can reproduce the issues, we can stop and use that configuration to address / understand the performance issues.
6. If we cannot reproduce, add additional Docker containers on each server until we can reproduce the issue. Add up to 3 Docker containers on each server, limited to 2 servers. A sketch of running the containers with explicit resource limits follows these steps.
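To keep the reproduction containers consistent with the resource profile from the environment section (1.8 vCPU and 3.5 GiB per container), they could be started with explicit limits. The image name, tag, and port below are placeholders:

```
# Illustrative only -- image name, tag, and port are placeholders
docker run -d \
  --name ecrnow-1 \
  --cpus="1.8" \
  --memory="3.5g" \
  -p 8080:8080 \
  ecrnow-app:latest
```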
Analysis from our thread dumps taken during testing:
We observed a blocked thread in one of the thread dumps. Thread "db-scheduler-pool-1-thread-10" has acquired the lock on the "ca.uhn.fhir.rest.client.apache.ApacheRestfulClientFactory" instance, whereas the other thread, "db-scheduler-pool-1-thread-34", is waiting to acquire the lock on the same "ca.uhn.fhir.rest.client.apache.ApacheRestfulClientFactory" instance. Thread stacks for reference:
db-scheduler-pool-1-thread-10: getResourceById has acquired the lock on ca.uhn.fhir.rest.client.apache.ApacheRestfulClientFactory

"db-scheduler-pool-1-thread-10" #56 prio=5 os_prio=0 cpu=103327.32ms elapsed=16944.60s tid=0x00007f3b9002c000 nid=0x46 waiting for monitor entry [0x00007f3b94fdb000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at ca.uhn.fhir.rest.client.impl.RestfulClientFactory.getServerValidationMode(RestfulClientFactory.java:105)

db-scheduler-pool-1-thread-34: waiting to acquire the lock on ca.uhn.fhir.rest.client.apache.ApacheRestfulClientFactory

"db-scheduler-pool-1-thread-34" #80 prio=5 os_prio=0 cpu=98595.76ms elapsed=16944.59s tid=0x00007f3b9005e800 nid=0x5e waiting for monitor entry [0x00007f3b776f4000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at ca.uhn.fhir.rest.client.impl.RestfulClientFactory.newGenericClient(RestfulClientFactory.java:174)
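For context on where the contention comes from: the dump above shows both threads blocked on the same ApacheRestfulClientFactory monitor, entered via getServerValidationMode() and newGenericClient(). One possible mitigation (a sketch only, not a confirmed fix for eCRNow, and the imports assume a recent HAPI FHIR version) is to share a single FhirContext, turn off server metadata validation, and reuse one client per server base URL instead of creating a client on every task execution:

```java
import java.util.concurrent.ConcurrentHashMap;

import ca.uhn.fhir.context.FhirContext;
import ca.uhn.fhir.rest.client.api.IGenericClient;
import ca.uhn.fhir.rest.client.api.ServerValidationModeEnum;

public class FhirClientHolder {

    // FhirContext is expensive to create; build it once and share it.
    private static final FhirContext FHIR_CONTEXT = FhirContext.forR4();

    // One client per FHIR server base URL, so the synchronized newGenericClient()
    // on the factory is hit once per server instead of once per task execution.
    private static final ConcurrentHashMap<String, IGenericClient> CLIENTS = new ConcurrentHashMap<>();

    static {
        // Skip the CapabilityStatement fetch/validation on first client use,
        // another code path that goes through the contended factory methods.
        FHIR_CONTEXT.getRestfulClientFactory()
                .setServerValidationMode(ServerValidationModeEnum.NEVER);
    }

    public static IGenericClient clientFor(String serverBase) {
        return CLIENTS.computeIfAbsent(serverBase, FHIR_CONTEXT::newRestfulGenericClient);
    }
}
```

Whether sharing one client per server is appropriate depends on how interceptors (auth headers, logging) are registered; an alternative is to keep the shared FhirContext but create clients per thread, which still avoids most of the validation-mode contention.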
Hi @nbashyam -- please review the thread dump analysis above. We will explain in our sync-up call as well. Thanks.
Hi Dragon,
Check Reportable jobs executed in 30 seconds on average, which is twice the expected time. This adds to the overall processing time of tasks in the app. Could you please review this request on priority, as discussed in today's call?
We will share our findings over email.