Some charts not loading when > 30 people are accessing concurrently

ihburgess commented 1 year ago

If multiple (~30) people are using Clima at the same time, some charts do not load, ex. Heating / Cooling degree day chart. Refreshing the page sometimes helps and sometimes does not.

danielh7-cs9 commented 1 year ago

Just some initial findings from reviewing Google Cloud analytics. The memory utilization on the cloud container at times exceeds 90%, for example today between ~11:20 AM and 11:30 AM.

Something I also noticed is that the server is throwing HTTP 500 errors when loading certain tabs/content. It doesn't seem to be consistent, but I will need to dig deeper.

danielh7-cs9 commented 1 year ago

Just providing an update on findings and my analysis conducted so far.

My hypothesis is that Clima does not have enough compute resources available when processing queries. In my opinion, this is supported by the number of HTTP 500 server responses and failed callback requests causing the internal server errors. Exhibit below.

Firstly, I was able to replicate the issue with certain graphs not loading in the browser. Screenshot below.

Additionally, I replicated the recently reported issue #161, in which switching between SI and IP units "freezes" all the tabs except "select weather file".

Drilling into the Google Cloud instance metrics, I observed that the number of "TypeError" and "AttributeError" errors increased while I was load-testing the web application. The screenshot below shows the logs just after I ran my scripts to simulate 30+ sessions against the web application.

To validate and test my hypothesis, I built a separate Google Cloud Run environment and deployed the same Clima code base for deeper sandbox testing. https://comp703-clima-xbfarm7u6a-ts.a.run.app/

Within this environment, I used two cloud compute configurations for testing.

The first testing scenario replicated the configuration of the current production Clima. The cloud configuration is as follows:

1G memory allocation
1 CPU allocation
80 concurrent sessions

Observations: When running the same load test with the cloud parameters above, I was able to reproduce the same increase in HTTP 500 errors and failed callback errors, and ultimately the same degraded user experience from graphs failing to load.

The number of "TypeError" and "AttributeError" logs on the sandbox Clima also increased after load-testing, similar to what is seen on the production Clima.
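For reference, a load test of this kind can be approximated with a short script. This is a minimal sketch, not the script used for the results in this thread: `run_load_test` and `fetch_status` are hypothetical names, and it only issues plain GET requests for the page, so it does not exercise the Dash callbacks the way real browser sessions do.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_status(url: str, timeout: float = 30.0) -> int:
    """Return the HTTP status code of one GET request (0 if the request fails)."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses; keep the code.
        return err.code
    except OSError:
        # Connection errors and timeouts (URLError is a subclass of OSError).
        return 0


def run_load_test(url: str, sessions: int = 30, fetch=fetch_status) -> Counter:
    """Send `sessions` requests in parallel and tally the status codes."""
    with ThreadPoolExecutor(max_workers=sessions) as pool:
        return Counter(pool.map(fetch, [url] * sessions))


# Example (replace the URL with the sandbox deployment under test):
# print(run_load_test("https://<sandbox-url>/", sessions=30))
```

A rising count of 500 in the returned tally would correspond to the HTTP 500 responses described above.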

The second testing scenario on the sandbox Clima increased the compute configuration to the following:

8G memory allocation
4 CPU allocation
1K concurrent session allowance

Observations:

Firstly, the HTTP 500 server responses and failed callback errors were significantly reduced and mostly not seen in the browser. I was able to generate one or two errors over the full testing cycle, but that is significantly fewer than in the current state.
Secondly, graphs were generated and loaded correctly for the end user, including while load tests were running.
Switching between SI and IP was seamless: the user was not required to reselect the location EPW file, and the app did not "freeze" as described in issue #161.
After increasing the compute resources, the TypeErrors and AttributeErrors stabilized. The output below was taken after deploying the revised container resources.

Next actions:

Work with the service owner to increase the Google Cloud resources to the values used in my sandbox testing (8G, 4 CPU, 1K concurrent sessions) to validate the hypothesis.

FedericoTartarini commented 1 year ago

Thank you again for your help.

I am reading their docs and they say: Cloud Run provides a maximum concurrent requests per instance setting that specifies the maximum number of requests that can be processed simultaneously by a given container instance. If I interpret this correctly, we should reduce this number rather than increase it; otherwise more people will connect to the same instance, more resources will be needed, and we will reach the RAM and CPU limits. Is this correct? Please read this.

I think that we should then try to set concurrency = 1 and see what happens.
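As a rough way to reason about this setting, the number of instances needed scales inversely with the concurrency limit. The sketch below is a simplification under stated assumptions: requests are spread evenly, and the autoscaler's CPU-based scaling and cold-start time are ignored. `instances_needed` is a hypothetical helper, not a Cloud Run API.

```python
import math


def instances_needed(simultaneous_requests: int, concurrency: int) -> int:
    """Estimate how many container instances are required when each
    instance accepts at most `concurrency` requests at a time."""
    return math.ceil(simultaneous_requests / concurrency)


# 30 simultaneous requests under three concurrency settings.
for concurrency in (1, 2, 80):
    print(concurrency, instances_needed(30, concurrency))
```

With concurrency = 1 every simultaneous request needs its own instance, so a lower setting trades shared RAM and CPU for more instances being started.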

I also found out that we can specify the CPU, RAM, and concurrency programmatically using these commands, which is much easier.
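The linked commands are not reproduced in this thread. As a sketch of what such a command could look like, the helper below assembles a `gcloud run services update` call with the `--memory`, `--cpu`, and `--concurrency` flags. The service name `clima` and the values are assumptions for illustration, and `cloud_run_update_command` is a hypothetical helper.

```python
import shlex
from typing import List, Optional


def cloud_run_update_command(
    service: str,
    memory: str,
    cpu: int,
    concurrency: int,
    region: Optional[str] = None,
) -> List[str]:
    """Build the argument list for updating a Cloud Run service's resources."""
    cmd = [
        "gcloud", "run", "services", "update", service,
        f"--memory={memory}",            # e.g. "2Gi"
        f"--cpu={cpu}",                  # number of vCPUs
        f"--concurrency={concurrency}",  # max concurrent requests per instance
    ]
    if region:
        cmd.append(f"--region={region}")
    return cmd


# Print the command for review; run it from a shell, or pass the list
# to subprocess.run(cmd, check=True), once the values are confirmed.
print(shlex.join(cloud_run_update_command("clima", "2Gi", 2, 80)))
```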

I am also a bit concerned about the costs if we increase both the CPU and RAM. Shall I try first to set the concurrency to 2 or 1 and then see what happens?

danielh7-cs9 commented 1 year ago

I have spent time testing the following cloud run parameters per Federico's suggestion on my sandbox clima https://comp703-clima-xbfarm7u6a-ts.a.run.app/

Memory = 1G
CPU = 1
Concurrency = 1

In general, these settings did not provide a good user experience, and I observed degraded application performance:

Graphs were not generating and loading properly.
There was evidence of multiple HTTP 500 and callback errors.
Switching between SI/IP also "froze", i.e. the location was lost and the user had to select the location again.

I also tried increasing memory to 8G, 4CPU with concurrency still set as 1 but it made no improvement to the user experience and application errors. Note that both sets of observations above were made without any substantial load testing.

I totally understand your concerns around compute costs, so perhaps we can trial the following (2G memory, 2 CPU, 80 concurrency) as a compromise. This is effectively just doubling what prod Clima currently has configured. I validated this configuration on my sandbox Clima, and the user experience and application performance were on par with the 8G memory/4 CPU settings. Additionally, I was not seeing HTTP 500 errors or callback errors, and graphs were generating and loading correctly, including when switching between SI and IP. This was also the case while load-testing the cloud service.

FedericoTartarini commented 1 year ago

Thank you so much for testing it. One comment that I have is regarding this sentence:

I also tried increasing memory to 8G, 4CPU with concurrency still set as 1 but it made no improvement to the user experience and application errors.

Are you comparing these results to the 1G and 1 CPU or to the 2G memory, 2 CPU, and 80 concurrencies?

If this solution (8G, 4CPU) did not improve the performance, it is very strange that this one (2G memory, 2 CPU, 80 concurrency) is better, since it has fewer resources allocated and a higher concurrency, i.e., resources are shared across users.

If on the other hand you are comparing the 8G with the 2G and they perform the same, then this is great since we can use it.

The last question I have for you is about CPU and RAM: could you please share the utilization charts for these resources? We could consider increasing only one of the two if the other is never utilized at more than 80% of its current allocation.
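The 80% rule of thumb can be written down as a small check. This is only a sketch: `resources_to_increase` is a hypothetical helper, and the sample values are made up for illustration, not taken from the Clima metrics.

```python
from typing import Dict, List


def resources_to_increase(
    utilization: Dict[str, List[float]], threshold: float = 0.80
) -> List[str]:
    """Return the resources whose peak utilization exceeds `threshold`.

    `utilization` maps a resource name to utilization samples in the
    range 0.0 to 1.0, relative to the current allocation.
    """
    return sorted(
        name
        for name, samples in utilization.items()
        if max(samples, default=0.0) > threshold
    )


# Hypothetical samples: CPU stays low while memory peaks above 80%.
samples = {"cpu": [0.20, 0.35, 0.41], "memory": [0.70, 0.95, 0.88]}
print(resources_to_increase(samples))
```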

danielh7-cs9 commented 1 year ago

re: Are you comparing these results to the 1G and 1 CPU or to the 2G memory, 2 CPU, and 80 concurrencies?

Results are in comparison to 1G, 1CPU and concurrency set as 1. Increasing RAM and CPU (8G, 4CPU) with concurrency still set as 1 made no improvements whatsoever.

8G/4CPU/80 concurrency and 2G/2CPU/80 concurrency performed the same. In both cases I was not seeing HTTP 500 errors or callback errors, and the graphs were loading correctly, including when switching between SI and IP.

re: Utilisation charts - Yep, I can grab them. Which parameters do you want to see, e.g. 2G/2CPU? Ultimately, this will depend on how much load there is. Good point about increasing only one; IMO RAM would be the one to increase, per the image below. I've yet to trial 2G/1CPU/80 concurrency but could test it and report the observations.

FedericoTartarini commented 1 year ago

The CPU does not seem to be the issue.

On the other hand, the RAM often goes to 100%. I deployed a new version of Clima with 2 CPUs and 2G RAM, but we may not need 2 CPUs.

danielh7-cs9 commented 1 year ago

After the changes, the user experience does seem to be far better. Graph loading is far more consistent, and switching between SI and IP is stable.

I still see these errors on cloud metrics but they are far less persistent.

FedericoTartarini commented 1 year ago

Great, I also saw an improvement in performance, and the Clima app is now more responsive and fluid.

It would be great if we could also solve those errors. I was quickly looking into them by clicking on the issue, and they may be caused by the caching that is implemented with the decorator. I can turn that feature off and see if the issue persists. This may worsen a bit the performance of the application but may solve this problem.

danielh7-cs9 commented 1 year ago

re: I can turn that feature off and see if the issue persists. This may worsen a bit the performance of the application but may solve this problem.

Happy to try those proposed changes on my non-prod Clima instance to ensure there are no major performance issues. Just let me know what needs to be changed from a code perspective.

One other thing: I spoke to a few work colleagues who manage our OpenShift container environment, specifically about "concurrency". There are many levels to it, e.g. multi-threading, multi-processing, etc., but in summary it is more or less just parallelism.

I know I've been harping on about concurrency, but based on my testing I do believe it has some effect on the Clima application performance. Would you be open to increasing this to 40 or 80, as originally configured, for testing purposes? Happy to support whatever is required :)

FedericoTartarini commented 1 year ago

Sure I will do it now. I have increased the concurrency to 80.

In terms of removing the cache, I believe it should be as simple as commenting out all the code wrappers like this one: @cache.memoize(timeout=TIMEOUT)

We can also disable the cache by removing the following code in app.py

cache = Cache(
    app.server,
    config={
        "CACHE_TYPE": "flask_caching.backends.SimpleCache",
        "CACHE_DIR": "cache-directory",
    },
)
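An alternative that leaves the decorators in place, as a sketch: Flask-Caching ships a `NullCache` backend that stores nothing, so switching `CACHE_TYPE` to it should make the memoized functions run on every call. The `cache_config` helper and the `CLIMA_CACHE_ENABLED` environment variable are hypothetical names introduced for illustration; the enabled branch mirrors the configuration shown above.

```python
import os


def cache_config(enabled: bool) -> dict:
    """Return the Flask-Caching config, with caching switched on or off."""
    if enabled:
        # Same settings as the existing configuration in app.py.
        return {
            "CACHE_TYPE": "flask_caching.backends.SimpleCache",
            "CACHE_DIR": "cache-directory",
        }
    # NullCache never stores values, so every lookup is a miss.
    return {"CACHE_TYPE": "flask_caching.backends.NullCache"}


# Hypothetical toggle: caching stays on unless the variable is set to "0".
CACHE_ENABLED = os.environ.get("CLIMA_CACHE_ENABLED", "1") != "0"

# In app.py this would be passed to the existing constructor:
# cache = Cache(app.server, config=cache_config(CACHE_ENABLED))
```

This keeps the change to a single setting, which makes it easy to compare the logs with and without caching.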

danielh7-cs9 commented 1 year ago

I was debugging another issue but did observe something causing the "KeyError: None" error. It seems to occur when the "clear value" x on a graph's selector is clicked, or when there are two selection criteria and one is cleared, i.e. below
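`KeyError: None` is what Python raises when a dict is indexed with `None`, and a cleared single-value Dash dropdown passes `None` to its callback. Assuming that is the mechanism here (not confirmed in this thread), a guard could look like the sketch below. `VARIABLE_LABELS` and `label_for` are hypothetical names, not taken from the Clima code.

```python
from typing import Optional

# Hypothetical lookup table standing in for whatever dict the callback indexes.
VARIABLE_LABELS = {
    "DBT": "Dry bulb temperature",
    "RH": "Relative humidity",
}


def label_for(selected: Optional[str]) -> Optional[str]:
    """Look up the label for a dropdown value, tolerating a cleared dropdown."""
    if selected is None:
        # Without this guard, VARIABLE_LABELS[None] raises KeyError: None.
        # Inside a Dash callback one would typically raise
        # dash.exceptions.PreventUpdate or return dash.no_update here.
        return None
    return VARIABLE_LABELS[selected]
```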

danielh7-cs9 commented 1 year ago

I removed all the caching code and code wrappers and redeployed to my Clima instance. Unfortunately, the errors still persist in the logs.

FedericoTartarini commented 1 year ago

Yes, then unfortunately the issue is with the underlying code and not related to the caching. Please let me know if you find out the source of the error.

danielh7-cs9 commented 1 year ago

Closing this issue as the original symptom is resolved. Opening a new issue to resolve the remaining errors reported in GCP.

CenterForTheBuiltEnvironment / clima

Some charts not loading when > 30 people are accessing concurrently #147