ihburgess closed this issue 1 year ago
Some initial findings from reviewing the Google Cloud analytics: memory utilization on the cloud container at times exceeds 90%, for example today between ~11:20 AM and 11:30 AM.
Something I also noticed is that the server throws HTTP 500 errors when loading certain tabs/content. It doesn't seem to be consistent, so I will need to dig deeper.
Just providing an update on findings and my analysis conducted so far.
My hypothesis is that Clima does not have enough compute resources available when processing queries. In my opinion, this is supported by the number of HTTP 500 responses and failed callback requests causing the internal server errors. Exhibit below.
Firstly, I was able to replicate the issue with certain graphs not loading in the browser. Screenshot below.
Additionally, I replicated the recently reported issue #161, whereby switching between SI and IP units “freezes” all the tabs except “select weather file”.
Drilling into the Google Cloud instance metrics, I observed that the number of "TypeError" and "AttributeError" errors increased when simulating a load test on the web application. The screenshot below shows the logs just after I ran my scripts to simulate 30+ sessions against the web application.
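For anyone wanting to reproduce a similar load, here is a minimal sketch of a concurrent page-load test. It is not the script used above (that script is not shown in this thread); the function names `fetch_status` and `run_load_test` are hypothetical, and a plain GET only exercises the initial page load, not the callbacks the app fires when a tab or chart is opened.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_status(url, timeout=30):
    """Return the HTTP status code for one GET request, or -1 on a network failure."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; keep the status code.
        return err.code
    except OSError:
        # Covers URLError, timeouts and connection resets.
        return -1


def run_load_test(url, sessions=30, fetch=fetch_status):
    """Fire `sessions` requests in parallel and count the status codes returned."""
    with ThreadPoolExecutor(max_workers=sessions) as pool:
        statuses = list(pool.map(fetch, [url] * sessions))
    return Counter(statuses)


# Example (performs real network requests, so it is left commented out):
# print(run_load_test("https://<your-sandbox-service-url>/", sessions=30))
```

A count of 500s in the returned `Counter` that grows with `sessions` would be consistent with the behaviour described above, but it would not by itself identify whether memory, CPU, or concurrency is the limiting factor.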
To validate and test my hypothesis, I built a separate Google Cloud Run environment and deployed the same Clima code base for deeper sandbox testing: https://comp703-clima-xbfarm7u6a-ts.a.run.app/
Within this environment, I've used two cloud compute configurations for testing.
The first testing scenario replicated the configuration of the current production Clima. The cloud configurations are as follows:
Observations: When conducting the same load testing with the cloud parameters above, I was able to generate the same increasing HTTP 500 errors and failed callback errors, which ultimately impacted the user experience through graphs failing to load.
The number of "TypeError" and "AttributeError" logs on the sandbox Clima Google Cloud instance also increased after load testing, similar to what is seen on the production Clima.
The second testing scenario on the sandbox Clima increased the compute configuration to the below:
- 8G memory allocation
Observations:
Next actions:
Thank you again for your help.
I am reading their docs and they say: "Cloud Run provides a maximum concurrent requests per instance setting that specifies the maximum number of requests that can be processed simultaneously by a given container instance." If I interpret this correctly, we should reduce this number rather than increase it; otherwise more people will connect to the same instance, more resources will be needed, and we will reach the RAM and CPU limits. Is this correct? Please read this.
I think that we should then try to set concurrency = 1 and see what happens.
I also found out that we can specify the CPU, RAM, and concurrency programmatically using these commands, which is much easier.
I am also a bit concerned about the costs if we increase both the CPU and RAM. Shall I first try setting the concurrency to 2 or 1 and see what happens?
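As an illustration of setting these values from a script, here is a small sketch that builds a `gcloud run services update` command with the `--memory`, `--cpu`, and `--concurrency` flags. It is not necessarily the set of commands referred to above; the service name `clima-sandbox`, the region `us-central1`, and the helper name `cloud_run_update_command` are placeholders I made up for the example.

```python
import shlex


def cloud_run_update_command(service, region, memory="2Gi", cpu=2, concurrency=80):
    """Build the gcloud argument list that updates a Cloud Run service's resources."""
    return [
        "gcloud", "run", "services", "update", service,
        "--region", region,
        "--memory", memory,                  # e.g. "1Gi", "2Gi", "8Gi"
        "--cpu", str(cpu),                   # number of vCPUs per instance
        "--concurrency", str(concurrency),   # max simultaneous requests per instance
    ]


# Hypothetical service and region; replace with the real ones.
command = cloud_run_update_command(
    "clima-sandbox", "us-central1", memory="1Gi", cpu=1, concurrency=1
)
print(shlex.join(command))

# To actually apply the change (requires an authenticated gcloud CLI):
# import subprocess
# subprocess.run(command, check=True)
```

Building the command as a list keeps each tested configuration (1G/1 CPU/1, 2G/2 CPU/80, and so on) reproducible and easy to record next to the observations.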
I spent time testing the following Cloud Run parameters on my sandbox Clima, per Federico's suggestion: https://comp703-clima-xbfarm7u6a-ts.a.run.app/
Memory = 1G, CPU = 1, Concurrency = 1
In general, these settings did not provide a good user experience; application performance was degraded.
I also tried increasing to 8G memory and 4 CPUs with concurrency still set to 1, but it made no improvement to the user experience or the application errors. Both of these observations were made without any substantial load testing.
I totally understand your concerns around compute costs, so perhaps we can trial the following (2G memory, 2 CPU, 80 concurrency) as a compromise, effectively just doubling what prod Clima currently has configured. I validated this configuration on my sandbox Clima, and the user experience and application performance were on par with the 8G memory/4 CPU settings. Additionally, I was not seeing HTTP 500 errors or callback errors, and graphs were loading correctly, including when switching between SI and IP units. The same held when load testing the cloud service.
Thank you so much for testing it. One comment that I have is regarding this sentence:
I also tried increasing memory to 8G, 4CPU with concurrency still set as 1 but it made no improvement to the user experience and application errors.
Are you comparing these results to the 1G and 1 CPU or to the 2G memory, 2 CPU, and 80 concurrencies?
If this solution (8G, 4CPU) did not improve the performance, it is very strange that 2G memory, 2 CPU, 80 concurrency is better, since it has fewer resources allocated and a higher concurrency, i.e., resources are shared across users.
If, on the other hand, you are comparing the 8G with the 2G and they perform the same, then this is great since we can use it.
The last question I have for you is about CPU and RAM: could you please share the utilization charts for these resources? We could consider increasing only one of the two if the other never exceeds 80% of its current allocation.
re: Are you comparing these results to the 1G and 1 CPU or to the 2G memory, 2 CPU, and 80 concurrencies?
Results are in comparison to 1G, 1CPU and concurrency set as 1. Increasing RAM and CPU (8G, 4CPU) with concurrency still set as 1 made no improvements whatsoever.
8G/4CPU/80 concurrency and 2G/2CPU/80 concurrency performed the same. In both cases I was not seeing HTTP 500 errors or callback errors, and the graphs were loading correctly, including when switching between SI and IP units.
re: Utilisation charts. Yep, I can grab them. For which parameters do you want to see them, e.g. 2G/2CPU? Ultimately, this will depend on how much load there is. Good point about increasing only one; IMO RAM would be the one to increase, per the image below. I've yet to trial 2G/1CPU/80 concurrency but could test it and report the observations.
The CPU does not seem to be the issue.
On the other hand, the RAM often goes to 100%. I deployed a new version of Clima with 2 CPUs and 2G RAM, but we may not need 2 CPUs.
After the changes, the user experience does seem far better. Graph loading is far more consistent and switching between SI and IP is stable.
I still see these errors in the cloud metrics, but they are far less persistent.
Great, I also saw an improvement in performance, and the Clima app is now more responsive and fluid.
It would be great if we could also solve those errors. I was quickly looking into them by clicking on the issue, and they may be caused by the caching that is implemented with the decorator. I can turn that feature off and see if the issue persists. This may slightly worsen the performance of the application but may solve this problem.
re: "I can turn that feature off and see if the issue persists. This may worsen a bit the performance of the application but may solve this problem."
Happy to try those proposed changes on my non-prod Clima instance to ensure there are no major performance issues. Just let me know what needs to be changed from a code perspective.
One other thing: I spoke to a few work colleagues who manage our OpenShift container environment, specifically about "concurrency". There are many levels to it, e.g. multi-threading, multi-processing, etc., but in summary it is more or less just parallelism.
I know I've been harping on about concurrency, but I do believe it has some effect on the Clima application's performance (based on my testing). Would you be open to increasing this to 40 or 80, as originally configured, for testing purposes? Happy to support whatever is required :)
Sure I will do it now. I have increased the concurrency to 80.
In terms of removing the cache, I believe it should be as simple as commenting out all the decorators like this one: @cache.memoize(timeout=TIMEOUT)
We can also disable the cache by removing the following code in app.py:
cache = Cache(
    app.server,
    config={
        "CACHE_TYPE": "flask_caching.backends.SimpleCache",
        "CACHE_DIR": "cache-directory",
    },
)
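An alternative to commenting out every decorator is to switch the cache backend off through configuration. The sketch below assumes Flask-Caching's built-in `NullCache` backend, which accepts the same calls but stores nothing; the environment variable `CLIMA_DISABLE_CACHE` and the helper `cache_config` are hypothetical names used only for illustration.

```python
import os


def cache_config(enabled):
    """Return a Flask-Caching config dict; NullCache keeps the decorators but stores nothing."""
    if enabled:
        return {"CACHE_TYPE": "flask_caching.backends.SimpleCache"}
    return {"CACHE_TYPE": "NullCache"}


# Hypothetical switch: set CLIMA_DISABLE_CACHE=1 in the environment to turn caching off.
caching_enabled = os.environ.get("CLIMA_DISABLE_CACHE", "0") != "1"
config = cache_config(caching_enabled)

# In app.py this would replace the hard-coded dict, for example:
# cache = Cache(app.server, config=config)
```

With this approach the `@cache.memoize(timeout=TIMEOUT)` decorators can stay in place, which makes it easier to compare the logs with and without caching on the same code.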
While debugging another issue, I observed something causing the "KeyError: None" error. It seems to occur when the "clear value" x on a graph's selector is clicked, or when there are two selection criteria and one is cleared, i.e. below
I removed all the caching code and decorators and redeployed to my Clima instance. Unfortunately, the errors still persist in the logs.
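One way this kind of error can arise, sketched under the assumption that a cleared selector passes `None` to the callback and the code then uses that value as a dictionary key. I have not confirmed this is what the Clima code does; `VARIABLE_LABELS` and `label_for` are made-up names for the example.

```python
from typing import Optional

# Hypothetical lookup table keyed by the selector's value.
VARIABLE_LABELS = {
    "DBT": "Dry bulb temperature",
    "RH": "Relative humidity",
}


def label_for(selection: Optional[str]) -> Optional[str]:
    """Look up the label for a selection, treating a cleared selector (None) as 'nothing to plot'."""
    if selection is None:
        # Without this guard, VARIABLE_LABELS[None] raises KeyError: None.
        return None
    return VARIABLE_LABELS[selection]


print(label_for("DBT"))
print(label_for(None))
```

In a callback, the `None` branch would typically skip the update or return an empty figure instead of continuing with the lookup.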
Yes, then unfortunately the issue is with the underlying code and not related to the caching. Please let me know if you find out the source of the error.
Closing this issue as the original symptom is resolved. Opening a new issue to resolve the remaining errors presented in GCP.
If multiple (~30) people are using Clima at the same time, some charts do not load, e.g. the Heating / Cooling degree day chart. Refreshing the page sometimes helps and sometimes does not.