Closed: TylerSchrag-NOAA closed this issue 2 months ago
The 0308 deployment to production revealed that the rendering speed issue is not specific to the UAT environment (it is slow in prod as well). I'm currently adding FIM display filters to the 0308 fixes branch, as well as running some other tests now that the deployment is complete.
I'm going to include the email threads of the last couple weeks here for documentation:
The Symptom - As we've discussed over the last month or so, the integration of FIM4 into both the UAT and Prod environments sparked conversation with Ops regarding the time that FIM layers take to initially load. As testing of FIM4 progressed in UAT, we noticed and documented that the new FIMpact layers could indeed take 15-25 seconds to initially render at CONUS level in an ArcGIS web viewer. This wasn't necessarily surprising for the FIMpact building heat map layers, given the size of the underlying data... but we all also started noticing that even the FIM extent layers were taking a similar amount of time to render for the ANA 14 Day and MRF FIM services. (After Monica's feedback this morning, as well as some other tests I've run on FIM3 vs. FIM4 extents, I'm convinced that nothing changed significantly on the extent performance side. It may be that the FIMpact layers simply pushed the servers past a threshold, from say 90% CPU to 100% CPU, and that has affected everything... but the extents behave essentially the same in all of my tests.)
The Underlying Issue - With some assistance from Bob to access the ArcGIS server nodes yesterday, I finally found the compute bottleneck that is causing these rendering issues: CPU (and potentially RAM) usage regularly spikes to 100% on both of the baseline ArcGIS server nodes, even when only one user loads one of these large inundation services. Here is a screenshot of CPU on both nodes when I loaded the extent and FIMpact layers of the MRF Max Inundation service:
This shows the initial spike from a single user (which lasted about 20 seconds before dropping back to ~3-4%)... and this behavior on production is of course more dire, with regular sustained periods of 100% CPU use (although AWS still reports averages around 30% because of the spiky nature of the load, which is part of why no one noticed this before I suggested getting on the machines themselves to look):
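The gap between the reported ~30% average and the observed 100% spikes is worth making concrete. A minimal sketch, using entirely hypothetical sample values (not the actual monitoring data), of why a coarse average can mask short saturation events:

```python
# Hypothetical illustration: short 100% CPU spikes hidden by an averaged metric.

def summarize(samples):
    """Return (average, peak) CPU utilization for a list of percent samples."""
    return sum(samples) / len(samples), max(samples)

# Hypothetical 1-minute window sampled every 5 seconds: a single user's
# request pins the CPU for ~20 seconds, then load returns to idle.
samples = [100, 100, 100, 100, 4, 3, 4, 3, 4, 3, 4, 3]

avg, peak = summarize(samples)
print(f"average: {avg:.1f}%  peak: {peak}%")  # prints "average: 35.7%  peak: 100%"
```

The averaged number looks benign while the node is actually saturating, which is why spot-checking the machines directly (or alerting on peak rather than mean) caught what the dashboard did not.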
The Design Problems - I figure there are two main problems we need to address here, alongside the solution options listed below:
Potential Solutions - While we answer the questions above, I figure we have a few options to decrease the server load of FIM service requests (and we should probably implement a combination of the options below):
3/22 Follow-Up:
Bob helped me monitor another test today, replicating the same MRF FIM service behavior captured in the screenshots above, this time with a liberal max scale of 1:500,000. This significantly improved the CPU load on the server nodes, with spikes only reaching around 30-40%. Much better, but still a little concerning coming from a single user.
I figured AEP FIM was worth a test as well, as the most data-heavy service. This one is a little more worrying, with spikes closer to ~75%. (Also, this one was set to 1:400k, which in retrospect I probably should have used for the test above.)
So it seems that we should definitely implement a scale filter like Monica originally suggested on the inundation services (1:400k), as soon as convenient.
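For clarity on what the scale filter does, here is a minimal sketch of the visibility rule, assuming scales are expressed as the denominator of the representative fraction (400_000 for 1:400k). The function and constant names are illustrative only, not ArcGIS API calls; in practice this is enforced by setting the layer's max scale property on the service itself.

```python
# Sketch of the 1:400k scale-filter rule. Names are hypothetical, not ArcGIS API.

MAX_SCALE_DENOMINATOR = 400_000  # layer is hidden when zoomed out beyond 1:400k

def inundation_layer_visible(current_scale_denominator: float) -> bool:
    """The layer draws only when the map is zoomed in at least to 1:400k,
    i.e. the current scale denominator is at or below the threshold."""
    return current_scale_denominator <= MAX_SCALE_DENOMINATOR

print(inundation_layer_visible(9_000_000))  # CONUS-level view -> False (no render request)
print(inundation_layer_visible(250_000))    # zoomed in -> True (layer renders)
```

The point of the filter is that a CONUS-level view never issues the expensive render request at all, which is exactly the case that was pinning the server CPU.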
I'd also think this discovery warrants some follow-up consideration of the alerting and load-testing history and plan for the ArcGIS HA setup (which may warrant further exploration of the other mitigation options). I'm not sure whether prior tests did something like this, but we ought to be able to replicate an AEP FIM user test for X users using a script or service, in order to feel really confident about what could happen when X users pull up these services once public.
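An X-user test like the one described could be scripted along these lines. This is a sketch under assumptions: the fetch function is injectable (so the same harness can wrap a real request to the AEP FIM service endpoint, or a stub for dry runs), and all names here are illustrative rather than part of any existing tooling.

```python
# Sketch of a simple X-user concurrent load test. Names are hypothetical.
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def run_load_test(n_users, fetch_fn):
    """Fire n_users concurrent requests and return per-request latencies (seconds)."""
    def timed_request(i):
        start = time.perf_counter()
        fetch_fn(i)  # e.g. an export-map request against the AEP FIM service
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=n_users) as pool:
        return list(pool.map(timed_request, range(n_users)))

# Usage with a stand-in fetch (swap in a real HTTP request for the actual test):
latencies = run_load_test(8, lambda i: time.sleep(0.01))
print(len(latencies), round(mean(latencies), 3))
```

Running this against the real service while watching CPU on the server nodes (as in the single-user tests above) would give the confidence the follow-up issue is asking for.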
Update: This issue was in fact a result of the ArcGIS server nodes being under intense CPU duress (details in comments below). The following actions have been suggested:
p.s. I created a new issue for 'ArcGIS Load Testing Plan' with longer-term follow-up items.