NOAA-OWP / hydrovis


Service Rendering Speed - Investigate and Test Options #332

Closed TylerSchrag-NOAA closed 1 week ago

TylerSchrag-NOAA commented 1 year ago

Update: This issue was in fact a result of the ArcGIS server nodes being under intense CPU duress; details and suggested follow-up actions are in the comments below.

p.s. I created a new issue for 'ArcGIS Load Testing Plan' with longer-term follow-up items.

TylerSchrag-NOAA commented 1 year ago

The 0308 deployment to production revealed that the rendering speed issue is not specific to the UAT environment (it's slow in prod as well). I'm currently adding FIM display filters to the 0308 fixes branch, as well as running some other tests now that the deployment is complete.

TylerSchrag-NOAA commented 1 year ago

I'm going to include the email threads of the last couple weeks here for documentation:

The Symptom - As we've discussed over the last month or so, the integration of FIM4 into both the UAT and prod environments sparked conversation with Ops regarding the time that FIM layers take to initially load. As testing of FIM4 progressed in UAT, we did notice and document that the new FIMpact layers could indeed take 15-25 seconds to initially render at CONUS level in an ArcGIS web viewer. This wasn't necessarily surprising for the FIMpact building/heat-map layers, given the size of the underlying data... but we all also started noticing that even the FIM extent layers were taking a similar amount of time to render for the ANA 14 Day and MRF FIM services. (After Monica's feedback this morning, as well as some other tests I've run on FIM3 vs. FIM4 extents, I'm convinced that nothing really changed significantly on the extent performance side. It might be that we just pushed through a threshold on the servers with the FIMpact layers that got us from, say, 90% CPU to 100% CPU, and that has affected everything... but they are behaving essentially the same in all of my tests.)

The Underlying Issue - With some assistance from Bob to access the ArcGIS server nodes yesterday, I finally found the compute bottleneck that is causing these rendering issues - CPU (and potentially RAM) are regularly spiking to 100% on both of the baseline ArcGIS server nodes, even when only one user loads one of these large inundation services. Here is a screenshot of CPU on both nodes when I loaded the extent and FIMpact layers of the MRF Max Inundation service:

[Screenshot: CPU utilization spiking to 100% on both ArcGIS server nodes while loading the MRF Max Inundation service]

This shows the initial spike from a single user (which lasted for about 20 seconds before going back down to ~3-4%)... and this behavior on production is of course more dire, with regular sustained periods of 100% CPU use (although averages in AWS are still reported around 30% due to the spiky nature of the load, which is part of why no one noticed this before I suggested getting on the machines themselves to look):

[Screenshot: sustained periods of 100% CPU utilization on the production server nodes]
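For what it's worth, the averaging issue is easy to reproduce from the CloudWatch side: pulling the Maximum statistic instead of Average for the same metric makes the spikes visible. A minimal sketch with boto3 (the instance ID, region, and time window are hypothetical placeholders, not our actual nodes):

```python
# Sketch: compare Average vs. Maximum CPUUtilization for an ArcGIS server node.
# Assumes boto3 credentials are configured; instance ID and region are placeholders.
# Note: 1-minute EC2 datapoints require detailed monitoring to be enabled.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
end = datetime.now(timezone.utc)

for stat in ("Average", "Maximum"):
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        StartTime=end - timedelta(hours=1),
        EndTime=end,
        Period=60,  # 1-minute buckets; 5-minute averages hide ~20-second spikes
        Statistics=[stat],
    )
    peak = max((p[stat] for p in resp["Datapoints"]), default=0.0)
    print(f"{stat}: peak 1-minute value over the last hour = {peak:.1f}%")
```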

The Design Problems - I figure there are two main problems here that we need to address, in addition to the solution options listed below:

  1. Why are the alerts / auto-scaling features of the ArcGIS stack not kicking in here? I suspect this comes down to some confusion about how exactly the ArcGIS HA features work / are set up... in that I believe they are designed more for GIS applications where the data load of a single user isn't necessarily heavy, but many users ultimately create a cumulative CPU load that extends for long periods of time, triggering the initialization of more virtual server nodes. It would be good to know exactly what the parameters are for the auto-scaling features and the previous load tests that were done by Zach, Noel, and/or Justin... and, more importantly, the exact parameters of the alerts in ArcGIS Monitor. We should at least be getting alerts for the CPU activity shown above, if not adding additional nodes. This seems like something that would be good for Bob to really understand and know how to work, so perhaps he can dig into the auto-scaling setup more and propose some options on how we can load test the services in a way that simulates 100s and 1,000s of requests for FIM map services (see the load-test sketch after this list). It would also be good to know if the RAM usage is normal for AGS (does it always use almost all available RAM?), or if that is also a bottleneck here.
  2. Is vectorized FIM just too much data? This one's on me. While you and I have spent a lot of time investigating the performance implications of vectorized FIM on the pipeline processes and databases, I frankly assumed that ArcGIS had functionality set up for this kind of thing (some form of pyramid building in the map image tiling that occurs behind the scenes), and did not do due diligence to understand the ArcGIS server performance implications of the design choice we made to vectorize FIM. The fact that I could single-handedly max out both server nodes by opening the MRF FIM layers suggests to me that the server just can't tile that much data quickly enough to fill CONUS FIM requests in a prompt manner for one single user, let alone many.
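As a starting point for that load testing, something like the following could simulate N simultaneous users hitting a map service's standard REST export endpoint. This is only a rough sketch against the public ArcGIS REST API export operation; the service URL, bounding box, and user count are hypothetical placeholders, not our actual endpoints:

```python
# Sketch: fire N concurrent export-map requests at an ArcGIS map service
# to approximate simultaneous users. Service URL and bbox are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

SERVICE_URL = "https://example.com/arcgis/rest/services/mrf_max_inundation/MapServer/export"
PARAMS = {
    "bbox": "-125,24,-66,50",  # rough CONUS extent (lon/lat)
    "bboxSR": "4326",
    "size": "1920,1080",
    "format": "png",
    "transparent": "true",
    "f": "image",
}
NUM_USERS = 100

def one_request(i: int) -> float:
    """Request one rendered map image and return the elapsed seconds."""
    start = time.perf_counter()
    resp = requests.get(SERVICE_URL, params=PARAMS, timeout=120)
    resp.raise_for_status()
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=NUM_USERS) as pool:
    times = list(pool.map(one_request, range(NUM_USERS)))

print(f"min/mean/max render time: {min(times):.1f}s / "
      f"{sum(times) / len(times):.1f}s / {max(times):.1f}s")
```

Watching the node CPU (or the CloudWatch Maximum statistic above) while stepping NUM_USERS up would tell us the concurrency ceiling of the current setup.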

Potential Solutions - While we answer the questions above, I figure we have a few options to decrease the server load of FIM service requests (and we should probably implement a combination of the options below):

  1. Scale Filters - Per our conversations with Monica, if we set FIM extents to not draw until zoomed in to, say, city scale, this should dramatically reduce the problem. I'm not convinced it will go away entirely... so having an updated load-testing plan will be important... but I think this is an easy thing to run some tests on / implement ASAP next week.
  2. Go Back to Raster Extents - Obviously, we could go back to showing FIM extents as a raster service for the sake of performance (those render faster because of the pyramids functionality). We'd still leave the vector part of the pipelines for the sake of the FIMpact layers... but branch off to optimize / mosaic the rasters as well, like what you/Shawn have set up for SCHISM. You'd know better, but I don't think this would be that cumbersome to implement... however, it undoes all the great user-friendly benefits of vectorizing, notably the nice legend and pop-up metadata.
  3. Tile Caching - I'm not sure if this is really possible given the frequency with which we update data... but we could look into caching the tiles of the map image services. I believe this would theoretically result in a very high server load immediately following each update of the FIM services (we'd want to look into having a non-user-facing server do this, like GP, if that's possible), but then the services would behave with the same performance as a raster service with pyramids, while still giving us the benefits of a map service, like pop-ups.
  4. Simplification - Of course, we can always look into further simplification of the FIM polygons to reduce the number of vertices displayed (see the sketch after this list). This would move us away from showing the exact 10m grid shapes that represent the rasters (it would smooth out all the edges), so we'd have to collaborate with the FIM team / field to agree on an acceptable level of generalization that they feel matches the general precision of the model and doesn't significantly reduce the quality of the output.
  5. Increase the Instance Size of the Server Nodes - We of course have the option to scale up to m5.2XL (or other) EC2 instances for the server nodes, but we shouldn't consider this until we at least feel like the current setup can handle, say, 100 simultaneous users.
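To make option 4 concrete, here's a rough sketch of what vertex-reducing generalization could look like with GeoPandas/Shapely. The file paths and tolerance are hypothetical placeholders; the acceptable tolerance is exactly what we'd need to negotiate with the FIM team:

```python
# Sketch: generalize FIM extent polygons to cut vertex counts before publishing.
# Paths and tolerance are placeholders; tolerance is in the layer's CRS units.
import geopandas as gpd

fim = gpd.read_file("mrf_max_inundation_extents.gpkg")  # hypothetical input

def vertex_count(geoms) -> int:
    """Total vertices across all (multi)polygons, for before/after comparison."""
    return sum(
        len(ring.coords)
        for geom in geoms
        for poly in getattr(geom, "geoms", [geom])  # handle MultiPolygon or Polygon
        for ring in [poly.exterior, *poly.interiors]
    )

before = vertex_count(fim.geometry)

# Douglas-Peucker simplification; preserve_topology avoids self-intersections.
# A ~10m tolerance would roughly match the source raster resolution.
fim["geometry"] = fim.geometry.simplify(tolerance=10, preserve_topology=True)

after = vertex_count(fim.geometry)
print(f"vertices: {before:,} -> {after:,} ({1 - after / before:.0%} reduction)")

fim.to_file("mrf_max_inundation_extents_simplified.gpkg")  # hypothetical output
```

If the data lives in PostGIS instead, ST_SimplifyPreserveTopology does the same job server-side.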
TylerSchrag-NOAA commented 1 year ago

3/22 Follow-Up:

Bob helped me monitor another test today, replicating the same MRF FIM service behavior as we captured in the screenshots above... this time with a liberal max scale of 1:500,000. This did significantly improve the CPU load on the server nodes, with spikes only up around 30-40%. Much better, but still a little concerning coming from only one user.

[Screenshot: CPU spikes around 30-40% on the server nodes with the scale filter applied to the MRF FIM service]

I figured AEP FIM was worth a test as well, as the most data-heavy service. This one is a little more worrying, with spikes closer to ~75% (plus, you set this one to 1:400k, which I probably should have chosen for the test above, in retrospect).

[Screenshot: CPU spikes around 75% on the server nodes during the AEP FIM test]

So it seems that we should definitely implement a scale filter (1:400k) on the inundation services, like Monica originally suggested, as soon as convenient.

I'd also think this discovery still warrants some follow-up consideration of the alerts / load-testing history & plan for the ArcGIS HA setup (which may warrant further exploration of the other mitigation options). Not sure if prior tests did something like this, but we ought to be able to replicate an AEP FIM user test for X users using a script or service, in order to feel really confident in what could happen when X users pull up these services once they're public.