amprokop closed this 7 years ago
@shellicious @laurjpeterson — wrote down an idea here. Please refine/mock up/do whatever. Feel free to have a post-standup discussion about it next week if you like.
Two things I'll add:
Here are some stats, just from looking at a Prometheus counter on FetchDocumentById. At the 99% quantile, documents are fetched within 6s. Multi-minute latency is a rare event.
Although high latencies are rare events, I do think it is important to tackle this issue.
Ah, I misread your issue. I didn't realize ListDocuments is the problematic API.
It looks like 30s latency for this API is not uncommon.
Maybe 30s is the wrong number! Should it be more like 10 or 15 seconds? For users of a consumer-facing website, if it's longer than 10s, they're probably gone :) but VA is different of course.
to @mdbenjam:
- Are we sure those load times happen? Reader is under the same constraints and I haven't heard about these timeouts. Maybe this order of magnitude of wait times is when people start saying VBMS is down.
@shellicious shadowed a certification where VBMS wait times caused Cert not to start at all, likely because the browser request timed out before the ListDocuments request finished.
Is there any chance to prefetch files?
We tell users to change the VBMS dates to match VACOLS dates, and we use the ListDocuments call to tell whether or not dates are matching, so we can't cache that call. We could potentially prefetch the form itself, though we don't know exactly which cases will be certified.
When the user clicks “Start Certification” in VACOLS and is redirected to /certifications/new/####, we immediately show a spinner and a “Starting Certification” screen, and load the Check Documents page only when the VBMS/BGS/VACOLS requests complete.
+💯. This is what Reader does. It makes the app feel more responsive.
If we're living in full SPA land, then the initial SPA load should be as lightweight as possible. Don't make any data calls on the backend, especially ones to slow services. Instead, return just the HTML and JS to load the page, display a spinner, and then start on the data requests.
Another benefit of this is it makes it easy to distinguish between slowness of Caseflow and the other VA dependencies. When we can clearly message that it's the dependencies that are down, we can maintain user trust in Caseflow itself.
@shellicious — Hearings Prep and Reader have a similar spinner on app load. We'd like to make this as similar to theirs as possible. From a design perspective, do we want to show the user any messaging? (Like after 30s show "We're sorry, the Veteran's file is taking a long time to load" or something)
From a design perspective, do we want to show the user any messaging? (Like after 30s show "We're sorry, the Veteran's file is taking a long time to load" or something)
👍
From an implementation perspective, @amprokop, this would be a great place to ensure that we're sharing components as much as possible. In addition to the spinner itself, it would be nice to share a component that does the "show the spinner until the data is loaded, then show this other content" logic.
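As a rough sketch of what that shared "spinner until loaded, then content" logic could look like, here is a framework-agnostic version in TypeScript. Every name in it (LoadState, nextLoadState, renderLoadState) is hypothetical and not taken from the Caseflow codebase; a real version would presumably be a React component wrapping the same state transitions.

```typescript
// All names here are hypothetical; none come from the Caseflow codebase.
type LoadState<T> =
  | { status: "loading" }
  | { status: "loaded"; data: T }
  | { status: "failed"; error: string };

type LoadEvent<T> =
  | { type: "resolved"; data: T }
  | { type: "rejected"; error: string };

// Pure transition function: only a "loading" state reacts to events,
// so a late response cannot overwrite a state that has already settled.
function nextLoadState<T>(state: LoadState<T>, event: LoadEvent<T>): LoadState<T> {
  if (state.status !== "loading") {
    return state;
  }
  return event.type === "resolved"
    ? { status: "loaded", data: event.data }
    : { status: "failed", error: event.error };
}

// Chooses what to display: a spinner placeholder while loading, the wrapped
// content once data arrives, or an error message if the request failed.
function renderLoadState<T>(
  state: LoadState<T>,
  renderContent: (data: T) => string
): string {
  switch (state.status) {
    case "loading":
      return "[spinner]";
    case "loaded":
      return renderContent(state.data);
    case "failed":
      return `Error: ${state.error}`;
  }
}
```

Keeping the transition logic separate from rendering would let Certification, Reader, and Hearings Prep share it even if each app draws its spinner slightly differently.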
Absolutely, @NickHeiner. Once we have some design guidance, we can hammer out those details (you've mostly spelled out what I've been thinking, though).
Problem
@shellicious observed in a Montgomery pilot that when users entered Certification, the page was blank and loading for several minutes and never loaded. We determined that this was because the VBMS request was taking between 2 and 5 minutes to successfully complete. Evidently, our user’s browser request was timing out before the VBMS request returned.
Right now, we don't launch the React app until our requests to VBMS (for document dates and Form 9), BGS (for POA information), and VACOLS (for document dates, hearing info, etc.) are complete. It would be a better user experience if we gave the user more information about what was happening, especially when these requests take a long time.
However, it may be counterproductive to set a timeout on our VBMS requests, as we have observed that some eFolders are consistently very slow, and we don’t want to lock users out of using Caseflow Certification to certify appeals associated with those eFolders.
Potential Solution
When the user clicks “Start Certification” in VACOLS and is redirected to /certifications/new/####, we immediately show a spinner and a “Starting Certification” screen, and load the Check Documents page only when the VBMS/BGS/VACOLS requests complete.
If, after 30 seconds, VBMS/BGS/VACOLS requests have not finished, display a message that says something like “Sorry! It’s taking a long time to start certification. We’re trying to fetch information from VBMS and VACOLS. Hold on, please…”
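A minimal sketch of the 30-second message switch, in TypeScript. The function and constant names are made up for illustration, and the threshold is just the 30 seconds proposed above, which may well change after design review.

```typescript
// Hypothetical names; the threshold mirrors the 30 seconds proposed above.
const SLOW_THRESHOLD_MS = 30 * 1000;

// Returns the text to show next to the spinner, given how long we have waited.
function startingMessage(
  elapsedMs: number,
  slowAfterMs: number = SLOW_THRESHOLD_MS
): string {
  if (elapsedMs < slowAfterMs) {
    return "Starting Certification";
  }
  return (
    "Sorry! It’s taking a long time to start certification. " +
    "We’re trying to fetch information from VBMS and VACOLS. Hold on, please…"
  );
}

// Schedules `onSlow` to fire once the threshold passes. The returned function
// cancels the timer and should be called when the requests complete.
function watchForSlowStart(
  onSlow: () => void,
  slowAfterMs: number = SLOW_THRESHOLD_MS
): () => void {
  const timer = setTimeout(onSlow, slowAfterMs);
  return () => clearTimeout(timer);
}
```

The page would call watchForSlowStart when the spinner first renders, swap in the slow message from the callback, and cancel the timer as soon as the data arrives so the apology never flashes on a fast load.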
Open questions
How often do VBMS document list requests take 60 seconds or longer to complete? We have some experience with long-running VBMS requests, but to properly prioritize this, we should have some sense of how often VBMS really slows down.
Should we display an explicit error message and time out after some very long interval (5+ minutes)?
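If we do decide on a hard cutoff, one possible shape for it on the client is sketched below in TypeScript. The names (RequestTimeoutError, withTimeout, VERY_LONG_TIMEOUT_MS) are hypothetical, and the 5-minute value is only the placeholder from the question above. Note that this only stops waiting; it does not cancel the underlying request.

```typescript
// Hypothetical names; the 5-minute value is a placeholder, not a decision.
const VERY_LONG_TIMEOUT_MS = 5 * 60 * 1000;

class RequestTimeoutError extends Error {
  constructor(timeoutMs: number) {
    super(`Request did not finish within ${timeoutMs} ms`);
    this.name = "RequestTimeoutError";
  }
}

// Settles with the request's outcome, or rejects with RequestTimeoutError
// if the request is still pending after `timeoutMs`.
function withTimeout<T>(
  request: Promise<T>,
  timeoutMs: number = VERY_LONG_TIMEOUT_MS
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new RequestTimeoutError(timeoutMs)),
      timeoutMs
    );
    request.then(
      (value) => {
        clearTimeout(timer); // finished in time; drop the deadline
        resolve(value);
      },
      (error) => {
        clearTimeout(timer); // failed on its own; pass the failure through
        reject(error);
      }
    );
  });
}
```

Given that some eFolders are consistently slow, the explicit error shown on timeout would probably need to offer a retry rather than lock the user out.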
How should we save information about the status of the VBMS request? Should we wrap everything in a transaction?