Open jilladams opened 4 months ago
There is an entry for the search endpoint in this vets-api config file.
Search.gov base url's are defined in settings for search, search_typeahead, and search_click_tracking endpoints.
# Settings for search
search:
access_key: SEARCH_GOV_ACCESS_KEY
affiliate: va
mock_search: false
url: https://search.usa.gov/api/v2
Already using api.gsa.gov
:
# Settings for search-typeahead
search_typeahead:
api_key: API_GOV_ACCESS_KEY
name: va
url: https://api.gsa.gov/technology/searchgov/v1
# Settings for search-click-tracking
search_click_tracking:
access_key: SEARCH_GOV_ACCESS_KEY
affiliate: va
mock: false
url: https://api.gsa.gov/technology/searchgov/v2
Endpoint defnition:
lib/search/configuration.rb:
25 def base_path
26: "#{Settings.search.url}/search/i14y"
27 end
Tests/Betamocks
spec/support/vcr_cassettes/search/503.yml:
4 method: get
5: uri: https://search.usa.gov/api/v2/search/i14y?access_key=TESTKEY&affiliate=va&limit=10&offset=0&query=benefits
6 body:
spec/support/vcr_cassettes/search/504.yml:
4 method: get
5: uri: https://search.usa.gov/api/v2/search/i14y?access_key=TESTKEY&affiliate=va&limit=10&offset=0&query=benefits
6 body:
spec/support/vcr_cassettes/search/empty_query.yml:
4 method: get
5: uri: https://search.usa.gov/api/v2/search/i14y?access_key=TESTKEY&affiliate=va&limit=10&offset=0&query=
6 body:
spec/support/vcr_cassettes/search/exceeds_rate_limit.yml:
4 method: get
5: uri: https://search.usa.gov/api/v2/search/i14y?access_key=TESTKEY&affiliate=va&limit=10&offset=0&query=benefits
6 body:
spec/support/vcr_cassettes/search/invalid_access_key.yml:
4 method: get
5: uri: https://search.usa.gov/api/v2/search/i14y?access_key=INVALIDKEY&affiliate=va&limit=10&offset=0&query=benefits
6 body:
spec/support/vcr_cassettes/search/invalid_affiliate.yml:
4 method: get
5: uri: https://search.usa.gov/api/v2/search/i14y?access_key=TESTKEY&affiliate=INVALID&limit=10&offset=0&query=benefits
6 body:
spec/support/vcr_cassettes/search/last_page.yml:
4 method: get
5: uri: https://search.usa.gov/api/v2/search/i14y?access_key=TESTKEY&affiliate=va&limit=10&offset=80&query=benefits
6 body:
spec/support/vcr_cassettes/search/page_1.yml:
4 method: get
5: uri: https://search.usa.gov/api/v2/search/i14y?access_key=TESTKEY&affiliate=va&limit=10&offset=0&query=benefits
6 body:
spec/support/vcr_cassettes/search/page_2.yml:
4 method: get
5: uri: https://search.usa.gov/api/v2/search/i14y?access_key=TESTKEY&affiliate=va&limit=10&offset=10&query=benefits
6 body:
spec/support/vcr_cassettes/search/success_utf8.yml:
4 method: get
5: uri: https://search.usa.gov/api/v2/search/i14y?access_key=TESTKEY&affiliate=va&limit=10&offset=0&query=benefits
6 body:
spec/support/vcr_cassettes/search/success.yml:
4 method: get
5: uri: https://search.usa.gov/api/v2/search/i14y?access_key=TESTKEY&affiliate=va&limit=10&offset=0&query=benefits
6 body:
spec/support/vcr_cassettes/search_click_tracking/missing_parameter.yml:
4 method: post
5: uri: https://api.gsa.gov/technology/searchgov/v2/clicks/?access_key=TESTKEY&affiliate=va&module_code=I14Y&position=0&query=&url=https://www.testurl.com&user_agent=testUserAgent
6 body:
spec/support/vcr_cassettes/search_click_tracking/success.yml:
4 method: post
5: uri: https://api.gsa.gov/technology/searchgov/v2/clicks/?access_key=TESTKEY&affiliate=va&module_code=I14Y&position=0&query=testQuery&url=https://www.testurl.com&user_agent=testUserAgent
6 body:
❓ Should we also change the typeahead endpoint to use the https://api.gsa.gov/technology/searchgov/v2 url?
Great Q, I'll ask Jim.
From Search.gov:
The typeahead endpoint is actually already using the regular base URL, and actually also a number of other features of the data.api.gov system that the other endpoints don't use. The typeahead endpoint is its own special beast that has not been prioritized for a while, and is definitely on our list for future performance improvements.
With respect to the 500s, we are also seeing that spike, and have put in some mitigations, but are still dealing with something hammering the API servers. We are also in the middle of an infrastructure migration, which, when complete, should give us much better monitoring tools for dealing with the errors.
Settings changes are in vets-api. For testing: we don't know of a way to make this conditional in vets-api.
We cannot test this in a Tugboat. vets-api uses Review Instances, and we think we could test this in Review Instances.
We want to verify the 5 estimate once we can talk to @eselkin about how we might test in vets-api lower envs.
For testing: we don't know of a way to make this conditional in vets-api.
Unsolicited suggestion: could you make the base_path return value dependent on a feature toggle?
For testing: we don't know of a way to make this conditional in vets-api.
Unsolicited suggestion: could you make the base_path return value dependent on a feature toggle?
Yeah good point Adrian. We did discuss putting this behind a feature toggle, but weren't sure if we could/should do that in config settings, but as you pointed out we could actually do that in the configuration class. I'll pursue this as I think it is vital to be able to roll back easily and quickly should things not work as expected.
We need to make sure that after changes in this ticket, our API endpoint we're calling is monitored properly, and that our existing Search monitors that involve this endpoint make sense. That's a blocker to calling the update done, bc accurate monitoring can't lag too long behind the prod changes, since this is a P1 service. So I added that in an AC here. If we opt to break it out into a new ticket, it needs to happen in the same sprint or immediately following shipping the update. Sorry I missed this in refinement.
Our monitors are set up to monitor Search APM sort of globally. However, Search APM includes calls to 3 different endpoints:
Based on this conversation in the monitoring channel, if it's possible, we want to break up our monitors to be more clear about when SearchController proper is failing, vs. Typeahead noise.
Wouldn't it be nice to just handle all of that here. It's nice to want things, Jill, but: I'd be curious to hear from the person who picks up this ticket, whether that's reasonable or unreasonable to try to do that now or cut a follow up ticket.
(Also: if the endpoint is set in internals, monitoring might not be affected at all. I dunno.)
I went ahead and manually updated the Revision Date on this node, and set it to the value that is coming through in migration CSV, until we can work this ticket. Added a revision log. (My change may get overwritten by next migration, which is ok and may be a data point to help us understand what the migration is doing.)
@SnowboardTechie given your Rails background, we thought this might be a good candidate for this sprint, after your other assigned tickets are closed. Let's talk through it when you're ready to start, for context.
EOS: PRs likely to go up today.
From planning:
Mightcouldprobablyshould be a new ticket:
If possible to do at the same time: break apart search monitoring for the v0Search Controller from monitoring v0 Search typeahead
I'll stub that, since it's my brainchild. (FYI @FranECross )
@FranECross or @jilladams with this adjustment being introduced behind a feature flag it is being requested that we have a more detailed timeline for when the flag will be deprecated. My understanding is that we are introducing this behind a flag to be able to safely test this in production without needing to rollback if anything goes wrong. So I think we only need the flag for 1-4weeks depending on what feels appropriate to ya'll (longer gives us more time to verify in monitors that we are seeing the desired decrease in 503s and other errors). We likely want to add a ticket to this epic for the removal of this flag and the search.gov API endpoint from the codebase once ready to fully commit to this endpoint update. How does that sound to you?
In scrum, Fran / Jill agree that ~4 weeks sounds good.
@jilladams sounds great, ticket for the deprecation captured here #19278
https://github.com/department-of-veterans-affairs/vets-api/pull/18450 is all merged and I have successfully tested search on staging.va.gov with the feature flag both enabled and disabled. We are all set for this to deploy Monday for prod testing.
This is now live on production, feature flag enabled and search is functioning as intended.
@FranECross while initial testing on production went well. After a brief time we started receiving 429s due to a lack of fwd proxy being used with the new URLs. The feature flag will need to remain disabled until this is updated to use fwdproxy (probably over here)
be aware that the fwdproxy code has a new home here: https://github.com/department-of-veterans-affairs/vsp-platform-fwdproxy - however I think prod is still deployed from the devops repo, you'll need to check with platform devops to get the current status.
@SnowboardTechie props for handling your first prod escalation like a pro!
If I understood the Platform thread correctly today, there's a necessary step to update the AWS parameter store when we modify the endpoint. In addition to the fwdproxy updates, are you open to me scope-creeping this a bit more, to ask us to find an appropriate place to put documentation of that AWS parameter store update? Could be in code, or in product docs (somewhere in here? https://github.com/department-of-veterans-affairs/va.gov-team/tree/master/products/on-site-search/engineering).
Noting: we may need to ask Devops about docs to point engineers to update the fwdproxy when adding new API endpoints. We had thought the fwdproxy would understand host-level changes, vs. endpoint level changes.
I received notice that the Flipper check in the configuration for this service was causing a lot of slowdowns with environment builds. This required splitting the new endpoint into a separate service. At this time there are 4 remaining pieces to address for this to be completed:
Ah! I missed that last comment yesterday. So this means, in effect, the Search APM in Datadog is gonna end up split as well, since the SearchGSA is a new resource? Or it'll still roll up to Search APM? (Happy to chat about it if I have it all wrong / talking would be easier than writing out to clarify)
@jilladams the way I have this implemented it will be tagging both Search Service's errors with the same label Search/Results
and just slight variations in the log messages to help identify which is being used if something goes wrong. This should result in the Search APM in Datadog playing well with both.
Upon raising search traffic to 50% to the GSA endpoint via the forward proxy today we were once again rate limited. We have emailed search.gov requesting they address the limits.
@jilladams I was wondering if you'd like to close this ticket out and combine the final testing with #19414 ? This is fully implemented and functioning as intended until it hits rate limits so seems like a good combo to not leave this waiting for search.gov updates.
I'm reluctant to close it until we get to 100%, just becuase it's very easy to lose track of the details / where we are with staged rollout. In the past we've cut tix for 50% > 75% > 100% -- that's annoying bookkeeping, tho. In this case: Search.gov acknowledged they have to adjust our rate limits. Would you be comfortable with us leaving this open as Blocked until we get the ack from them that they've adjusted limits? At that point, we can start the bumping up. #19414 can happen in parallel. (I'm in product sync with Michelle / @FranECross , will try to see if this makes sense to them, since 19414 isn't prioritized yet, officially.)
(Went ahead and added label, status update to top of ticket, and moved to "Complete Pending Integration" which is the parking lot for situations like this, til we get unblocked.)
Absolutely no problem leaving this open/blocked. Especially hearing it has caused pain before, no worries.
Did get the greenlight on going ahead with #19414 to pull that in. Thank you!
Status
[2024-10-07] [Jill] Work is complete and in prod behind a flipper. Flipper is disabled until Search.gov adjusts our rate limits. When rate limits are increased, we will need to enable FLipper and step through 25%, 50%, 75%, 100% to confirm they can handle our traffic. We also need to implement #19414 for monitoring / preventive warning. [2024-08-15] [Fran] Updated ticket to include splitting apart monitoring for Search and Search Typeahead to enable applying different tolerances. Slack convo here. [2024-08-13] [Fran] Michelle M has approved this work, although it shouldn't bump critical or project work that's moving toward deadlines.
User Story or Problem Statement
API Endpoint Problem Statement: The Search.gov implementation has been running into monitoring issues, with large numbers of 50x errors from the Search typeahead and main search endpoints. Search.gov has asked us to use a different endpoint for our search queries: https://api.gsa.gov/technology/searchgov/v2/results/i14y
Search and Search Typeahead Monitoring Problem Statement Having the same monitoring tolerance for search typeahead and search is problematic in that the issues we're seeing with search typeahead don't constitute an actual search outage, and having it within regular search is causing unnecessary noise in the existing monitor, thus causing engineers/Jill/watch officers to perform unnecessary triage for something that isn't an outage.
Description or Additional Context
Endpoint Description:
Existing endpoint: https://search.usa.gov/api/v2/results/i14y Suggested new endpoint: https://api.gsa.gov/technology/searchgov/v2/results/i14y
According to search.gov, we only need to change the endpoint. Under the hood, these are the same and have the same parameters / will have the same responses. The new suggested endpoint will allow the Search.gov team to increase the monitoring of VA usage and understand the nature of our API calls and responses in order to better support VA.gov workflow.
Search Monitor Description
Create an additional monitor for only Search Typeahead and remove it from the existing Search monitor. The tolerance level should be very tolerant (engineering/Jill decision here).
Engineering Info
Existing search engineering documentation: search docs.
There are 3 ruby modules involved with Search. Search, search typeahead, and search click tracking. We are only changing the 'search' module.
Implementation Tasks
Endpoint Tasks
Monitoring Tasks
Updated settings:
Override Search base_path based on flipper.
Acceptance Criteria
~Mightcouldprobablyshould be a new ticket:~