department-of-veterans-affairs / va.gov-cms

Editor-centered management for Veteran-centered content.
https://prod.cms.va.gov
GNU General Public License v2.0
99 stars 69 forks source link

Update Search.gov implementation to use different API endpoint #18736

Open jilladams opened 4 months ago

jilladams commented 4 months ago

Status

[2024-10-07] [Jill] Work is complete and in prod behind a flipper. Flipper is disabled until Search.gov adjusts our rate limits. When rate limits are increased, we will need to enable FLipper and step through 25%, 50%, 75%, 100% to confirm they can handle our traffic. We also need to implement #19414 for monitoring / preventive warning. [2024-08-15] [Fran] Updated ticket to include splitting apart monitoring for Search and Search Typeahead to enable applying different tolerances. Slack convo here. [2024-08-13] [Fran] Michelle M has approved this work, although it shouldn't bump critical or project work that's moving toward deadlines.

User Story or Problem Statement

Description or Additional Context

Endpoint Description:

Existing endpoint: https://search.usa.gov/api/v2/results/i14y Suggested new endpoint: https://api.gsa.gov/technology/searchgov/v2/results/i14y

According to search.gov, we only need to change the endpoint. Under the hood, these are the same and have the same parameters / will have the same responses. The new suggested endpoint will allow the Search.gov team to increase the monitoring of VA usage and understand the nature of our API calls and responses in order to better support VA.gov workflow.

Search Monitor Description

Create an additional monitor for only Search Typeahead and remove it from the existing Search monitor. The tolerance level should be very tolerant (engineering/Jill decision here).

Engineering Info

Existing search engineering documentation: search docs.

There are 3 ruby modules involved with Search. Search, search typeahead, and search click tracking. We are only changing the 'search' module.

Implementation Tasks

Endpoint Tasks

Monitoring Tasks

Updated settings:

# Settings for search using api.gsa.gov
search:
  access_key: SEARCH_GOV_ACCESS_KEY
  affiliate: va
  mock_search: false
  gsa_url: https://api.gsa.gov/technology/searchgov/v2
  url: https://search.usa.gov/api/v2

Override Search base_path based on flipper.

def base_path
  if Flipper.enabled?(:search_use_v2_gsa, current_user)
    "#{Settings.search.gsa_url}/results/i14y"
  else
    "#{Settings.search.url}/search/i14y"
  end
end

Acceptance Criteria

~Mightcouldprobablyshould be a new ticket:~

randimays commented 4 months ago

There is an entry for the search endpoint in this vets-api config file.

dsasser commented 4 months ago

Engineering Pre-finement notes:

Search.gov base url's are defined in settings for search, search_typeahead, and search_click_tracking endpoints.

# Settings for search
search:
  access_key: SEARCH_GOV_ACCESS_KEY
  affiliate: va
  mock_search: false
  url: https://search.usa.gov/api/v2

Already using api.gsa.gov:

# Settings for search-typeahead
search_typeahead:
  api_key: API_GOV_ACCESS_KEY
  name: va
  url: https://api.gsa.gov/technology/searchgov/v1

  # Settings for search-click-tracking
search_click_tracking:
  access_key: SEARCH_GOV_ACCESS_KEY
  affiliate: va
  mock: false
  url: https://api.gsa.gov/technology/searchgov/v2

Endpoint defnition:

lib/search/configuration.rb:
  25      def base_path
  26:       "#{Settings.search.url}/search/i14y"
  27      end

Tests/Betamocks


spec/support/vcr_cassettes/search/503.yml:
  4      method: get
  5:     uri: https://search.usa.gov/api/v2/search/i14y?access_key=TESTKEY&affiliate=va&limit=10&offset=0&query=benefits
  6      body:

spec/support/vcr_cassettes/search/504.yml:
  4      method: get
  5:     uri: https://search.usa.gov/api/v2/search/i14y?access_key=TESTKEY&affiliate=va&limit=10&offset=0&query=benefits
  6      body:

spec/support/vcr_cassettes/search/empty_query.yml:
  4      method: get
  5:     uri: https://search.usa.gov/api/v2/search/i14y?access_key=TESTKEY&affiliate=va&limit=10&offset=0&query=
  6      body:

spec/support/vcr_cassettes/search/exceeds_rate_limit.yml:
  4      method: get
  5:     uri: https://search.usa.gov/api/v2/search/i14y?access_key=TESTKEY&affiliate=va&limit=10&offset=0&query=benefits
  6      body:

spec/support/vcr_cassettes/search/invalid_access_key.yml:
  4      method: get
  5:     uri: https://search.usa.gov/api/v2/search/i14y?access_key=INVALIDKEY&affiliate=va&limit=10&offset=0&query=benefits
  6      body:

spec/support/vcr_cassettes/search/invalid_affiliate.yml:
  4      method: get
  5:     uri: https://search.usa.gov/api/v2/search/i14y?access_key=TESTKEY&affiliate=INVALID&limit=10&offset=0&query=benefits
  6      body:

spec/support/vcr_cassettes/search/last_page.yml:
  4      method: get
  5:     uri: https://search.usa.gov/api/v2/search/i14y?access_key=TESTKEY&affiliate=va&limit=10&offset=80&query=benefits
  6      body:

spec/support/vcr_cassettes/search/page_1.yml:
  4      method: get
  5:     uri: https://search.usa.gov/api/v2/search/i14y?access_key=TESTKEY&affiliate=va&limit=10&offset=0&query=benefits
  6      body:

spec/support/vcr_cassettes/search/page_2.yml:
  4      method: get
  5:     uri: https://search.usa.gov/api/v2/search/i14y?access_key=TESTKEY&affiliate=va&limit=10&offset=10&query=benefits
  6      body:

spec/support/vcr_cassettes/search/success_utf8.yml:
  4      method: get
  5:     uri: https://search.usa.gov/api/v2/search/i14y?access_key=TESTKEY&affiliate=va&limit=10&offset=0&query=benefits
  6      body:

spec/support/vcr_cassettes/search/success.yml:
  4      method: get
  5:     uri: https://search.usa.gov/api/v2/search/i14y?access_key=TESTKEY&affiliate=va&limit=10&offset=0&query=benefits
  6      body:

spec/support/vcr_cassettes/search_click_tracking/missing_parameter.yml:
  4        method: post
  5:       uri: https://api.gsa.gov/technology/searchgov/v2/clicks/?access_key=TESTKEY&affiliate=va&module_code=I14Y&position=0&query=&url=https://www.testurl.com&user_agent=testUserAgent
  6        body:

spec/support/vcr_cassettes/search_click_tracking/success.yml:
  4        method: post
  5:       uri: https://api.gsa.gov/technology/searchgov/v2/clicks/?access_key=TESTKEY&affiliate=va&module_code=I14Y&position=0&query=testQuery&url=https://www.testurl.com&user_agent=testUserAgent
  6        body:

❓ Should we also change the typeahead endpoint to use the https://api.gsa.gov/technology/searchgov/v2 url?

jilladams commented 4 months ago

Great Q, I'll ask Jim.

jilladams commented 4 months ago

From Search.gov:

The typeahead endpoint is actually already using the regular base URL, and actually also a number of other features of the data.api.gov system that the other endpoints don't use. The typeahead endpoint is its own special beast that has not been prioritized for a while, and is definitely on our list for future performance improvements.

With respect to the 500s, we are also seeing that spike, and have put in some mitigations, but are still dealing with something hammering the API servers. We are also in the middle of an infrastructure migration, which, when complete, should give us much better monitoring tools for dealing with the errors.

jilladams commented 4 months ago

Settings changes are in vets-api. For testing: we don't know of a way to make this conditional in vets-api.

We cannot test this in a Tugboat. vets-api uses Review Instances, and we think we could test this in Review Instances.

We want to verify the 5 estimate once we can talk to @eselkin about how we might test in vets-api lower envs.

acrollet commented 4 months ago

For testing: we don't know of a way to make this conditional in vets-api.

Unsolicited suggestion: could you make the base_path return value dependent on a feature toggle?

dsasser commented 4 months ago

For testing: we don't know of a way to make this conditional in vets-api.

Unsolicited suggestion: could you make the base_path return value dependent on a feature toggle?

Yeah good point Adrian. We did discuss putting this behind a feature toggle, but weren't sure if we could/should do that in config settings, but as you pointed out we could actually do that in the configuration class. I'll pursue this as I think it is vital to be able to roll back easily and quickly should things not work as expected.

jilladams commented 3 months ago

A pontification on monitoring

Pertinent to this ticket

We need to make sure that after changes in this ticket, our API endpoint we're calling is monitored properly, and that our existing Search monitors that involve this endpoint make sense. That's a blocker to calling the update done, bc accurate monitoring can't lag too long behind the prod changes, since this is a P1 service. So I added that in an AC here. If we opt to break it out into a new ticket, it needs to happen in the same sprint or immediately following shipping the update. Sorry I missed this in refinement.

Bigger picture on Search

Our monitors are set up to monitor Search APM sort of globally. However, Search APM includes calls to 3 different endpoints: Screenshot 2024-08-14 at 2 09 49 PM

  1. V0::SearchController#index: When calls to this endpoint fail, a site user cannot search. Failures to this endpoint matter a lot. Historically a quieter / more stable endpoint.
  2. V0::SearchClickTrackingController#create - When calls to this endpoint fail, clicks don't get tracked by Search.gov. We don't care that much. Historically very noisy with errors.
  3. V0::SearchTypeaheadController#index - When calls to this endpoint fail, a user doesn't receive back typeahead responses from Search.gov, but search can still be completed. It's a UX nicety, but it's not mission critical. Historically very noisy with errors.

Based on this conversation in the monitoring channel, if it's possible, we want to break up our monitors to be more clear about when SearchController proper is failing, vs. Typeahead noise.

Wouldn't it be nice to just handle all of that here. It's nice to want things, Jill, but: I'd be curious to hear from the person who picks up this ticket, whether that's reasonable or unreasonable to try to do that now or cut a follow up ticket.

jilladams commented 3 months ago

(Also: if the endpoint is set in internals, monitoring might not be affected at all. I dunno.)

jilladams commented 3 months ago

I went ahead and manually updated the Revision Date on this node, and set it to the value that is coming through in migration CSV, until we can work this ticket. Added a revision log. (My change may get overwritten by next migration, which is ok and may be a data point to help us understand what the migration is doing.)

jilladams commented 2 months ago

@SnowboardTechie given your Rails background, we thought this might be a good candidate for this sprint, after your other assigned tickets are closed. Let's talk through it when you're ready to start, for context.

jilladams commented 2 months ago

EOS: PRs likely to go up today.

jilladams commented 2 months ago

From planning:

Mightcouldprobablyshould be a new ticket:

If possible to do at the same time: break apart search monitoring for the v0Search Controller from monitoring v0 Search typeahead

I'll stub that, since it's my brainchild. (FYI @FranECross )

jilladams commented 2 months ago

Now in https://github.com/department-of-veterans-affairs/va.gov-cms/issues/19256

SnowboardTechie commented 2 months ago

@FranECross or @jilladams with this adjustment being introduced behind a feature flag it is being requested that we have a more detailed timeline for when the flag will be deprecated. My understanding is that we are introducing this behind a flag to be able to safely test this in production without needing to rollback if anything goes wrong. So I think we only need the flag for 1-4weeks depending on what feels appropriate to ya'll (longer gives us more time to verify in monitors that we are seeing the desired decrease in 503s and other errors). We likely want to add a ticket to this epic for the removal of this flag and the search.gov API endpoint from the codebase once ready to fully commit to this endpoint update. How does that sound to you?

jilladams commented 2 months ago

In scrum, Fran / Jill agree that ~4 weeks sounds good.

SnowboardTechie commented 2 months ago

@jilladams sounds great, ticket for the deprecation captured here #19278

SnowboardTechie commented 2 months ago

https://github.com/department-of-veterans-affairs/vets-api/pull/18450 is all merged and I have successfully tested search on staging.va.gov with the feature flag both enabled and disabled. We are all set for this to deploy Monday for prod testing.

SnowboardTechie commented 2 months ago

This is now live on production, feature flag enabled and search is functioning as intended.

SnowboardTechie commented 2 months ago

@FranECross while initial testing on production went well. After a brief time we started receiving 429s due to a lack of fwd proxy being used with the new URLs. The feature flag will need to remain disabled until this is updated to use fwdproxy (probably over here)

acrollet commented 2 months ago

be aware that the fwdproxy code has a new home here: https://github.com/department-of-veterans-affairs/vsp-platform-fwdproxy - however I think prod is still deployed from the devops repo, you'll need to check with platform devops to get the current status.

jilladams commented 2 months ago

@SnowboardTechie props for handling your first prod escalation like a pro!

If I understood the Platform thread correctly today, there's a necessary step to update the AWS parameter store when we modify the endpoint. In addition to the fwdproxy updates, are you open to me scope-creeping this a bit more, to ask us to find an appropriate place to put documentation of that AWS parameter store update? Could be in code, or in product docs (somewhere in here? https://github.com/department-of-veterans-affairs/va.gov-team/tree/master/products/on-site-search/engineering).

jilladams commented 2 months ago

Noting: we may need to ask Devops about docs to point engineers to update the fwdproxy when adding new API endpoints. We had thought the fwdproxy would understand host-level changes, vs. endpoint level changes.

SnowboardTechie commented 1 month ago

I received notice that the Flipper check in the configuration for this service was causing a lot of slowdowns with environment builds. This required splitting the new endpoint into a separate service. At this time there are 4 remaining pieces to address for this to be completed:

  1. https://github.com/department-of-veterans-affairs/vets-api/pull/18693 moves the new API into the new SearchGsa service.
  2. https://github.com/department-of-veterans-affairs/vsp-infra-application-manifests/pull/3120 updates the manifests for production to overwrite environment variables so that forward proxy is utilized and we avoid rate limiting with the search API.
  3. 18693 from above is now open for review, once merged, this will allow for full testing on staging of the already updated manifests(only updated for staging/dev), which when successful will allow me to open https://github.com/department-of-veterans-affairs/vsp-infra-application-manifests/pull/3120 for review to deliver those manifests to production.
  4. Once all above is complete we are safe to start opening the feature flag back up on production for UAT
jilladams commented 1 month ago

Ah! I missed that last comment yesterday. So this means, in effect, the Search APM in Datadog is gonna end up split as well, since the SearchGSA is a new resource? Or it'll still roll up to Search APM? (Happy to chat about it if I have it all wrong / talking would be easier than writing out to clarify)

SnowboardTechie commented 1 month ago

@jilladams the way I have this implemented it will be tagging both Search Service's errors with the same label Search/Results and just slight variations in the log messages to help identify which is being used if something goes wrong. This should result in the Search APM in Datadog playing well with both.

SnowboardTechie commented 1 month ago

Upon raising search traffic to 50% to the GSA endpoint via the forward proxy today we were once again rate limited. We have emailed search.gov requesting they address the limits.

SnowboardTechie commented 1 month ago

@jilladams I was wondering if you'd like to close this ticket out and combine the final testing with #19414 ? This is fully implemented and functioning as intended until it hits rate limits so seems like a good combo to not leave this waiting for search.gov updates.

jilladams commented 1 month ago

I'm reluctant to close it until we get to 100%, just becuase it's very easy to lose track of the details / where we are with staged rollout. In the past we've cut tix for 50% > 75% > 100% -- that's annoying bookkeeping, tho. In this case: Search.gov acknowledged they have to adjust our rate limits. Would you be comfortable with us leaving this open as Blocked until we get the ack from them that they've adjusted limits? At that point, we can start the bumping up. #19414 can happen in parallel. (I'm in product sync with Michelle / @FranECross , will try to see if this makes sense to them, since 19414 isn't prioritized yet, officially.)

jilladams commented 1 month ago

(Went ahead and added label, status update to top of ticket, and moved to "Complete Pending Integration" which is the parking lot for situations like this, til we get unblocked.)

SnowboardTechie commented 1 month ago

Absolutely no problem leaving this open/blocked. Especially hearing it has caused pain before, no worries.

jilladams commented 1 month ago

Did get the greenlight on going ahead with #19414 to pull that in. Thank you!