GSA-TTS / FAC

GSA's Federal Audit Clearinghouse

API / DB performance and stability #4285

Open danswick opened 2 months ago

danswick commented 2 months ago

Following difficulties deploying API v1.1.1, we discussed whether the current materialized view approach is still the right one. Deploying the full stack depends on a precise order of operations that can be fragile and prone to subtle failures.

Related: https://github.com/GSA-TTS/FAC/issues/4039 and friends.

Tasks

- [x] Describe the underlying issues in more detail.
- [x] Discuss alternative strategies (read replica, some other startup process, etc).

Solution pathway

We've decided to move the API to the secondary database (fac-snapshot-db, or "DB2"). This solves multiple performance and load issues within the application.

- [x] Design, implement process for copying data to secondary database (replace `MATERIALIZED VIEW`)
- [x] Move API definition to DB2
- [x] Split API definitions so that management commands (e.g. Tribal access management) remain on DB1.
- [ ] https://github.com/GSA-TTS/FAC/issues/4407
- [x] Remove the Tribal access administrative API
- [x] Update E2E tests to not use the API (#4398)
- [x] Update Advanced Search to use DB2
- [x] Add gate code to all SQL/API startup scripts (e.g. make sure the right table is present or fail fast)
- [x] Remove all the spurious/extra `NOTIFY` statements in the SQL pre/post
- [x] Duplicate `dissemination_general` into `api_general` or similar in DB2 (see below)
- [x] Remove `sling` from API standup process
- [ ] Deprecation plan for older APIs + comms plan about moving to this new version of the API (from 1.1.0 to 2.0)
- [x] Remove old `census_` tables from early cog/over work
- [ ] Ticket documenting the new API
- [x] Move the init for curation tracking into the pre-sequence for fac-db
- [x] Remove unnecessary `CREATE SCHEMA`
- [ ] Add `suppressed_` migration error tables to api v2
- [ ] Ticket the investigation/improved generation of XLSX/CSV documents off of the guaranteed-public data tables (bulk data)
- [x] Remove extraneous `ALTER` permissions code; only `GRANT` after-the-fact
- [x] Make sure permissions (`GRANT`) come late, not early.
- [x] Sync run.sh/.profile, possibly have them share code?
- [x] Sync env setup between local/cgov, possibly have them share code?
- [ ] Ticket the refactoring of the config now that we have VCAP_SERVICES in the local env.
- [ ] #4039
- [ ] Ticket improving the API test script to cover checks for suppressed data, etc., using a non-privileged key.
- [ ] Ticket development of a router for the databases, so that models are routed to DB1/DB2 in a consistent manner, as opposed to handling `public_100.combined` in a one-off manner
- [x] Update docker-compose-web to match docker-compose
- [x] Update TF with variable changes to reflect new state
- [ ] Ticket refactoring `fac-backup-util.sh` and `nightly-...` so they are in `util` and source shared functions.
- [ ] Ticket adding a new test that runs after replication to make sure that data is properly visible/suppressed when requested via the API (possibly add to nightly data sling)
- [ ] Fix remaining issue in `0144` SQL. (Trivial rename.)

Example of gating code for standup: https://github.com/GSA-TTS/FAC/blob/cf6b5c909251b45f08ed96e677d53c88337c328e/backend/dissemination/sql/fac-snapshot-db/post/020_api_v1_1_0.sql#L1
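The general shape of that gating pattern is sketched below. This is a hedged illustration, not the contents of the linked file: the schema/table being checked and the error text are placeholders.

```sql
-- Gate: fail fast if the table this script depends on is not present yet.
DO $$
BEGIN
    IF NOT EXISTS (
        SELECT 1
        FROM information_schema.tables
        WHERE table_schema = 'public'
          AND table_name   = 'dissemination_general'
    ) THEN
        RAISE EXCEPTION 'Gate failed: public.dissemination_general is missing; aborting API standup.';
    END IF;
END
$$;
```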

Spurious NOTIFY statements are everything except for the last one in finalize.

Our nightly backup (and more importantly, our deploys) will want to DROP/recreate dissemination_general in DB2 (fac-snapshot-db). So, we need to make yet one more copy of the dissemination_* tables for the api_v1_1_0 VIEWs to point at. (That is, a copy we can create/tear down/etc.)
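A sketch of what that copy step could look like, in line with the "duplicate `dissemination_general` into `api_general`" task above. The `api_dissem` schema name is a placeholder (assumed to already exist), and the real sequencing lives in the SQL pre/post scripts:

```sql
-- Build a stable, API-facing copy so the nightly DROP/recreate of
-- dissemination_general does not pull the rug out from under api_v1_1_0.
DROP TABLE IF EXISTS api_dissem.api_general;
CREATE TABLE api_dissem.api_general AS
    SELECT * FROM public.dissemination_general;

-- Point the API VIEW at the copy rather than at the replicated table.
CREATE OR REPLACE VIEW api_v1_1_0.general AS
    SELECT * FROM api_dissem.api_general;
```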

Work is underway in https://github.com/GSA-TTS/FAC/tree/jadudm/api-perf

jadudm commented 1 month ago

We have determined that there are multiple problems with our current API design.

Problem 1: Users downloading all of the data

We provide a web-based search (for interactive use) and an API (for systems/application use). When it comes to accessing all of the data, the API is the only tool users have. While it may be that some users do not have to download all of the data all of the time, there are users for whom having multiple years of data (or "all of it") is necessary.

This means we have people who are pulling several million rows (5M in federal_awards alone) 20K rows at a time.

To be clear: We are not saying "people using our API is a problem." Far from it! However, we simply aren't optimized for this kind of use-pattern, as further investigation makes clear...

The underlying problem: linear search

This is the underlying problem (illustrated with an image in the original issue):

When someone uses the API to download the first 20K rows, the EXPLAINed cost is around 1200 DBUs ("database units"; per the Postgres documentation, EXPLAIN costs are effectively arbitrary/unitless). Unfortunately, when someone fetches the next 20K rows, the engine must scan past the first 20K rows before reading the next 20K, which costs around 2800 DBUs. Likewise, we then scan past 40K rows and read the next 20K, costing around 3900 DBUs... and as a result, the cost of each subsequent set of rows goes up. There is a scan cost and a download cost, and the scan cost just keeps growing.
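To make the pattern concrete, here is the shape of the queries the API effectively issues for successive pages (table and column names are illustrative); the estimated cost reported by EXPLAIN grows with the OFFSET:

```sql
-- Page 1: cheap.
EXPLAIN SELECT * FROM federal_awards ORDER BY id LIMIT 20000 OFFSET 0;
-- Page 2: must scan past the first 20K rows before returning anything.
EXPLAIN SELECT * FROM federal_awards ORDER BY id LIMIT 20000 OFFSET 20000;
-- Page 3: scans past 40K rows, and so on; the cost keeps climbing.
EXPLAIN SELECT * FROM federal_awards ORDER BY id LIMIT 20000 OFFSET 40000;
```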

The total cost to download 4M rows, when executed via the API on a local development environment, is 50M DBUs. That is expensive. And the local dev environment is likely more performant than our RDS environment (certainly, there is less contention), so we can assume the numbers presented here are a lower bound on the costs we will see in production.

We do index the DB. Unfortunately, the Postgres B-tree index does not improve the performance of linear scans; to optimize for the case described above, we would need a counted B-tree index. Apparently, Oracle has one. Postgres does not. So, we cannot "simply" fix this problem with a new index.

A possible solution: adding batches

We can index the table(s) smartly for bulk download. We could do the following:

  1. Index by batch. We could build an index on `div(row_id, 20000)`.
  2. Next, we could provide an API RPC that lets users call `get_by_batch(batch_no)`; a sketch follows.
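A minimal sketch of both steps, assuming a sequential integer column named `row_id`; every name here is illustrative rather than the final implementation:

```sql
-- 1. Expression index on the batch expression, so a batch lookup is an
--    index scan no matter where in the table the batch lives.
CREATE INDEX federal_awards_batch_idx
    ON federal_awards ((div(row_id, 20000)));

-- 2. A function PostgREST can expose as /rpc/get_by_batch.
CREATE OR REPLACE FUNCTION get_by_batch(batch_no integer)
RETURNS SETOF federal_awards
LANGUAGE sql STABLE
AS $$
    SELECT *
    FROM federal_awards
    WHERE div(row_id, 20000) = batch_no;
$$;
```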

When this approach is implemented locally, the cost to download an arbitrary batch becomes approximately 45 DBUs. The cost to download 4M rows is then 9K DBUs, or a 5500x improvement over linearly scanning the DB via API.

Problem 2: VIEWs

The reason we can't "just" implement batched API downloads is that we don't expose tables; we expose VIEWs.

And, worse: every VIEW in our API includes a JOIN.

And, we cannot apply indexes to a VIEW.

So even if we apply smart indexing to our underlying tables, our JOINed API VIEWs will not see any benefit. And if we instead exposed the batches on the tables directly, they would not be "the same" as the data exposed via the current API: we would be missing some of the values that currently come from the JOIN.
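For illustration only (the real definitions live in the API SQL), the current VIEWs are shaped roughly like this, which is why indexing the underlying table does not make paging through the VIEW cheap:

```sql
-- Illustrative shape of a current API VIEW: every row is a JOIN product,
-- so batch/offset optimizations on the base table don't carry through.
CREATE VIEW api_v1_1_0.federal_awards AS
    SELECT
        gen.report_id,
        gen.auditee_name,          -- values that only exist via the JOIN
        award.federal_agency_prefix,
        award.amount_expended
    FROM dissemination_federalaward award
    JOIN dissemination_general gen
      ON gen.report_id = award.report_id;
```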

Problem 3: MATERIALIZED VIEW (or MV)

Related, we have a MV that we use for our Advanced Search. We would like to expose this via API, as it provides a pre-computed, 4-way join across general, federal_awards, findings, and passthrough. It would turn most queries into simple SELECT statements for all of our users. (This is why Advanced Search is so performant: it only looks at the MV).

Unfortunately, we cannot point a VIEW at a MATERIALIZED VIEW, meaning that our PostgREST-powered API can't easily expose this MV. (We believe this is true, based on experimentation. Perhaps if we just put the MV in the same schema as the API we could expose it... but I don't like that solution.)

Problem 4: Everything is on one DB

Currently, our intake and our dissemination are all on one database. This means when the API traffic gets large, we impact intake and the web-based search performance.

The fix for this would be a second database.

Proposed solution: Use a second DB and TABLEs

The proposed/spiked solution in jadudm/api-perf is pictured in a diagram (omitted here).

In a nutshell:

  1. This approach treats the FAC intake as a data pipeline that terminates in the dissemination_ tables. Although they're not quite 1NF, they're consistent, and changing them would be very disruptive. So, let's leave those alone.
  2. Create a nightly process that generates a set of public tables. Those tables will contain only public data. They will look very, very similar to our current API. Put another way: these tables can pre-compute the JOINs that are currently baked into our API.
  3. Put the new public tables on our second/backup database. Currently, it is being used for pre-deploy snapshots. This change means we would replicate our intake DB nightly, and then generate public tables off of that data, entirely in the second DB. (This means the workload of data mangling all falls on the second DB, instead of our primary DB.)
  4. Instead of generating the MV nightly, we'll generate an equivalent TABLE called combined (see the sketch after this list).
  5. Leave Basic Search pointed at the dissemination_ tables. This means users of Basic Search always see current, up-to-the-second results. Submitters can see that we received their submissions in realtime. (We have no indication, at this time, that Basic Search represents any kind of performance problem.)
  6. Point Advanced Search at the new, public combined table. Given that it is identical to the MV, this should not require any substantial app changes to maintain Advanced Search "as is."
  7. Point the API at the second database. We can maintain the existing APIs, because we'll be replicating the dissemination_ tables as-is, as well as building new public tables for an improved, more performant API.
  8. If we wish to generate full data dumps at a later point, we will have tables "right-shaped" for those dumps ready-to-go.
  9. Authentication and access control for suppressed data becomes simpler. Why? Because we now have suppressed data in a completely separate set of tables. (We can do table-level access control, instead of row-level.)
  10. We can rewrite api_v1_0_3 and api_v1_1_0 to use the new tables, and eliminate the JOINs. We might even be able to backport batch downloads to api_v1_1_0, but it is probably better to encourage a move to a new public_api_v1_0_0.
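A sketch of the nightly rebuild that replaces the MV, running entirely on fac-snapshot-db; the `public_data` schema and the selected columns are placeholders, not the final pipeline:

```sql
-- Nightly: tear down and rebuild the pre-joined table behind Advanced
-- Search and the new API, using the replicated dissemination_ tables.
DROP TABLE IF EXISTS public_data.combined;

CREATE TABLE public_data.combined AS
    SELECT
        gen.report_id,
        gen.audit_year,
        gen.auditee_name,
        award.federal_agency_prefix,
        award.amount_expended,
        finding.reference_number,
        pass.passthrough_name
    FROM dissemination_general gen
    JOIN dissemination_federalaward award
      ON award.report_id = gen.report_id
    LEFT JOIN dissemination_finding finding
      ON finding.report_id = gen.report_id
    LEFT JOIN dissemination_passthrough pass
      ON pass.report_id = gen.report_id
    WHERE gen.is_public;            -- public data only

CREATE INDEX ON public_data.combined (audit_year, report_id);
```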

There are a few impacts.

  1. The API is currently up-to-the-second. Now, the API will be up-to-date on a nightly basis. Given the nature of Single Audits, this should be acceptable. That is, our Federal users are generally updating systems on a daily or weekly basis. Having a nightly refresh schedule should not impact Federal systems integration.
  2. We will need to generate a similar set of tables for suppressed data; a sketch of the resulting table-level access control follows. This is not difficult, but ultimately will require some API changes. (api_v1_1_0 will continue to work as-is. Any future APIs will have a separate space for suppressed data.)
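Because suppressed rows would live in tables of their own, the access control described in item 9 above reduces to ordinary schema/table GRANTs. A sketch, with role and schema names as placeholders:

```sql
-- Anonymous API users can read only the public tables.
GRANT USAGE ON SCHEMA public_data TO api_anon;
GRANT SELECT ON ALL TABLES IN SCHEMA public_data TO api_anon;

-- Only the privileged role (e.g. keyed Tribal/Federal access) can read
-- the suppressed tables; no row-level checks required.
GRANT USAGE ON SCHEMA suppressed_data TO api_privileged;
GRANT SELECT ON ALL TABLES IN SCHEMA suppressed_data TO api_privileged;
```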

Conclusion

The intersection of multiple performance-impacting problems suggests that any one fix is not going to address the underlying issues.

  1. "Just" adding batches to our existing tables does not work, because we expose them via VIEWs that include JOINs, and therefore we cannot realize a benefit that is 1:1 compatible with our existing API shape.
  2. "Just" moving to a second DB does not address the fact that our APIs all have VIEWs baked in.
  3. "Just" rewriting VIEWs to tables does not address the fact that we have everything on one DB, and we will still be stressing a single engine with both intake and dissemination traffic.

The branch jadudm/api-perf explores this fully, and performance comparison data will be provided in a subsequent comment. (Or: the proof of the pudding will be in the eating.)

jadudm commented 1 month ago

What follows is a discussion of performance testing on a local development machine, using 4M rows of data in the federal_awards table.

EXPLAINed performance

Enabling the observability features of PostgREST, it is possible to get an EXPLAIN for each API call.

Under API v1.1.0 (api110) and a new API built on the new public tables (public100), I fetch 4M rows, 20K at a time. For each 20K fetch (with a corresponding OFFSET), I capture the "total cost" of the query. Summing all of these total costs gives us a cost in DBUs for downloading the entire database.

| units | api110 | public100 | batches |
| --- | --- | --- | --- |
| DBUs | 36742666 | 26626400 | 1130022 |
| Relative | 32x | 23x | 1x |

(I'm rounding here. I really don't care about anything to the right of the decimal point, given the order of magnitude of the numbers involved.)

It cost 37M DBUs to download all 4M rows via api110, 27M DBUs to download via public100, and 1M DBUs to download via batches.

The difference between 37M and 27M is, I believe, because public100 as an API has no JOIN statements. That means pre-computing the JOINs yields roughly a 30% improvement in fetch cost.

We also add a `batch_number` column in public100, which is a pre-computed value, `div(row, 20000)`. It is then indexed. As a result, it is possible to write a query like

`GET https://api-url/federal_awards?batch_number=eq.200`

Because it is indexed, fetching a batch anywhere in the 4M rows has the same cost. Therefore, it is roughly 30x less expensive than api110.
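A sketch of how `batch_number` could be populated and indexed when the public table is built, assuming a sequential `row_id` column; all names here are illustrative:

```sql
-- Pre-compute the batch number once, at table-build time, then index it
-- so ?batch_number=eq.N is an index scan regardless of position.
ALTER TABLE public_data.federal_awards
    ADD COLUMN IF NOT EXISTS batch_number integer;

UPDATE public_data.federal_awards
   SET batch_number = div(row_id, 20000);

CREATE INDEX IF NOT EXISTS federal_awards_batch_number_idx
    ON public_data.federal_awards (batch_number);
```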

If people are going to download the entire dataset via API, this is the least expensive way we can offer it.

Timed performance

On the same local dev machine (meaning that networking costs are negligible, and we have more CPU and RAM available than in our RDS instances), we can measure how long it takes to fetch 4M rows of data via the API. Again, the absolute values here are a best case, and real-world performance will likely be worse; the relative timings, however, should be consistent.

| units | api110 | public100 | batches |
| --- | --- | --- | --- |
| Seconds | 188 | 56 | 27 |
| Relative | 6x | 2x | 1x |

In terms of actual time, it takes just over 3 minutes to download all 4M rows via api110. Again, this is with the script doing the downloading running on the same host that the development version of the FAC is running on. Bandwidth is effectively infinite, latency is effectively zero.

It takes roughly 1 minute to download all of the data using the API without any JOIN statements.

When we download using the new batches, it takes just under 30 seconds.

It is, therefore, 6x faster to download the data via batches than the current API, and roughly 2x faster than using the optimized tables directly (with OFFSET values).

Conclusion

It is possible to improve the API for the FAC. We can do so while maintaining a roughly consistent table shape (i.e. the same tables), adding columns (to improve search possibilities), and, in doing so, optimizing for use cases we see in the wild (e.g. downloading all of the data via the API).

Improvements based on EXPLAIN values are as much as 30x, and clock time as much as 6x. Testing in the cloud.gov environment to come.

The numbers above were generated with the rough-and-ready script here.

jadudm commented 1 month ago

As a note from conversation today: running a snapshot backup as part of a deploy will likely collide with api_v1_1_0, because it points at the dissemination_ tables in the second database. I should use sling to create a copy of the dissemination_ tables and use that for the API.

In other words, even though it is a "no-op," it is part of the data pipeline: the dissemination tables that are backed up into fac-snapshot-db should not be actively used. They are a stepping-stone to further pipeline work.

E.g.

Backup tables from DB1->DB2 --> Copy those tables into various forms --> Point API at those tables

is what we want.

danswick commented 5 days ago

Capturing notes from group discussion:

We would like to break this PR up into chunks and deploy in stages to help with both review and testing. Here's the initial division we came up with:

  1. Admin API tear-out and replacement. @rnovak338.
  2. API schema and backup. @sambodeme and @rnovak338
  3. Stand up new API pointing at DB2 and run simultaneously with the current version of the API.
    1. The simultaneous approach could be challenging, though not impossible: we'd need another PostgREST instance for the new DB (a Terraform change) and some changes to our data.gov configuration to support both.
  4. Bulk download functionality + partitions/20k chunks.
  5. API performance monitoring (TBD)