We have determined that there are multiple problems with our current API design.
We provide a web-based search (for interactive use) and an API (for systems/application use). When it comes to accessing all of the data, the API is the only tool users have. While it may be that some users do not have to download all of the data all of the time, there are users for whom having multiple years of data (or "all of it") is necessary.
This means we have people who are pulling several million rows (5M in `federal_awards` alone) 20K rows at a time.
To be clear: we are not saying "people using our API is a problem." Far from it! We simply aren't optimized for this kind of use pattern, as further investigation makes clear...
This is the underlying problem:
When someone uses the API to download the first 20K rows, the `EXPLAIN`ed cost is around 1200 DBUs (database units; `EXPLAIN` cost, according to the Postgres documentation, is effectively an arbitrary/unitless measure). Unfortunately, when someone fetches the next 20K rows, the engine needs to scan the first 20K rows and then download the next 20K (this costs around 2800 DBUs). Likewise, we then scan 40K rows and download the next 20K, costing around 3900 DBUs... and, as a result, the cost of each subsequent set of rows goes up. There is a scan cost and a download cost, and the scan cost just keeps going up.
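For context, this is roughly the shape of paginated query the API issues under the hood; the table and ordering column are illustrative, but the `LIMIT`/`OFFSET` pattern is what drives the growing scan cost:

```sql
-- Roughly the query shape behind each 20K-row page (names illustrative).
-- OFFSET forces the engine to walk past every earlier row before
-- returning the next 20K, so each successive page costs more to produce.
EXPLAIN
SELECT *
FROM federal_awards
ORDER BY id
LIMIT 20000
OFFSET 40000;  -- third page: scan and discard 40K rows, return 20K
```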
The total cost to download 4M rows, when executing via the API on a local development environment, is 50M DBUs. That is expensive. In this particular instance, the local dev environment might be more performant than our RDS environment. Certainly, there is less contention. So, we can assume that the numbers presented here represent a lower bound on the costs and performance we will see in production.
We do index the DB. Unfortunately, the Postgres B-tree index does not improve the performance of linear scans; to optimize for the case described above, we would need a counted B-tree index. Apparently, Oracle has one. Postgres does not. So, we cannot "simply" fix this problem with a new index.
## Batches

We can index the table(s) smartly for bulk download. We could do the following:

- Assign every row to a 20K-row `batch`.
- We could build an index on `div(row_id, 20000)`.
- Provide an accessor along the lines of `get_by_batch(batch_no)`.

When this approach is implemented locally, the cost to download an arbitrary batch becomes approximately 45 DBUs. The cost to download 4M rows is then 9K DBUs, or a 5500x improvement over linearly scanning the DB via the API.
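A minimal sketch of the idea, assuming an integer key column `id` on `federal_awards` (the index name, key column, and function body are illustrative, not the spike's actual code):

```sql
-- Sketch of the batching idea (key column and names are assumptions).
-- The expression index lets the planner jump straight to a batch
-- instead of scanning past OFFSET rows.
CREATE INDEX federal_awards_batch_idx
    ON federal_awards (div(id, 20000));

-- An accessor along the lines of get_by_batch(batch_no):
CREATE OR REPLACE FUNCTION get_by_batch(batch_no integer)
RETURNS SETOF federal_awards
LANGUAGE sql STABLE
AS $$
    SELECT * FROM federal_awards WHERE div(id, 20000) = batch_no;
$$;
```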
## `VIEW`s

The reason we can't "just" implement batched API downloads is because we don't expose tables; we expose `VIEW`s.
And, worse: every `VIEW` in our API includes a `JOIN`.

And, we cannot apply indexes to a `VIEW`.

So... even if we apply smart indexing to our underlying tables, our `JOIN`ed API `VIEW`s are not going to see any benefit. And if we chose to expose the batches directly, they would not be "the same" as the data exposed via the current API; we would be missing some of the values that we currently provide via the `JOIN`.
## `MATERIALIZED VIEW` (or `MV`)

Related, we have a `MV` that we use for our Advanced Search. We would like to expose this via the API, as it provides a pre-computed, 4-way join across `general`, `federal_awards`, `findings`, and `passthrough`. It would turn most queries into simple `SELECT` statements for all of our users. (This is why Advanced Search is so performant: it only looks at the `MV`.)

Unfortunately, we cannot point a `VIEW` at a `MATERIALIZED VIEW`, meaning that our PostgREST-powered API can't easily expose this MV. (We believe this is true, based on experimentation. Perhaps if we just put the MV on the same schema as the API we could expose it... but I don't like that solution.)
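For reference, the MV is along these lines. This is only a hedged sketch: the join keys and the column list are assumptions, and the real MV carries many more columns.

```sql
-- Hedged sketch of a 4-way-join materialized view along the lines of the
-- one backing Advanced Search (join keys and columns are illustrative).
CREATE MATERIALIZED VIEW combined_mv AS
SELECT
    g.report_id,
    fa.federal_program_name,
    f.finding_ref_number,
    p.passthrough_name
FROM general g
JOIN federal_awards fa USING (report_id)
LEFT JOIN findings f USING (report_id)
LEFT JOIN passthrough p USING (report_id);
```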
Currently, our intake and our dissemination are all on one database. This means that when API traffic gets large, it impacts intake and web-based search performance.
The fix for this would be a second database.
## `TABLE`s

In a picture, the proposed/spiked solution in `jadudm/api-perf` looks like the following:
In a nutshell:

- We keep the `dissemination_` tables. Although they're not quite 1NF, they're consistent, and changing them would be very disruptive. So, let's leave those alone.
- We pre-compute the `JOIN`s that are currently baked into our API.
- In place of the `MV`, nightly, we'll generate an equivalent `TABLE` called `combined` (a sketch follows this list).
- Basic Search continues to run against the `dissemination_` tables. This means users of Basic Search always see current, up-to-the-second results. Submitters can see that we received their submissions in realtime. (We have no indication, at this time, that Basic Search represents any kind of performance problem.)
- Advanced Search runs against the `combined` table. Given that it is identical to the `MV`, this should not require any substantial app changes to maintain Advanced Search "as is."
- We copy the `dissemination_` tables to the second database as-is, as well as building new public tables for an improved, more performant API.
- We rework `api_v1_0_3` and `api_v1_1_0` to use the new tables and eliminate the `JOIN`s. We might even be able to backport batch downloads to `api_v1_1_0`, but it is probably better to encourage a move to a new `public_api_v1_0_0`.
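A hedged sketch of that nightly step, reusing the hypothetical `combined_mv` shape from above. In the spike, the table would be built in DB2 from the copied `dissemination_` tables; every name here is an assumption.

```sql
-- Hedged sketch of the nightly rebuild (names assumed): materialize the
-- same rows the MV would hold into a plain TABLE called combined, then
-- index it for batched download.
BEGIN;
DROP TABLE IF EXISTS combined;
CREATE TABLE combined AS
    SELECT row_number() OVER () AS row_id,
           mv.*
    FROM combined_mv mv;     -- or re-run the 4-way join query directly
CREATE INDEX combined_batch_idx
    ON combined (div(row_id, 20000));
COMMIT;
```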
There are a few impacts. (`api_v1_1_0` will continue to work as-is. Any future APIs will have a separate space for suppressed data.)

The intersection of multiple performance-impacting problems suggests that any one fix is not going to address the underlying issues.
- We expose `VIEW`s that include `JOIN`s, and therefore we cannot realize a benefit that is 1:1 compatible with our existing API shape.
- We cannot easily expose the `MV`, which has those `VIEW`s baked in.
- Converting `VIEW`s to tables does not address the fact that we have everything on one DB, and we will still be stressing a single engine with both intake and dissemination traffic.

The branch `jadudm/api-perf` explores this fully, and performance comparison data will be provided in a subsequent comment. (Or: the proof of the pudding will be in the eating.)
What follows is a discussion of performance testing on a local development machine, using 4M rows of data in the `federal_awards` table.
## `EXPLAIN`ed performance

Enabling the observability features of PostgREST, it is possible to get an `EXPLAIN` for each API call.
Under API v1.1.0 (`api110`) and a new API built on new, public tables (`public100`), I fetch 4M rows in 20K batches. For each 20K fetch (with a corresponding `OFFSET`), I capture the "total cost" of the query. Summing all of these total costs gives us a cost, in DBUs, for downloading the entire database.
| units | api110 | public100 | batches |
|---|---|---|---|
| DBUs | 36742666 | 26626400 | 1130022 |
| Relative | 32x | 23x | 1x |
(I'm rounding here. I really don't care about anything to the right of the decimal point, given the order of magnitude of the numbers involved.)
It cost 37M DBUs to download all 4M rows via `api110`, 27M DBUs to download via `public100`, and 1M DBUs to download via batches.
The difference between 37M and 27M is, I believe, because `public100` as an API has no `JOIN` statements. That means there is roughly a 30% improvement in fetch cost from pre-computing the `JOIN`s.
We also add a `batch_number` column in `public100`, and this is a pre-computed value, `div(row, 20000)`. It is then indexed. As a result, it is possible to write a query like

`GET https://api-url/federal_awards?batch_number=eq.200`
Because it is indexed, fetching a batch anywhere in the 4M rows has the same cost. Therefore, it is roughly 30x less expensive than `api110`.
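A minimal sketch of what that column and index could look like, assuming the table name and an integer key column `id`; the spike may simply populate the value at table-build time rather than via a generated column.

```sql
-- Hedged sketch: store batch_number directly and index it, so PostgREST
-- can filter with ?batch_number=eq.N and the planner uses an index scan
-- no matter where the batch sits in the table. (Generated columns need
-- PostgreSQL 12+; a plain column filled at load time works just as well.)
ALTER TABLE federal_awards
    ADD COLUMN batch_number integer
    GENERATED ALWAYS AS (div(id, 20000)::integer) STORED;

CREATE INDEX federal_awards_batch_number_idx
    ON federal_awards (batch_number);
```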
If people are going to download the entire dataset via API, this is the least expensive way we can offer it.
On the same local dev machine (meaning that networking costs are negligible, and we have more CPU and RAM available than in our RDS instances), we can time how long it takes to fetch 4M rows of data via the API. Again, these values likely represent a best case, and real-world performance will likely be worse; however, the relative timings should be consistent.
| units | api110 | public100 | batches |
|---|---|---|---|
| Seconds | 188 | 56 | 27 |
| Relative | 6x | 2x | 1x |
In terms of actual time, it takes just over 3 minutes to download all 4M rows via `api110`. Again, this is with the download script running on the same host as the development version of the FAC: bandwidth is effectively infinite, and latency is effectively zero.
It takes roughly 1 minute to download all of the data using the API without any `JOIN` statements.
When we download using the new batches, it takes just under 30 seconds.
It is, therefore, 6x faster to download the data via batches than the current API, and roughly 2x faster than using the optimized tables directly (with `OFFSET` values).
It is possible to improve the API for the FAC. We can do so while maintaining a roughly consistent table shape (e.g. the same tables), adding columns (to improve search possibilities), and in doing so, provide optimization for use cases we see in the wild (e.g. downloading of all data via the API).
Improvements based on EXPLAIN
values are as much as 30x, and clock time as much as 6x. Testing in the cloud.gov environment to come.
The numbers above were generated with the rough-and-ready script here.
As a note from conversation today: running a snapshot backup as part of a deploy will likely collide with `api_v1_1_0`, because it points at the `dissemination_` tables in the second database. I should use `sling` to create a copy of the `dissemination_` tables and use that for the API.
In other words, even though it is a "no-op," it is part of the data pipeline: the dissemination tables that are backed up into `fac-snapshot-db` should not be actively used. They are a stepping-stone to further pipeline work.
E.g., what we want is:

Backup tables from DB1->DB2 --> Copy those tables into various forms --> Point API at those tables
Capturing notes from group discussion:
We would like to break this PR up into chunks and deploy in stages to help with both review and testing. Here's the initial division we came up with:
Following difficulties deploying API v1.1.1, we discussed whether the current materialized view approach is still the right one. Deploying the full stack depends on a precise order of operations that can be fragile and prone to subtle failures.
Related: https://github.com/GSA-TTS/FAC/issues/4039 and friends.
## Tasks

## Solution pathway
We've decided to move the API to the secondary database (`fac-snapshot-db`, or "DB2"). This solves multiple performance and load issues within the application.

Example of gating code for standup: https://github.com/GSA-TTS/FAC/blob/cf6b5c909251b45f08ed96e677d53c88337c328e/backend/dissemination/sql/fac-snapshot-db/post/020_api_v1_1_0.sql#L1
Spurious `NOTIFY` statements are everything except for the last one in `finalize`.

Our nightly backup (and, more importantly, our deploys) will want to DROP/recreate `dissemination_general` in DB2 (`fac-snapshot-db`). So, we need to make yet one more copy of the `dissemination_*` tables for the `api_v1_1_0` `VIEW`s to point at. (That is, a copy we can create/tear down/etc.)

Work is underway in https://github.com/GSA-TTS/FAC/tree/jadudm/api-perf
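A hedged sketch of that extra copy; the schema name and the mechanics are assumptions, not the actual pipeline code.

```sql
-- Keep a disposable copy of the dissemination_* tables in DB2 for the
-- api_v1_1_0 VIEWs to point at, so backups/deploys can DROP and recreate
-- dissemination_general without breaking the API.
CREATE SCHEMA IF NOT EXISTS dissem_copy;

DROP TABLE IF EXISTS dissem_copy.dissemination_general;
CREATE TABLE dissem_copy.dissemination_general AS
    TABLE public.dissemination_general;

-- ...repeat for the remaining dissemination_* tables, then point the
-- api_v1_1_0 VIEWs at dissem_copy.* instead.
```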