censusreporter / census-api

The home for the API that powers the Census Reporter project.
MIT License

How can we store static ACS data outside of the database? #44

Open iandees opened 8 years ago

iandees commented 8 years ago

Problem: we have hundreds of GB of ACS data sitting in a database that's running 24/7, and only a very small amount of it ever gets queried.

Idea: Could we store data in S3 in such a way that we could do a range request for data without having to query the database?

iandees commented 7 years ago

After doing another data update for 2015 ACS data, I'm really interested in making this happen. I'd like to figure out a way to keep Census Reporter running for longer by not paying ~$40/mo in PostgreSQL data storage.

/cc @migurski @jedsundwall

migurski commented 7 years ago

What are some example queries made against this data?

JoeGermuska commented 7 years ago

@migurski it is mostly just filtering, either for an explicit list of geographies, a containment query (geographies of type X contained by Y), or both.

The API supports selecting one or more specified columns from diverse tables (instead of just "all columns in the table"), but that's a pretty specialized use case, probably rarely used.

For our profile pages, we gather data from many tables. We could consider preprocessing those, especially because API users have expressed interest in being able to query the whole set of profile page figures for geographies rather than having to do the aggregation themselves.

I'm pretty sure that's it. @iandees am I forgetting anything?

iandees commented 7 years ago

The way I like to describe it is that it's one huge spreadsheet with thousands of columns (columns are grouped into tables of related columns – these are only loosely related to DBMS columns/tables) and tens of thousands of rows. In Census-speak, the columns are things like "Population" and "Median Household Income", and the rows are geographies like Milwaukee, WI or the ZCTA 24450. Each cell in this spreadsheet has an estimate and a measurement of error (aka uncertainty).

|   | A     | B     | C     |
|---|-------|-------|-------|
| X | e / u | e / u | e / u |
| Y | e / u | e / u | e / u |
| Z | e / u | e / u | e / u |

(where A, B, C are columns and X, Y, Z are geographies)

Like Joe said, the API supports taking chunks out of this gigantic spreadsheet by letting the user specify a comma-separated list of geographies and a comma-separated list of columns. So if you send `?geo_ids=X,Z&table_ids=B`, you'd get the highlighted data:

|   | A | B     | C |
|---|---|-------|---|
| X |   | e / u |   |
| Y |   |       |   |
| Z |   | e / u |   |

Like Joe said, the majority of the queries come from our profile pages, which make several requests like my example here, do some math to put the data together, and show a page like this.
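
For concreteness, here's roughly what one of those requests looks like from Python. The endpoint path, the `latest` release alias, and the response shape in the comments are my reading of the current API, so treat them as assumptions rather than a spec:

```python
# Rough sketch of a profile-page-style request: one table, two geographies.
import json
import urllib.request

url = ("https://api.censusreporter.org/1.0/data/show/latest"
       "?table_ids=B01003&geo_ids=04000US06,05000US06001")

with urllib.request.urlopen(url) as resp:
    result = json.loads(resp.read().decode("utf-8"))

# Assumed shape: result["data"][geoid][table_id]["estimate"/"error"][column_id]
for geoid, tables in result["data"].items():
    print(geoid, tables["B01003"]["estimate"])
```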

migurski commented 7 years ago

I guess you could PUT an enormous slab of data into S3, and then bake a number of b-tree indexes into it that would support range queries. I’ve never done anything like this — it’d be like network-mapping a SQLite file, I suppose, so maybe someone has done that? Seeking into the indexes will also be tricky, I expect that each one would be pretty large. What's querying the DB? Something on EC2?

Do you need to retrieve the full row each time, or could this behave more like a column-oriented DB?

Also for the spatial queries I guess you'd want a GIN-style inverted index of some kind of tile to each geography.
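
To make that concrete, here is a toy version of the idea: map each geography's bounding box to the zoom-8 tiles it touches, then look up candidates by tile quadkey. This assumes the `mercantile` library, and the Alameda County bounding box below is approximate; real geometries would still need GEOS/Shapely on the client to refine matches.

```python
# Toy tile-to-geography inverted index: quadkey -> list of GEOIDs.
from collections import defaultdict

import mercantile

ZOOM = 8  # arbitrary zoom level for the index

def build_index(bboxes):
    """bboxes: {geoid: (west, south, east, north)} -> {quadkey: [geoids]}"""
    index = defaultdict(list)
    for geoid, (w, s, e, n) in bboxes.items():
        for tile in mercantile.tiles(w, s, e, n, [ZOOM]):
            index[mercantile.quadkey(tile)].append(geoid)
    return index

# Lookup: find candidate geographies for a point, then refine client-side.
index = build_index({"05000US06001": (-122.34, 37.45, -121.47, 37.91)})
tile = mercantile.tile(-122.27, 37.80, ZOOM)  # a point in Oakland, CA
print(index.get(mercantile.quadkey(tile), []))
```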

migurski commented 7 years ago

If something on EC2 is doing the querying, what about a mounted EBS volume containing database files? SQLite or even Postgres could work.

iandees commented 7 years ago

The current setup is one EC2 instance querying an RDS instance. The 300GB EBS volume attached to the RDS instance is currently the biggest expense we have and what I want to get rid of :).

migurski commented 7 years ago

Ah okay, that makes sense. I’d love to play with this; it’s an interesting problem that I toyed with on S3 back when it first came out and I wanted to see if there was a way to interact with indexed data client-side. Would you ever search on values in the table other than geography, or is it really just ID lookup?

iandees commented 7 years ago

It's only ever an ID lookup. There's currently no functionality to do things like `WHERE population > 50000`, only `WHERE geo_id IN ('04055', '04023')`.

migurski commented 7 years ago

The universe of GEOIDs for ACS looks to be 7,845 — any reason to not keep each of these in separate S3 objects so they can be requested individually? Then all that's left is a spatial index.

migurski commented 7 years ago

…or, should this process work for the much larger 5-year estimates as well?

iandees commented 7 years ago

There are many, many more in the 5-year, but splitting into files per geoid does make sense for the kinds of queries people would tend to do (focused on geographies).

migurski commented 7 years ago

Right, makes sense. I’m planning to pick at a basic b-tree for GEOIDs as well as a quadtree-based spatial index for geographies. It would require something like GEOS or Shapely client-side to work.

iandees commented 7 years ago

You can probably leave out geospatial indexing for now. We can pre-generate vector tiles for all of TIGER.

JoeGermuska commented 7 years ago

FWIW, this is basically how we built census.ire.org. http://census.ire.org/data/bulkdata.html

EDIT: I guess this page is closer to the mark for the JSON-per-geoid thing http://census.ire.org/docs/javascript-library.html

migurski commented 7 years ago

Nice! Also similar to how I whacked together census-tools for the 2000 decennial: https://github.com/migurski/census-tools

migurski commented 7 years ago

I took a swing at this on the flight; here's a gist: https://gist.github.com/migurski/659cafc98779fadb5c1e79533206372d

The files on disk are encoded Latin-1 like most Census stuff, and the scripts use Python 3 for the unicode goodness, with no 3rd-party libs. The index file is a simple 3-column CSV with the GEOID, byte offset, and byte length of each record. There's no particular reason the records have to be CSV, by the way. I've tested this locally with all 757MB of the 2015 ACS 1-year release, and it’s pretty speedy retrieving records.
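
In pseudocode-ish Python, the lookup works roughly like this; the URL and filenames below are placeholders, not what's in the gist:

```python
# Minimal sketch: look up a GEOID in the 3-column index, then fetch only that
# record's bytes from the remote data file with an HTTP Range request.
import csv
import urllib.request

DATA_URL = "https://example-bucket.s3.amazonaws.com/acs2015_1yr.csv"  # placeholder

def load_index(path):
    """Read the index CSV into a dict of geoid -> (byte offset, byte length)."""
    with open(path, encoding="latin-1") as f:
        return {geoid: (int(offset), int(length))
                for geoid, offset, length in csv.reader(f)}

def fetch_record(index, geoid):
    """Fetch a single record's bytes via an HTTP Range request (expects a 206)."""
    offset, length = index[geoid]
    req = urllib.request.Request(
        DATA_URL,
        headers={"Range": "bytes={}-{}".format(offset, offset + length - 1)},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("latin-1")

index = load_index("acs2015_1yr.index.csv")  # placeholder filename
print(fetch_record(index, "04000US06"))
```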

I have a proper git repo for this locally.

migurski commented 7 years ago

I wrote a small Flask app to serve this index, and posted it to Heroku with data living in two places. Responses include remote data access times in headers.

With the indexes and data living on Benzene (openstreetmap.us), retrieving two geographies takes 3.75 total seconds:

```
> curl -si 'https://census-remote-index.herokuapp.com/?geo_ids=05000US06001,04000US06' | cut -c 1-175
HTTP/1.1 200 OK
X-Index-Times: 0.769186 0.005368
X-Data-Times: 1.981628 0.996756
Content-Length: 333663
Content-Type: text/plain

ACSSF,CA,040,00,0000001,,,,,06,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,04000US06,California,,,,583222,215206,39144818,19440558,1282484,1292517,1288636,790063,543195,300586,29974
ACSSF,CA,050,00,0000013,,,,,06,001,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,05000US06001,"Alameda County, California",,,,26457,9787,1638215,803485,50045,51267,47282,28074,20078,89
```

With the indexes and data living on S3, retrieving two geographies takes 0.45 total seconds:

```
> curl -si 'https://census-remote-index.herokuapp.com/?geo_ids=05000US06001,04000US06' --compress | cut -c 1-175
HTTP/1.1 200 OK
X-Index-Times: 0.073364 0.004485
X-Data-Times: 0.260381 0.115801
Content-Length: 333663
Content-Type: text/plain

ACSSF,CA,040,00,0000001,,,,,06,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,04000US06,California,,,,583222,215206,39144818,19440558,1282484,1292517,1288636,790063,543195,300586,29974
ACSSF,CA,050,00,0000013,,,,,06,001,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,05000US06001,"Alameda County, California",,,,26457,9787,1638215,803485,50045,51267,47282,28074,20078,89
```

With the indexes and data living at http://localhost, retrieving two geographies takes 0.12 total seconds, as a baseline:

```
> curl -si 'http://127.0.0.1:5000/?geo_ids=05000US06001,04000US06' | cut -c 1-175
HTTP/1.1 200 OK
X-Index-Times: 0.028033 0.004092
X-Data-Times: 0.060889 0.025159
Content-Length: 333663
Content-Type: text/plain

ACSSF,CA,040,00,0000001,,,,,06,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,04000US06,California,,,,583222,215206,39144818,19440558,1282484,1292517,1288636,790063,543195,300586,29974
ACSSF,CA,050,00,0000013,,,,,06,001,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,05000US06001,"Alameda County, California",,,,26457,9787,1638215,803485,50045,51267,47282,28074,20078,89
```

Stuff to think about: gzipping records on S3? Storing them in JSON or GeoJSON with geometries? Is it fast enough?

iandees commented 7 years ago

Thanks for doing this, @migurski. This is really great. I will take a closer look tomorrow, but this seems like a great start.

Some first thoughts:

More tomorrow.

migurski commented 7 years ago

Interestingly (to me), the index lookup times are only about 20% of the total, so it might not be a huge savings to pre-slice the data by geography? The data this demo is hosting has about 150KB per geography, but its columns are not marked in any sensible way. I suppose that a JSON blob per record, perhaps gzipped, might make more sense.

iandees commented 7 years ago

I tried the gzipped JSON blob idea. Example is posted on S3 here. It's 249KB gzipped on S3 and 1.1MB expanded in the browser.

Timing the fetch, decompress, and parse steps shows about 0.25s each, for a total of just under 0.75s on my laptop over wireless:

```
$ time curl -s https://embed.censusreporter.org/test/04000US02.json | gzip -d | jq .tables.B00001.estimate.B00001001 && echo
19940
curl -s https://embed.censusreporter.org/test/04000US02.json  0.03s user 0.02s system 16% cpu 0.254 total
gzip -d  0.01s user 0.00s system 3% cpu 0.254 total
jq .tables.B00001.estimate.B00001001  0.05s user 0.01s system 21% cpu 0.262 total
```

(Note that embed.censusreporter.org is a CloudFront distribution that points to the S3 bucket)

So this is quite a bit slower, but it's easier to understand and leaves room in the JSON for other metadata about the geography.
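
If we went this route, the API layer could just fan out to the per-GEOID objects and merge them. A rough sketch, assuming the blobs stay raw-gzipped at the test URL above (not production code):

```python
# Fetch per-GEOID gzipped JSON blobs and assemble a multi-geography response.
import gzip
import json
import urllib.request

BASE_URL = "https://embed.censusreporter.org/test/{}.json"

def fetch_geography(geoid):
    with urllib.request.urlopen(BASE_URL.format(geoid)) as resp:
        body = resp.read()
    # The test objects are stored as raw gzip; decompress if the magic bytes match.
    if body[:2] == b"\x1f\x8b":
        body = gzip.decompress(body)
    return json.loads(body)

def fetch_many(geoids):
    """Return {geoid: blob} for each requested geography."""
    return {geoid: fetch_geography(geoid) for geoid in geoids}

data = fetch_many(["04000US02"])
print(data["04000US02"]["tables"]["B00001"]["estimate"]["B00001001"])  # 19940
```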

JoeGermuska commented 7 years ago

@iandees What's the plan for the full text search feature in this model?

iandees commented 7 years ago

We can still have a database, but it doesn't need to be 300+ GB if we can put the bulk data on S3.