bibanul / tiger-geocoder

US geocoder module that works on top of TIGER/Line data files from the US Census Bureau
http://www.tigergeocoder.com

Why is the database so large? #4

paulyoder closed this issue 9 years ago

paulyoder commented 9 years ago

I noticed the database size is around 100 GB. I'm wondering why it's so much larger than the sqlite3 database that https://github.com/geocommons/geocoder uses. It also downloads the TIGER data and populates an sqlite3 database, but its database is only 4 or 5 GB.

The reason I ask is that I can run geocoder on a small EC2 instance that only costs $10 or $20 a month, whereas the rc.large instance you recommend is over $160 a month.

Thanks so much! Paul

bibanul commented 9 years ago

Hi Paul,

It's not the DB size that gives you the needed performance, it's the RAM. You can't run a geocoder on a small instance; that's why we restricted the minimum instance size, to make sure the product we put out is usable. It's unrealistic to expect high-volume geocoding on 1 GB of RAM or less. If you only want to run 1 concurrent request against the geocoder, a small instance probably works, but the data swapping across different states will definitely hurt response times. We designed this for large volume and fast response, including intersection geocoding, which is heavy. We also added custom code to handle caching via Redis, and to scale that up it's RAM again. The speed gain is unbelievable: say you geocode an address and get it back in 100ms; hit it again and we serve it from Redis in 1ms. That scales a LOT, but it requires RAM to do so. Geocodes are safe to cache for months, for example.
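For anyone curious what that caching layer looks like in practice, here's a minimal sketch (not the actual code shipped in the image) of fronting the geocoder with Redis, assuming Python with the redis-py client and a placeholder `geocode_address()` function that queries the PostGIS geocoder:

```python
# Minimal sketch of a Redis cache in front of a geocoder (illustrative only).
# Assumes redis-py and a geocode_address() helper that hits Postgres/PostGIS.
import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
CACHE_TTL = 60 * 60 * 24 * 90  # geocodes are stable, so cache for ~90 days

def cached_geocode(address, geocode_address):
    """Serve repeats from RAM (~1ms); fall through to the database (~100ms) on a miss."""
    key = "geocode:" + hashlib.sha1(address.strip().lower().encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)            # cache hit: answered from Redis
    result = geocode_address(address)     # cache miss: query the PostGIS geocoder
    r.setex(key, CACHE_TTL, json.dumps(result))
    return result
```

The first lookup pays the full Postgres cost; repeats of the same address come straight out of RAM, which is why the cache only helps if the instance has memory to spare.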

The minimum we played with and found acceptable for occasional use is a t2.medium with 4 GB of RAM. That's the lowest we'd go to get any decent, sub-second response time out of it. Again, there are many ways to skin the cat: if you mostly geocode 1 state, you can get away with a small instance. We assume people want to geocode across the US in no particular order.

As for why Postgres takes 100 GB, no clue. That's all the data, the indexes, and everything else needed. We're hands off the TIGER database layout; the folks at PostGIS designed the database structure.
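If you want to see where the space actually goes, you can total up relation sizes (table plus indexes) per table. A rough sketch with psycopg2, assuming the standard tiger/tiger_data schemas the PostGIS loader creates; the connection details are placeholders:

```python
# Rough sketch: break down which TIGER tables account for the ~100 GB,
# counting table data plus indexes. Connection parameters are placeholders.
import psycopg2

conn = psycopg2.connect(dbname="geocoder", user="postgres", host="localhost")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT schemaname, tablename,
               pg_size_pretty(pg_total_relation_size(
                   quote_ident(schemaname) || '.' || quote_ident(tablename))) AS total
        FROM pg_tables
        WHERE schemaname IN ('tiger', 'tiger_data')
        ORDER BY pg_total_relation_size(
                   quote_ident(schemaname) || '.' || quote_ident(tablename)) DESC
        LIMIT 20;
    """)
    for schema, table, total in cur.fetchall():
        print(f"{schema}.{table}: {total}")
```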

As far as expectations go, I don't think it's realistic to expect to get away with $30/mo and have a business-class geocoding machine. Google charges $100k per year and it's still not unlimited requests. Some machines on Azure go for $30/hr, lol. Yeah, $30.

I have tried many combinations, including running on Heroku Postgres, but they don't have SSDs sadly, so performance dies off fast.

paulyoder commented 9 years ago

@bibanul thanks for the explanation. I didn't mean to complain. The geocommons/geocoder project I referenced is stale and hasn't been worked on for a few years, and I've been banging my head trying to get it to use the 2014 TIGER data.

So I really appreciate the time you put into making it dead simple to get a server up and running on Amazon. I will gladly pay Amazon a little more for a bigger server so that I don't have to waste my time trying to build it myself.

Thanks!

bibanul commented 9 years ago

Yeah, no worries. I know the pain. It took me several weeks just to get the data imported and learn the pitfalls. Things like the Census FTP timing out on one file without you noticing; then later you ask why a specific address isn't geocoding and realize the TIGER/Line data for one parcel is missing.

The Census FTP timed out on me a lot over the week it takes to download all the TIGER files.
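One way to catch those silent failures is to compare what's on disk against the FTP listing after the download finishes. A sketch using Python's ftplib; the host, the TIGER2014 EDGES path, and the local staging directory are assumptions, so adjust them for whichever layers you load:

```python
# Sketch: detect missing or truncated TIGER downloads by comparing local file
# sizes against the Census FTP listing. Host/paths below are assumptions.
import os
from ftplib import FTP

FTP_HOST = "ftp2.census.gov"
FTP_DIR = "/geo/tiger/TIGER2014/EDGES"   # assumed layout: one zip per county
LOCAL_DIR = "/gisdata/TIGER2014/EDGES"   # wherever the loader staged the files

ftp = FTP(FTP_HOST)
ftp.login()                  # anonymous login
ftp.cwd(FTP_DIR)
ftp.voidcmd("TYPE I")        # binary mode, so SIZE works reliably

missing, truncated = [], []
for name in ftp.nlst():
    if not name.endswith(".zip"):
        continue
    remote_size = ftp.size(name)
    local_path = os.path.join(LOCAL_DIR, name)
    if not os.path.exists(local_path):
        missing.append(name)
    elif remote_size is not None and os.path.getsize(local_path) != remote_size:
        truncated.append(name)

ftp.quit()
print("missing:", missing)
print("truncated:", truncated)
```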

I'm using the same product myself on EC2 for a different business I have, and I'm geocoding at a rate of 1MM per day :). I too started with the Google API and moved on. The idea has been to offer a drop-in replacement for the Google API so you can geocode very large daily volumes.

Which reminds me, it's time to look for a TIGER data update and re-do the EC2 image, heh. I saw the 2014 TIGER release, but they had some issues with the format.