geocommons / geocoder

Modular Street Address Geocoder
http://www.geocommons.com
GNU Lesser General Public License v3.0
395 stars 89 forks

TigerLine 2011? #29

Open aprajnaparamita opened 12 years ago

aprajnaparamita commented 12 years ago

Hi!

I was merrily going about generating the geocoder.db file and I happened to download Tiger/Line 2011:

ftp://ftp2.census.gov/geo/tiger/TIGER2011/

It built successfully and I was able to install the gem. I went on to try to generate the data file as the README says and:

development:jjeffus@~/dev/geocoder[master]: build/tigerimport ~/geocoder.db /Volumes/Blimpy/TigerLine/
ls: /Volumes/Blimpy/TigerLine////tl*_edges.zip: No such file or directory

Seeing a tiger2009_import I figured maybe things had changed and the readme hadn't been updated. So:

development:jjeffus@~/dev/geocoder[master]: build/tiger2009_import ~/geocoder.db /Volumes/Blimpy/TigerLine/
ls: /Volumes/Blimpy/TigerLine//[0-9]*: No such file or directory

I'm guessing that the directory structure has changed again? The script definitely is expecting a very different structure. I'm poking around trying to figure out how to change the script. But obviously someone has had this problem before. So maybe it's a futile cause? Has it changed so significantly that it would require a complete rewrite?

aprajnaparamita commented 12 years ago

It seems this is a problem that's been active since at least the 2010 data came out:

https://groups.google.com/d/msg/geocommons-geocode/PH6g20m7kaU/Z_W065lbyjkJ

It looks like the data files in the 2011 distribution are kept in the same directories, instead of spread out over different states and counties. The principal directories are:

EDGES ADDR FEATNAMES

These just contain the zip files directly. The script tiger2009_import seems to do this:

# Foreach ZIP in FEATNAMES ADDR EDGES
unzip -q $ZIP -d $TMP
# We're building SQL here so create the tables
cat ${SQL}/setup.sql > output.sql
# Foreach file in EDGES do
shp2sqlite -aS ${TMP}/*_${file}.shp tiger_${file} >> output.sql
# Foreach file in FEATNAMES and ADDR
shp2sqlite -an ${TMP}/*_${file}.dbf tiger_${file} >> output.sql
# Now do transformations using temporary tables
cat ${SQL}/convert.sql >> output.sql
# Now run the SQL
cat output.sql | sqlite3 $DATABASE
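
With a flat layout, the three files that belong to one county can only be matched up by name. A minimal sketch, assuming the 2011 zips are named like tl_2011_01001_edges.zip (year, a five-digit state and county code, then the type). The function name county_of is hypothetical and not part of geocoder:

```shell
# Print the county code embedded in a TIGER/Line zip name.
# Assumes names of the form tl_<year>_<code>_<type>.zip; this naming is an
# assumption about the 2011 files, not something taken from tiger2009_import.
county_of() {
    name=${1##*/}                 # strip any leading directory
    name=${name#tl_*_}            # drop the "tl_<year>_" prefix
    printf '%s\n' "${name%%_*}"   # keep everything before the next "_"
}
```

Calling county_of on EDGES/tl_2011_01001_edges.zip and on ADDR/tl_2011_01001_addr.zip should print the same code, which is enough to pair the files without the old state/county directories.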

Would anyone be offended if I rewrote this using Ruby for Tiger2011?

aprajnaparamita commented 12 years ago

I get why it's written as a single long pipe command: it's an elegant solution to the problem of the size of the data. I have written the following script in Ruby.

https://gist.github.com/1631758

It took roughly two hours on my quad-core Mac to create the loading.sql file, which was roughly 99 GB. Unfortunately it seems to get stuck on the "cat loading.sql | sqlite3 #{database}" part. I gave it 16 hours, after which it was stuck using 1% of the CPU. Very strange. I probably need to rewrite it to use a single long pipe.
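
One possible explanation for the stall (a guess, since the gist is not reproduced here): in its default autocommit mode, sqlite3 commits every statement that is not inside an explicit transaction on its own, so a load of this size can spend most of its time waiting on the disk rather than using the CPU. A minimal sketch of wrapping a generated SQL stream in one transaction while it is piped; in_one_transaction and generate_sql are hypothetical names:

```shell
# Read SQL statements on stdin and emit them inside a single transaction.
# Only useful if the incoming stream has no BEGIN/COMMIT of its own;
# SQLite rejects a BEGIN inside an already open transaction.
in_one_transaction() {
    printf 'BEGIN;\n'
    cat
    printf 'COMMIT;\n'
}

# Hypothetical usage; generate_sql stands in for whatever produces the SQL:
#   generate_sql | in_one_transaction | sqlite3 geocoder.db
```

Piping this way also avoids writing the intermediate loading.sql file at all.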

alexsenxu commented 12 years ago

I could be wrong but the state/county organization from TIGER/Line 2009 might be used in further steps after the import step.

aprajnaparamita commented 12 years ago

I just double checked and I'm not seeing any place where it's used. It looks like it simply imports the shp and dbf files into the database without regard to the folder names / placement. Of course, this shell script is pretty dense stuff for me. Here's my attempt to rewrite the above script while maintaining the whole pipe mechanism.

https://gist.github.com/1694885

I haven't run it, since I just decided to use a commercial product for geocoding. But I hope we can get to the bottom of this and update geocoder to the new database. I'm going to work on it this weekend.

alexsenxu commented 12 years ago

Good call. Just out of curiosity, what commercial geocoding software (or service) are you using?

I am working on porting TIGER/Line 2011 onto HDFS instead of a database. I will post an update once there is progress.

aprajnaparamita commented 12 years ago

Well, the data I was working on was 90% just city/state/zip. So for those I used: http://www.zipcodedownload.com/

Then I used the geocoder gem with Bing maps for the last 10%: https://github.com/alexreisner/geocoder

This is not ideal, but I think I ended up with pretty high quality results. I'm hoping to get this geocoder database fixed, since it doesn't have any usage limits and it's not locked down by any corporation or government.

aprajnaparamita commented 12 years ago

I can't believe it didn't occur to me but all you need to do is use the tiger_import script. Import for 2011 goes like this:

First follow the Prerequisites section of the Geocoder man page (https://github.com/geocommons/geocoder) but skip "Additionally, you will need a custom build of the ‘sqlite3-ruby’ gem". It's not needed anymore. Next build the geocoder gem:

git clone git://github.com/geocommons/geocoder.git
cd geocoder
make install
gem install Geocoder-US-2.0.2pre.gem

On Mac OS X it will fail at "make install" with "ld: symbol(s) not found for architecture x86_64". Here's the fix:

cd src/shp2sqlite
make -f Makefile.macosx
cd ../..
make install
gem install Geocoder-US-2.0.2pre.gem
ruby -rgeocoder/us -e ''
# This last command will fail with a nasty error like:
# /Users/jjeffus/.rvm/rubies/ruby-1.9.2-p290/lib/ruby/site_ruby/1.9.1/geocoder/us/database.rb:96:in `load_extension': dlopen(/Users/jjeffus/.rvm/rubies/ruby-1.9.2-p290/lib/ruby/site_ruby/1.9.1/geocoder/us/sqlite3.so, 10): image not found (RuntimeError)
# To get a working geocoder/us you need to take the filename after dlopen( and copy the correct file there. In this case
# the file is: /Users/jjeffus/.rvm/rubies/ruby-1.9.2-p290/lib/ruby/site_ruby/1.9.1/geocoder/us/sqlite3.so
# so: cp lib/geocoder/us/sqlite3.so /Users/jjeffus/.rvm/rubies/ruby-1.9.2-p290/lib/ruby/site_ruby/1.9.1/geocoder/us/sqlite3.so
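
The path to copy to can also be pulled out of the error message instead of being retyped. A minimal sketch, assuming the error has the "dlopen(PATH, N): image not found" shape shown above; dlopen_target is a hypothetical helper name:

```shell
# Print the path that dlopen() failed to find, read from an error line on stdin.
# Lines without "dlopen(PATH," produce no output.
dlopen_target() {
    sed -n 's/.*dlopen(\([^,]*\),.*/\1/p'
}

# Hypothetical usage from the geocoder root:
#   target=$(ruby -rgeocoder/us -e '' 2>&1 | dlopen_target)
#   [ -n "$target" ] && cp lib/geocoder/us/sqlite3.so "$target"
```

If the require succeeds there is no error line, target stays empty, and nothing is copied.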

After you have successfully built geocoder::us, run the following from the geocoder root:

mkdir data
mkdir database
cd data
wget -nd -r -A.zip ftp://ftp2.census.gov/geo/tiger/TIGER2011/ADDR/
wget -nd -r -A.zip ftp://ftp2.census.gov/geo/tiger/TIGER2011/FEATNAMES/
wget -nd -r -A.zip ftp://ftp2.census.gov/geo/tiger/TIGER2011/EDGES/
cd ..
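
Because wget -nd puts all three file types into the same directory, it can help to count them before starting a long import. A minimal sketch, assuming the zips end in _edges.zip, _addr.zip and _featnames.zip; count_zips is a hypothetical helper:

```shell
# Count the zip files of one type (edges, addr or featnames) in a directory.
count_zips() {
    n=0
    for f in "$1"/*_"$2".zip; do
        # When nothing matches, the loop sees the literal pattern; skip it.
        if [ -e "$f" ]; then n=$((n + 1)); fi
    done
    printf '%s\n' "$n"
}

# Hypothetical usage from the geocoder root:
#   for kind in edges addr featnames; do
#       echo "$kind: $(count_zips data "$kind")"
#   done
```

A large gap between the three counts may point to an interrupted download. Small differences may be legitimate, so this is a rough check rather than a validation.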

Now open "build/tiger_import" in the text editor of your choice and change:

SHP2SQLITE=../src/shp2sqlite/shp2sqlite 
# to
SHP2SQLITE="$BASE/shp2sqlite"
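
The same edit can be scripted. A minimal sketch, assuming build/tiger_import has exactly one line beginning with SHP2SQLITE=; patch_shp2sqlite is a hypothetical helper, and it goes through a temporary file because in-place editing flags differ between sed implementations:

```shell
# Rewrite the SHP2SQLITE= line of the given script to use $BASE/shp2sqlite.
patch_shp2sqlite() {
    tmp=$(mktemp) || return 1
    # Single quotes keep $BASE literal so the script expands it at run time.
    if sed 's|^SHP2SQLITE=.*|SHP2SQLITE="$BASE/shp2sqlite"|' "$1" > "$tmp"; then
        cat "$tmp" > "$1"    # overwrite the contents, keeping file permissions
    fi
    rm -f "$tmp"
}

# Hypothetical usage from the geocoder root:
#   patch_shp2sqlite build/tiger_import
```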

Now we can finally do the import:

build/tiger_import database/geocoder.sqlite3 data
chmod +x build/build_indexes
build/build_indexes database/geocoder.sqlite3
sudo gem install text --no-rdoc --no-ri
bin/rebuild_metaphones database/geocoder.sqlite3

It took my Amazon EC2 extra-large instance about 8 hours to do the import. I'm going to put up a torrent of the finished sqlite database, as well as upload it to rapidshare or something. I'll post the links here.

Also, I'm going to fork the codebase and update the docs. This is one of the coolest libraries out there. I hope we can come together as a community and keep this thing working.

aprajnaparamita commented 12 years ago

I've uploaded a torrent of the full data here: http://assuredwebdevelopment.com/geocoder_us_tigerline_2011.7z.torrent

Backup here: http://www.mybtfiles.com/torrents/65950942/

hekaldama commented 12 years ago

Can someone just upload their sqlite db file with 2011 loaded so that we can just use that? Are there problems with this approach?

campgurus commented 12 years ago

hekaldama: I did, it's in my last post. I uploaded it as a Torrent file. Let me know how that works out.

hekaldama commented 12 years ago

Trying to download now. I am not sure if my firewall is blocking me or not, but it currently isn't downloading...

mattyb commented 12 years ago

I used this method on the TIGER2012 data. I was able to import and pass the tests. However, there are several lines like this in the log:

/tmp/tiger-import.9161/*_addr.dbf: dbf file (.dbf) can not be opened.

Is that something I should be worried about?
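
A note on that log line: the literal * in the path suggests the shell found no file matching *_addr.dbf in the temporary directory and passed the pattern through unchanged, which would happen for any county whose unzipped files include no address table. That reading is an assumption and has not been checked against the 2012 data. A minimal sketch of testing for a match before running an import step; any_exists is a hypothetical helper:

```shell
# Succeed if at least one argument names an existing file.
# An unmatched glob arrives as the literal pattern, which does not exist.
any_exists() {
    for f in "$@"; do
        if [ -e "$f" ]; then return 0; fi
    done
    return 1
}

# Hypothetical guard around the per-county .dbf import:
#   if any_exists "$TMP"/*_addr.dbf; then
#       : # run the import step for the addr table here
#   fi
```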

aprajnaparamita commented 12 years ago

Here you go guys: https://www.dropbox.com/s/7so3ivq2npxcndy/geocoder_us_tigerline_2011.7z

DonFuego commented 11 years ago

Has anyone uploaded a built 2012 sqlite database? This 2011 7z file is throwing an error when I try to decompress it :/

Shelnutt2 commented 10 years ago

Here is the 2014 raw sqlite db. http://downloads.codefi.re/shelnutt2/geocoder_tiger_2014.db