Getting Census website data on NAICS

daguar commented 11 years ago

@louh -- Could you describe this a bit more in this issue? (Perhaps with links and what you want to achieve?)

Happy to help out if it's quick.

louh commented 11 years ago

Here is a typical webpage when you look up each NAICS code: http://www.census.gov/cgi-bin/sssd/naics/naicsrch?code=541430&search=2012%20NAICS%20Search (It's not "Census Data" in the sense most people think of Census Data; it's just NAICS information hosted on the Census website.)

There are three parts to these pages that I want to scrape:

- The text description (immediately below the NAICS title), which could be one or more paragraphs, or could just be a one-line "see also" pointing to another code;
- Illustrative examples (optional, in the sense that not all codes will have them), which describe typical businesses that fit this category;
- Cross-references (optional), which describe similar businesses that would actually be in another category.

Below this is an (optional) "index listing", which is essentially a set of alternate titles for this code, but there is an XLS spreadsheet that already provides this information, so I won't need to scrape it.

Right now, I'd like to store all of this in the same JSON file that holds the rest of the data. Further down the line, it's worth considering some other data store, since someone requesting a list of codes and titles probably won't need all the cross-references or descriptions as well.
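
To illustrate the trade-off above, here is a minimal Ruby sketch of one full record kept in a single JSON structure, plus a slimmer code/title listing derived from it. The key names (`title`, `description`, `examples`, `crossrefs`), the placeholder strings, and the helper `summary_list` are all hypothetical and are not taken from the actual JSON file.

```ruby
require 'json'

# Hypothetical full record for one code. Key names are assumptions,
# and the bracketed strings are placeholders, not scraped data.
FULL_RECORDS = JSON.parse(<<~JSON)
  {
    "541430": {
      "title": "Graphic Design Services",
      "description": ["(scraped description paragraphs go here)"],
      "examples": ["(illustrative examples, if any)"],
      "crossrefs": ["(cross-references, if any)"]
    }
  }
JSON

# Build a lightweight code/title listing for callers that
# do not need descriptions or cross-references.
def summary_list(records)
  records.map { |code, record| { 'code' => code, 'title' => record['title'] } }
end

puts JSON.pretty_generate(summary_list(FULL_RECORDS))
```

Deriving the short listing from the full records keeps one source of truth for now; a separate data store could replace this later without changing what callers see.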

cc @ycombinator

daguar commented 11 years ago

I just issued a pull request with a basic Ruby scraper that seems to work for 2012 codes at least, capturing your first bullet above.

Give it a spin and let me know what you think.

To try it out:

1. Make sure you have Ruby installed!
2. Install all gems listed at the top (in terminal, run gem install GEMNAMEGOESHERE)
3. From terminal in the content-scraper directory, run ruby naics_scraper.rb: you'll be given an interactive prompt where you can now type Ruby code
4. Use the example code provided at bottom (in comments)

daguar commented 11 years ago

I'm actually throwing the scraper itself in a separate repo (see rationale in closed pull request) https://github.com/daguar/naics-scraper

Will communicate here for adding the data into the API.

ycombinator commented 11 years ago

Why not have the scraper use the API to get the data in?

daguar commented 11 years ago

@ycombinator Could well do, but right now I wanted to isolate the scraper itself for neat separation and discoverability. Hopefully that doesn't rule out importing the (scraped, then reviewed) data via the API later.

daguar commented 11 years ago

@louh Check out the sample data I've dumped so far and let me know if you spot any big problems: https://github.com/daguar/naics-scraper/blob/master/sample-data-2012-incomplete.json

daguar commented 11 years ago

Actually, check out this data (all 2012 codes): https://github.com/daguar/naics-scraper/blob/master/complete-data-2012-052513-145pmPT.json

louh commented 11 years ago

I just added this last scraped JSON to the API. The entries that say "See industry description for" are still added as-is, though.
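
One possible way to handle those referral entries later is to detect them and follow the pointer to the code that holds the real description. The Ruby sketch below assumes each description is stored as an array of strings and that a referral reads exactly "See industry description for NNNNNN." Both are assumptions about the scraped data, and `referral_target` and `resolve_description` are hypothetical helpers, not part of the scraper or the API.

```ruby
# Matches a referral line and captures the target code (format assumed).
REFERRAL = /\ASee industry description for (\d{2,6})\.?\z/i

# Returns the referenced code if the description is only a referral, else nil.
def referral_target(description)
  return nil unless description.length == 1
  match = REFERRAL.match(description.first.strip)
  match && match[1]
end

# Follows referrals until a real description is found.
# The `seen` list guards against circular referrals.
def resolve_description(records, code, seen = [])
  record = records[code]
  return nil if record.nil? || seen.include?(code)
  target = referral_target(record['description'])
  return record['description'] if target.nil?
  resolve_description(records, target, seen + [code])
end

# Placeholder records for illustration only.
records = {
  '11111'  => { 'description' => ['See industry description for 111110.'] },
  '111110' => { 'description' => ['(full scraped description text)'] }
}
p resolve_description(records, '11111')
```

Keeping the referral text in the stored data and resolving it at read time leaves the scraped output untouched, which makes it easier to review against the Census pages.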

louh commented 11 years ago

A lot of this has been added now thanks to additional scraper work by @migurski.

codeforamerica / naics-api

Getting Census website data on NAICS #5
