daguar closed 11 years ago
Here is a typical webpage when you look up each NAICS code: http://www.census.gov/cgi-bin/sssd/naics/naicsrch?code=541430&search=2012%20NAICS%20Search (It's not "Census Data" in the sense most people think of Census Data; it's just NAICS information hosted on the Census website.)
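As a minimal sketch of how the lookup URL above could be built for any code: the helper name `naics_lookup_url` and the constants are hypothetical, and the query parameters are simply copied from the example URL rather than taken from any documented Census interface.

```ruby
require 'erb'

# Base of the lookup page, copied from the example URL above.
BASE_URL = 'http://www.census.gov/cgi-bin/sssd/naics/naicsrch'

# Search label as it appears in the example URL (2012 codes only).
SEARCH_LABEL = '2012 NAICS Search'

# Hypothetical helper: returns the lookup URL for one NAICS code.
def naics_lookup_url(code)
  # ERB::Util.url_encode escapes spaces as %20, matching the example URL.
  "#{BASE_URL}?code=#{code}&search=#{ERB::Util.url_encode(SEARCH_LABEL)}"
end

puts naics_lookup_url(541430)
```

Whether the same pattern works for other NAICS years is an assumption that would need checking against the site.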
There are three parts to these pages that I want to scrape:
Below this is an (optional) "index listing", which is essentially a set of alternate titles for this code. An XLS spreadsheet already provides this information, so I won't need to scrape it.
Right now, I'd like to store everything in the single JSON file that I already use for all the data. Further down the line, it's worth considering some other data store, since someone requesting a list of codes and titles probably won't need all the cross-references or descriptions as well.
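To illustrate the trade-off, here is a rough sketch of one full record kept in a single JSON structure, plus a slimmed-down listing for callers who only want codes and titles. The field names (`code`, `title`, `description`, `crossrefs`) and the helper `slim_listing` are assumptions for illustration, not the API's actual schema, and the description text is a placeholder.

```ruby
require 'json'

# Hypothetical record shape; field names are illustrative only.
records = [
  {
    'code' => 541430,
    'title' => 'Graphic Design Services',
    'description' => '(scraped description text would go here)',
    'crossrefs' => []
  }
]

# Keep only what a "list of codes and titles" request needs.
def slim_listing(records)
  records.map { |r| { 'code' => r['code'], 'title' => r['title'] } }
end

# Everything still round-trips through one JSON document.
full_json = JSON.generate(records)
restored = JSON.parse(full_json)

puts JSON.pretty_generate(slim_listing(restored))
```

A separate data store could later serve the slim listing directly, without loading descriptions and cross-references at all.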
cc @ycombinator
I just issued a pull request with a basic Ruby scraper that seems to work for 2012 codes at least, capturing your first bullet above.
Give it a spin and let me know what you think.
To try it out:
gem install GEMNAMEGOESHERE
ruby naics_scraper.rb
You'll be given an interactive prompt where you can now type Ruby code.
I'm actually throwing the scraper itself in a separate repo (see rationale in closed pull request): https://github.com/daguar/naics-scraper
Will communicate here for adding the data into the API.
Why not have the scraper use the API to get the data in?
@ycombinator Could well do, but right now I wanted to isolate the scraper itself for neat separation and discoverability. Hopefully that doesn't rule out importing the (scraped, then reviewed) data via the API later.
@louh Check out the sample data I've dumped so far and let me know if you spot any big problems: https://github.com/daguar/naics-scraper/blob/master/sample-data-2012-incomplete.json
Actually, check out this data (all 2012 codes): https://github.com/daguar/naics-scraper/blob/master/complete-data-2012-052513-145pmPT.json
I just added this last scraped JSON to the API. The entries that say "See industry description for" are still added as-is, though.
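One possible way to flag those placeholder entries for later cleanup is sketched below. It assumes the phrase is followed by the referenced code, which has not been verified against every scraped entry; the constant `PLACEHOLDER`, the helper `redirect_target`, and the sample strings are all hypothetical.

```ruby
# Assumed pattern: the placeholder phrase followed by a 2- to 6-digit code.
PLACEHOLDER = /\ASee industry description for\s+(\d{2,6})/

# Hypothetical helper: returns the referenced code as a string,
# or nil when the description is ordinary text.
def redirect_target(description)
  match = PLACEHOLDER.match(description.to_s.strip)
  match && match[1]
end

# Illustrative inputs, not real scraped data.
samples = [
  'See industry description for 541430.',
  'An ordinary description paragraph.'
]

samples.each do |text|
  target = redirect_target(text)
  puts(target ? "points to #{target}" : 'regular description')
end
```

Entries that return a code could then be filled in from the referenced record instead of being stored as-is.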
A lot of this has been added now thanks to additional scraper work by @migurski.
@louh -- Could you describe this a bit more in this issue? (Perhaps with links and what you want to achieve?)
Happy to help out if it's quick.