Initial Discussion - Githubissues

ajb commented 11 years ago

Background

Yesterday, someone brought to my attention this question on the new Open Data Stack Exchange. (Asked by Christopher Whitaker, no less.) To paraphrase the question:

"What if we had the same data schema in multiple cities so that we could start to compare procurement across different cities? "

To paraphrase my answer, we should be building on top of NAICS codes, instead of around them.

The major flaw with NAICS is that it's not user friendly enough. This can be solved by civic hackers. I propose we build an abstraction layer on top of NAICS that maps the codes to not just their official descriptions, but synonyms as well. We can also create "groups" so that someone could choose "Web Programming", and get back the 4 or 5 codes that are applicable.

This seems like it would be immensely important to our common goals. Do you guys think there's a good venue for making this a community project, and if so, what is it? (I can't keep track of all the civic hack days/hackathons/summer of codes/brigades, but hey, that's a good thing.)

Desired outcomes

I currently have 3 in mind:

An open dataset in a modern format (JSON) that includes all the NAICS codes, their descriptions, and common synonyms for each. It would also include a "groups" table that lists common businesses and the NAICS codes associated with each, e.g. "Web Developer" matches "Software programming services, custom computer" and "Application hosting". More results are always better.

Not sure how much demand there is for these groups, so that will need to be investigated.

A hosted API for the dataset. Can query by virtually any parameter. Speed is key.

A website that serves as the first client for the API. Allows a business to find their NAICS codes quickly and easily.
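
To make the first outcome a bit more concrete, here is a rough sketch of what the dataset and a "groups" lookup could look like. The field names ("title", "synonyms", "groups") and the choice of which codes belong to "Web Developer" are assumptions for illustration, not an agreed schema.

```python
import json

# Illustrative records only. The field names and the group membership below
# are assumptions for this sketch, not a schema anyone has agreed on.
DATASET = {
    "codes": {
        "541511": {
            "title": "Custom Computer Programming Services",
            "synonyms": ["Software programming services, custom computer"],
        },
        "518210": {
            "title": "Data Processing, Hosting, and Related Services",
            "synonyms": ["Application hosting"],
        },
    },
    "groups": {
        "Web Developer": ["541511", "518210"],
    },
}


def codes_for_group(dataset, group_name):
    """Return (code, title) pairs for a group, or an empty list if unknown."""
    codes = dataset["groups"].get(group_name, [])
    return [(code, dataset["codes"][code]["title"]) for code in codes]


if __name__ == "__main__":
    print(json.dumps(codes_for_group(DATASET, "Web Developer"), indent=2))
```

Keeping groups as plain lists of codes means the same code can appear in several groups without duplicating its description.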

And down the road...

Map NAICS codes to other industry codes. Some of these mappings already exist: http://www.naics.com/search.htm

Next Steps

Would love to have some more discussion around this before anything else. Let's keep discussion in this thread for now.

cjoh commented 11 years ago

Might be interesting to talk to the OpenCorporates people about this. http://opencorporates.com/

cjoh commented 11 years ago

Also worth noting is the NIGP Code -- which is a parallel to NAICS. In doing research on procurement websites for cities, NIGP is coming up more often than NAICS. http://en.wikipedia.org/wiki/NIGP_Code

ajb commented 11 years ago

Ah, that's the one that I couldn't remember.

Maybe we should expand this project to not be standard-specific, and instead house as many different kinds of codes as possible. (With instructions for adding others.)

spjika commented 11 years ago

Interested, and given that once again we have sold government IP to be privatized by a firm (NIGP), this is a fight too, which is all the more fun. Is this going to be like the DC Code first, to get free access to the public NIGP data?

cjoh commented 11 years ago

Commodity (three-digit) NIGP codes can be found in this Word document, courtesy of Detroit:

http://www.detroitmi.gov/Portals/0/docs/finance/purchasing/Purchasing%20Vendor%20Application%20and%20Commodity%20Code%20Listing.doc

daguar commented 11 years ago

So, this is kind of ridiculous because @adamjacobbecker and I were randomly chatting while I was writing a scraper for... a NAICS API.

@louh started this project, and I'm working on scraping content: https://github.com/louh/naics-api

Current deployed API example: http://naics-api.herokuapp.com/v0/q?year=2012&code=519120

And I'm working on getting the text content scraped: see this issue -- https://github.com/louh/naics-api/issues/5 and my pull request https://github.com/louh/naics-api/pull/6
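
For anyone who wants to call the deployed endpoint from a script, a small sketch of building the lookup URL shown above. Only the URL shape comes from the example; the helper name is made up, and the response format is not shown in this thread, so nothing about it is assumed here.

```python
from urllib.parse import urlencode

# Endpoint taken from the deployed example above (v0, so it may change).
BASE_URL = "http://naics-api.herokuapp.com/v0/q"


def build_query_url(code, year=2012):
    """Build the lookup URL for one NAICS code in a given edition year."""
    return f"{BASE_URL}?{urlencode({'year': year, 'code': code})}"


if __name__ == "__main__":
    print(build_query_url("519120"))
```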

eddietejeda commented 11 years ago

I am a fan of hosted APIs. You can embed the service into an app and as issues are clarified, everyone's app is updated.

But, as @adamjacobbecker said, performance is key. If you go that route, you have to make sure it can handle being used, possibly to autocomplete search terms.

I would need to give this a bit more thought, but what if the API is just a JSON file hosted on Amazon and then, as @daguar did, we create simple wrappers/scrapers in a few languages that can easily sync?

Instead of being dependent on a fully online service, there is also a local cache. The local library can try to sync daily/weekly/monthly and the user of the library will get nice methods, like:

NAICS::get_json(:code_id)

or more specific:

NAICS::get_title(:code_id)
NAICS::get_description(:code_id)
NAICS::get_group_label(:code_id)

and even nice things like:

NAICS::find_by_title(:search_term)
NAICS::find_by_description(:search_term)

And the option to sync:

NAICS::force_sync_with_api()

...

Ultimately, this is how most developers would use the service and not have to write their own custom wrappers with every single implementation.
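
A minimal Python sketch of the local-cache wrapper described above, assuming the synced dump is a JSON object keyed by code with a "title" field per record. The class name, file layout, and field names are hypothetical, and the sync step is left out.

```python
import json
from pathlib import Path


class NaicsCache:
    """Read-only lookups over a locally cached JSON dump (hypothetical layout)."""

    def __init__(self, records):
        # Assumed shape: {"519120": {"title": "..."}, ...}
        self._records = records

    @classmethod
    def from_file(cls, path):
        """Load the cache from a JSON file previously synced to disk."""
        return cls(json.loads(Path(path).read_text(encoding="utf-8")))

    def get_json(self, code_id):
        """Return the full record for a code, or None if it is not cached."""
        return self._records.get(str(code_id))

    def get_title(self, code_id):
        record = self.get_json(code_id)
        return record["title"] if record else None

    def find_by_title(self, search_term):
        """Case-insensitive substring match on titles; returns sorted codes."""
        needle = search_term.casefold()
        return sorted(
            code
            for code, record in self._records.items()
            if needle in record["title"].casefold()
        )


if __name__ == "__main__":
    cache = NaicsCache({"519120": {"title": "Libraries and Archives"}})
    print(cache.get_title("519120"))
    print(cache.find_by_title("librar"))
```

Because every lookup runs against the in-memory copy, nothing here depends on the remote service being up once the file has been synced.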

louh commented 11 years ago

The thing about NAICS codes is that they change only every five years. Once the data is there, a developer who needs it can just deploy the API to a server that works best for them and not have to worry about it again for a while. That's a compromise solution; the best-case scenario is not having to go through that extra step. Then you also get new features immediately when the hosted API updates, such as better search functionality. (I haven't shown this to you guys yet, but the Census search form is terrible. I have one use case, where you search for "Space Research," and it misses a few obvious matches where the terms are right there in the title.)

ajb commented 11 years ago

I really like the idea of the data just living in a JSON file somewhere, as long as it remains small enough. Somehow I thought the database would be a lot bigger, but Lou's current db is under 0.5 MB.

@eddietejeda are there any precedents for libraries like the one you describe? How would we cache everything locally? One issue I can see us having is searchability -- how fast will it be to do a full-text search of all codes? Will it still be fast enough when/if we add NIGP codes? Prove me wrong, but I feel like we can only get away with our database being a .json file for so long.
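
On the searchability question, one common way to keep full-text lookups fast over a file-sized dataset is to build an in-memory inverted index once at load time. This is only a sketch under that assumption: it matches whole words exactly (no stemming, no ranking), and the function names are made up.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(titles):
    """Map each word to the set of codes whose title contains it."""
    index = defaultdict(set)
    for code, title in titles.items():
        for token in tokenize(title):
            index[token].add(code)
    return index


def search(index, query):
    """Return sorted codes whose titles contain every word in the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set.intersection(*(index.get(token, set()) for token in tokens))
    return sorted(matches)


if __name__ == "__main__":
    titles = {
        "519120": "Libraries and Archives",
        "541511": "Custom Computer Programming Services",
        "541512": "Computer Systems Design Services",
    }
    index = build_index(titles)
    print(search(index, "computer services"))
```

Each query then costs a few set lookups rather than a scan of every record, so adding a second code system should mostly affect index build time and memory.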

daguar commented 11 years ago

I'm a big fan of a dump+API approach.

If you want to do autocomplete, you REALLY don't want to be doing that with a remote service you don't control. I think some API endpoint for a hash check about whether to update your local collection is the best approach.
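
A rough sketch of that hash check: the client hashes its local dump and compares it with a digest published by the API. The choice of SHA-256 and the idea of passing in a `fetch_remote_digest` callable are assumptions for this sketch; no such endpoint exists in the thread, so the network call is left to the caller.

```python
import hashlib


def digest(data: bytes) -> str:
    """Hex SHA-256 digest of a dump's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def needs_update(local_dump: bytes, fetch_remote_digest) -> bool:
    """True when the remote digest differs from the local dump's digest.

    fetch_remote_digest is any zero-argument callable returning the digest
    string the server publishes (a hypothetical endpoint in this sketch).
    """
    return digest(local_dump) != fetch_remote_digest()


if __name__ == "__main__":
    local = b'{"519120": "Libraries and Archives"}'
    # Stand-in for a real HTTP call: pretend the server has the same dump.
    print(needs_update(local, lambda: digest(local)))
```

The client only downloads the full dump when the digests differ, so the periodic check stays cheap.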

louh commented 11 years ago

@adamjacobbecker I'm also missing descriptions and keywords right now, which @daguar is writing a scraper for. The full text in their PDF (I just did a straight copy-paste of its contents to a text file) is about 1.5 MB. We're likely to move toward a database and have it return results in JSON. For now, codes and titles are just JSON so I can test out the API.

+1 to @daguar's suggestion above, although having a remote service available will be great for anyone who is writing a small script and doesn't want an extra step for a local deployment. But ultimately, I'm doing the NAICS work to make it easier to write a better business registration tool for Las Vegas (e.g. OpenCounter), so if we wanted to do autocomplete (which we might) then we'd also have our local API for that.

kaitlin commented 11 years ago

I think this work to make NAICS more accessible is great, but I want to raise what I think is a substantive point against building on top of NAICS. In practice, NAICS is a self-reported industrial classification of the contractor, and not necessarily a classification of what is being purchased. The federal contracting data has Product and Service codes in addition to NAICS that detail the category of the item purchased. A big contractor like General Dynamics may self-report its industrial classification as whatever is on its tax form or D&B registration, but they could be selling the government something that is classified under a totally different industrial category. It's one thing if the granularity of NAICS is lacking, but if it's not describing the actual thing being bought, then I think it is not so useful.

louh commented 11 years ago

@kaitlin This limitation of NAICS is probably why they've extended it to a system called NAPCS (North American Product Classification System) to address what you're talking about here. I'm personally not at all familiar with NAPCS, but a cursory glance through some pages seems to indicate that (1) it's still a draft and (2) it's not in widespread use. Perhaps the goal is to eventually replace NIGP? I'm just spitballing here, but eventually if this system gets adopted then it would be great to have a NAICS API already established to build from.

ajb commented 11 years ago

paging @GovInTrenches :pager: :bell:

ajb commented 11 years ago

(I haven't shown this to you guys yet, but the Census search form is terrible. I have one use case, where you search for "Space Research," and it misses a few obvious matches where the terms are right there in the title.)

@cjoh has a screenshot of my favorite search on FedBizOpps: we queried the system for "web design" and the first result was for a trebuchet. no lie.

GovInTrenches commented 11 years ago

Sorry, National Day of Civic Hacking is taking up all my bandwidth

SmartChicago should be able to provide hosting for the app. Once NDoCH ends I should also be able to put more research time in. I agree about the API. Also would be more than happy to do the necessary grunt work to get the NIGP stuff.

ajb commented 11 years ago

No worries, thanks @GovInTrenches. Just wanted to make sure there was nothing big we were missing since it seems like you're familiar with the issue.

louh commented 11 years ago

National Day of Civic Hacking story:

A group here is playing around with business license data that Las Vegas released for the hackathon. They store some very generic business type names, but have more fine-grained data with NAICS codes. So I pointed them to the naics-api repo.

Then, to make it work a bit better for them, I built search functionality:

http://naics-api.herokuapp.com/v0/s?year=2012&terms=libraries

It still sucks, as far as what it needs to do eventually. It hasn't surpassed Census search functionality yet. But it's a start, and I'd just like to humblebrag because I did this and I don't even consider myself a real programmer. And if there's anything anyone wants to do to make it better, please do.

louh commented 11 years ago

Whooo here is a demo:

http://louh.github.io/naics-search/

dobtco / NAICS

Initial Discussion #1
