GSA / datagov-wptheme

Data.gov WordPress Theme (obsolete)
https://www.data.gov

Analysis of which open source projects are using federal datasets, using GitHub search #464

Open konklone opened 10 years ago

konklone commented 10 years ago

You can search the full text of all open source code on GitHub (which is insane and magnificent), and one trick is to search it for domain names and URLs.

Searching GitHub code for "data.gov" returns over 45,000 results, plus ~350 issues. There are all kinds of data quality issues -- for example, "data.gov" also matches "api.data.gov", dots aren't handled perfectly, and many repos come up more than once. But it's still pretty cool.

I think it'd be more useful to take the URLs of datasets (and landing pages for datasets) known to Data.gov and run them through GitHub search and see what sort of things come up. It'd be a neat lead generator for finding people working on niche or in-progress tools that you'd be unlikely to come across in the press. Maybe it's even a way to get deserving projects more press.

Unfortunately, GitHub doesn't offer an API or feed of search result data across GitHub -- you have to use the web interface. (There is a Search API, but unlike the web interface, it requires that searches be limited to a user, organization, or repository.)
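For reference, here's roughly what the Search API does let you do -- code search scoped to a single repository. A minimal sketch in Python, assuming the `requests` library; the token and the repo in the qualifier are just placeholders:

```python
# Minimal sketch of GitHub's code Search API, which only accepts queries scoped
# with a user:, org:, or repo: qualifier. The token and repo are placeholders.
import requests

GITHUB_TOKEN = "YOUR_TOKEN"  # code search requires an authenticated request

resp = requests.get(
    "https://api.github.com/search/code",
    params={"q": "data.gov repo:GSA/data.gov"},  # scope qualifier is required
    headers={
        "Authorization": "token " + GITHUB_TOKEN,
        "Accept": "application/vnd.github.v3+json",
    },
)
resp.raise_for_status()
for item in resp.json().get("items", []):
    print(item["repository"]["full_name"], item["path"])
```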

So you'd probably have to scrape GitHub.com, the website, and page through the results. That sounds like no fun, but to the right kind of brain it might actually sound really fun.
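If someone does go the scraping route, the skeleton is simple enough -- the work is in throttling and in keeping up with GitHub's markup changes. A rough Python sketch; the URL parameters, HTML selector, and sleep interval are all assumptions:

```python
# Rough sketch of paging through the web search interface. The URL parameters
# and the HTML selector are assumptions and will break whenever GitHub changes
# its markup; polite throttling is essential.
import time

import requests
from bs4 import BeautifulSoup

def repos_referencing(query, max_pages=5):
    repos = set()
    for page in range(1, max_pages + 1):
        resp = requests.get(
            "https://github.com/search",
            params={"q": query, "type": "Code", "p": page},
            headers={"User-Agent": "dataset-usage-research"},
        )
        if resp.status_code != 200:  # rate-limited or out of pages
            break
        soup = BeautifulSoup(resp.text, "html.parser")
        # Assumed markup: result links point at /owner/repo/blob/...
        for link in soup.select('a[href*="/blob/"]'):
            owner, repo = link["href"].lstrip("/").split("/")[:2]
            repos.add(owner + "/" + repo)
        time.sleep(10)  # be a polite scraper
    return repos

print(sorted(repos_referencing('"data.gov"')))
```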

One other maybe-fun caveat is that if there are datasets that have support libraries built for them, some projects may just reference those libraries instead -- and so the URL for the dataset would only appear in the library, not the project. I can imagine this being the case for a well-established data agency like the US Census, for example. So identifying support libraries and searching GitHub's code for references to them would also help dig up some leads. But even for the Census' domain name, there are lots of little projects.

nsinai commented 9 years ago

This seems like an interesting project for someone to do -- maybe Sunlight would be interested? Or Waldo Jaquith?

rebeccawilliams commented 9 years ago

Couldn't hurt to ask directly! This is definitely a help-wanted item that would greatly benefit both internal and external open data advocacy.

@jamesturk -- This might help with Sunlight's goals of finding Social Impact Examples.

@waldoj -- I can see the "lots of little projects" search being a great case for prioritizing high-value data, and it would also have lots of local implications!

Pinging @benbalter on this too, since it'd be great to highlight government's open source work with government open data and since maybe he has ideas on top of scraping GitHub.

benbalter commented 9 years ago

Awesome idea. As a developer, I suspect many code comments will have the URL to a dataset, which can be easily searched (tying the app back to the data).

I'd be glad to help with this effort, and can perform site-wide searches, if you send over e.g., a gist of keywords. If there's interest, I can also check in with the API team about the search API, but suspect that's the harder route.

nsinai commented 9 years ago

you rock, @benbalter

waldoj commented 9 years ago

Somehow I missed this when @rebeccawilliams tagged me.

I think it'd be more useful to take the URLs of datasets (and landing pages for datasets) known to Data.gov and run them through GitHub search and see what sort of things come up.

@konklone, wouldn't this be hundreds of thousands of URLs? Perhaps it would be easier to search GitHub for all references to .gov domains? Although I'm not sure what to do with that list once you had it. Maybe download all files that contain those references, determine what the URLs are, and once you had a frequency distribution, start to figure out how to filter it?
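That last step seems very doable. A sketch of the frequency-distribution part, assuming the matching files have already been downloaded into a local directory (the directory name and the deliberately loose regex are assumptions):

```python
# Sketch: given a directory of files pulled down from search results, extract
# .gov URLs and count how often each domain shows up.
import collections
import pathlib
import re
from urllib.parse import urlparse

GOV_URL = re.compile(r'https?://[A-Za-z0-9.\-]+\.gov\b[^\s\'"<>]*')

counts = collections.Counter()
for path in pathlib.Path("downloaded_files").rglob("*"):
    if not path.is_file():
        continue
    text = path.read_text(errors="ignore")
    for url in GOV_URL.findall(text):
        counts[urlparse(url).netloc.lower()] += 1

for domain, n in counts.most_common(25):
    print(n, domain)
```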

konklone commented 9 years ago

@konklone, wouldn't this be hundreds of thousands of URLs?

We'll probably need computers to do it for us!

Perhaps it would be easier to search GitHub for all references to .gov domains?

That's a cool idea too. I'm not sure whether it's more computationally expensive to search for "anything ending in .gov" or for a few hundred thousand specific URLs. I'm also not sure how much work it'd be to filter all the .gov results down to just the uses of datasets listed on Data.gov.

I'd be glad to help with this effort, and can perform site-wide searches, if you send over e.g., a gist of keywords. If there's interest, I can also check in with the API team about the search API, but suspect that's the harder route.

@benbalter You're awesome. Maybe the easiest way to start is to pick the top ~1000 most viewed/used datasets, and see how that goes? Is it easy to export that from the Data.gov database or API?
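For the export (and later filtering) question, the catalog itself is queryable -- catalog.data.gov runs CKAN, so something like the sketch below could pull dataset landing pages and resource URLs to feed into the searches. The `package_search` endpoint is standard CKAN; whether Data.gov exposes a view-count sort through it is an assumption, so that sort field may need to change or be dropped:

```python
# Sketch: pull dataset landing pages and resource URLs from the Data.gov CKAN
# catalog, to use as GitHub search terms. The views-based sort is an assumption.
import requests

def catalog_urls(rows=1000):
    resp = requests.get(
        "https://catalog.data.gov/api/3/action/package_search",
        params={"rows": rows, "sort": "views_recent desc"},  # sort field assumed
    )
    resp.raise_for_status()
    urls = set()
    for dataset in resp.json()["result"]["results"]:
        urls.add("https://catalog.data.gov/dataset/" + dataset["name"])  # landing page
        for resource in dataset.get("resources", []):
            if resource.get("url"):
                urls.add(resource["url"])  # direct data file / API endpoint
    return urls

print(len(catalog_urls()))
```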

Also @benbalter, GitHub has become such an absolutely amazing source for research -- I wonder if you could kick off a conversation internally on how to help researchers perform analysis of GitHub's contents. Maybe that's specially managed API keys (like how Twitter specially manages access to "the firehose"), or maybe it's a formal public system for queuing up large-scale search terms, or maybe it's even a public bulk dataset of everything on public GitHub, sent to the Internet Archive once a month(?!).

waldoj commented 9 years ago

I'm just thinking that GitHub might not be thrilled to see hundreds of thousands of queries being issued against their search API. It basically sounds like a denial of service attack. :)