Closed Solomon closed 9 years ago
OpenSecrets data isn't exactly the same as FEC data. I think getting FEC data into a database could be made easier, but it now comes in delimited per-cycle files. I'm not sure how much easier it could be.
After reading the issue again, it was definitely unclear, so sorry for the confusion.
My thought was that the FEC is exporting a whole bunch of files from a database, which then immediately need to be uploaded to another database by people using the data, and we could just skip the middle step with a db dump.
It took me about a month of part-time grunt work to get the OpenSecrets data into a database, and I imagine it's the same for anyone using the raw FEC data. You've got to figure out the schema, create the tables, download and organize the 100 csv files, script the data imports, and deal with all the weird formatting errors.
By releasing the database dump, you eliminate the need for people to spend all that time getting the data into a usable format. You also open up the data to a class of people who are comfortable using SQL, but don't have enough of a background to get the raw data into a database.
I'm just not convinced that this is a big problem for a lot of people. A couple of thoughts:
Fair enough. I wasn't advocating getting rid of the text file dumps, since I know a lot of people rely on them for data pipelines. Rather, since the backups are happening anyways, just opening them up might be helpful.
If the FEC is using Oracle, opening up those backups wouldn't be useful for most people so I'm cool with closing this issue.
For our openFEC project so far, we are building a PostgreSQL mirror of FEC's (Oracle) CFDM data warehouse, and in principle making a pg_dump of it available would be technically easy. The questions to be answered would be
@catherinedevlin The FEC has no authority to place any usage restrictions on the data. It's open government data, no copyright. You can't use donor info for ~ commercial solicitation purposes — but that's not a restriction on dissemination.
(Also, I'm looking into hosting a mysql based mirror on makeyourlaws.org.)
@catherinedevlin You might be able to host the file on github releases, which has no file or bandwidth limit right now (although that might change) https://help.github.com/articles/about-releases/
On Sat, Mar 28, 2015 at 2:03 PM, Sai notifications@github.com wrote:
(Also, I'm looking into hosting a mysql based mirror on makeyourlaws.org.)
— Reply to this email directly or view it on GitHub https://github.com/18F/FEC/issues/16#issuecomment-87274611.
@saizai: As your comment makes clear, the FEC has the authority to place usage restrictions on it - donor solicitation is a use - but more to the point, @catherinedevlin isn't making a copyright claim. Making a database dump available isn't a trivial matter, and it needs to be something the FEC can maintain and pay for.
Perhaps more important, you can already load the bulk files. Anyone who could make use of a database dump could also load the files themselves.
@dwillis Loading the files is fairly time-consuming etc. Yes, I can do it, but why should everyone have to duplicate the effort vs straight up SQL? One could even have an open SQL master set up so anyone can slave the tables directly…
Anyway, since I'm already loading it in bulk, I'll look into hosting it in SQL format for others too. It's a nice thing for my non-profit to do. ;-)
(Also, usage restrictions != dissemination restrictions. Unlike, say, voter registration databases, which have all sorts of different restrictions imposed by the various states.)
@Solomon While releases might not have file limits, github itself does. https://help.github.com/articles/working-with-large-files/
@saizai Yes, loading the files is time-consuming, although you can script it so it happens while you sleep :-) (And you made the initial point about usage restrictions, not me. I was just saying that it was a misplaced criticism, as the FEC clearly has the legal authority to do so.)
I think one good reason why I'd rather the FEC not provide SQL dumps is that it increases the amount of work the FEC has to do (and the work is essentially duplicative). Any time the FEC spends on this is at the expense of something they already are doing. The data is already being disseminated; should the FEC provide dumps in multiple systems? Do they support every db platform? No-SQL?
In other words, "straight up SQL" is a matter of opinion and perspective. I'd rather the FEC spend its time on improving its data offerings, not duplicating what it already has.
duplicate of 18F/openFEC/issues/1329
I think providing the database dumps is still worthwhile, and the benefit for the extra work would outweigh the effort involved.
In particular, opening up a dump of the data would allow people who have basic SQL skills to run a pg_restore
and immediately start querying the data. There are a lot of people who know SQL, but don't have the skills to build a data pipeline.
With CSV files, even for people who have the skills to write a data pipeline, it could take a few months of part time work to get the data into postgres. It took me around 3 months of part time work to build a pipeline for the OpenSecrets data, and it's still a big pain to keep updated. This is a prohibitive amount of work for a lot of people and institutions who would want to use this data.
By creating the dump, you also open up this data to be used in hackathons, weekend projects, and a host of other places where people will be scared away by the work of creating the pipeline.
I would be very open to implementing this with as little overhead as possible, in a way that made sense to the FEC / 18f. I wouldn't expect daily releases. Even if you only released a database dump every month or every quarter, it would be incredibly useful. Even if you only released one database dump ever, people would get a lot of benefit from it.
Getting FEC data into a database is difficult. https://github.com/Solomon/opensecrets_to_postgres
Since I imagine the data is originally coming from a database, can you open up your nightly database backups?