fecgov / FEC

A general discussion forum for FEC.gov. This is the best place to submit general feedback.
https://www.fec.gov

Database dump, not just raw csv file dump #16

Closed. Solomon closed this issue 9 years ago.

Solomon commented 10 years ago

Getting FEC data into a database is difficult. https://github.com/Solomon/opensecrets_to_postgres

Since I imagine the data is originally coming from a database, can you open up your nightly database backups?

dwillis commented 10 years ago

OpenSecrets data isn't exactly the same as FEC data. I think getting FEC data into a database could be made easier, but it now comes in delimited per-cycle files. I'm not sure how much easier it could be.

Solomon commented 10 years ago

Rereading the issue, I see it was definitely unclear, so sorry for the confusion.

My thought was that the FEC is exporting a whole bunch of files from a database, which then immediately need to be uploaded to another database by people using the data, and we could just skip the middle step with a db dump.

It took me about a month of part-time grunt work to get the OpenSecrets data into a database, and I imagine it's the same for anyone using the raw FEC data. You've got to figure out the schema, create the tables, download and organize the 100 csv files, script the data imports, and deal with all the weird formatting errors.
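The steps above can be sketched end-to-end. This is a minimal illustration of the pattern, not the FEC's actual schema: the column names and sample rows are made up, and SQLite stands in for Postgres so the sketch is self-contained.

```python
import csv
import io
import sqlite3

# A made-up sample standing in for one FEC bulk file. The real bulk files
# are delimited text with a separately documented schema; the columns
# below are hypothetical.
SAMPLE = """C00000001|SMITH, JANE|500|2014-01-15
C00000002|DOE, JOHN|1000|2014-02-03
"""

def load(raw_text, conn):
    # 1. Create the table from the (assumed) schema.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS contributions "
        "(committee_id TEXT, contributor_name TEXT, amount INTEGER, date TEXT)"
    )
    # 2. Parse the pipe-delimited rows and bulk-insert them.
    rows = csv.reader(io.StringIO(raw_text), delimiter="|")
    conn.executemany("INSERT INTO contributions VALUES (?, ?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(SAMPLE, conn)
print(conn.execute("SELECT COUNT(*) FROM contributions").fetchone()[0])  # 2
```

In practice each of these steps has to be repeated per file and per cycle, plus cleanup for the formatting errors mentioned above, which is where the month of grunt work goes.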

By releasing the database dump, you eliminate the need for people to spend all that time getting the data into a usable format. You also open up the data to a class of people who are comfortable using SQL, but don't have enough of a background to get the raw data into a database.

dwillis commented 10 years ago

I'm just not convinced that this is a big problem for a lot of people. A couple of thoughts:

  1. Switching to a db dump method would break backwards compatibility for long-time and heavy users of this data. The schemas are readily available, and the variations in importing the per-cycle files are very small (but should be better documented). I'm not sure I've met too many folks who are comfortable with SQL but can't import data. Also, pretty sure FEC uses Oracle - should it be an Oracle dump? A MySQL dump? Postgres? I don't really want them picking the format. There are also academic users who use R. Text files are much more useful to many of them than a SQL dump.
  2. There are Ruby and Python libraries for importing the FTP data, and probably some others. The formatting errors you mention do exist, but they can (and should) be fixed in the FTP data so that imports aren't a problem.

Solomon commented 10 years ago

Fair enough. I wasn't advocating getting rid of the text file dumps, since I know a lot of people rely on them for data pipelines. Rather, since the backups are happening anyway, just opening them up might be helpful.

If the FEC is using Oracle, opening up those backups wouldn't be useful for most people so I'm cool with closing this issue.

catherinedevlin commented 9 years ago

For our openFEC project so far, we are building a PostgreSQL mirror of FEC's (Oracle) CFDM data warehouse, and in principle making a pg_dump of it available would be technically easy. The questions to be answered would be
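As a sketch of how small the dump side would be (the database name and output filename here are illustrative, not the project's actual setup, and the commands need a live Postgres server):

```shell
# Custom-format dump: compressed, and restorable table-by-table with pg_restore.
# "cfdm_mirror" is a placeholder database name.
pg_dump --format=custom --no-owner --no-privileges \
    --file=fec_cfdm_$(date +%Y%m%d).dump cfdm_mirror
```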

saizai commented 9 years ago

@catherinedevlin The FEC has no authority to place any usage restrictions on the data. It's open government data, with no copyright. You can't use donor info for (roughly) commercial solicitation purposes, but that's not a restriction on dissemination.

saizai commented 9 years ago

(Also, I'm looking into hosting a MySQL-based mirror on makeyourlaws.org.)

Solomon commented 9 years ago

@catherinedevlin You might be able to host the file on GitHub Releases, which has no file or bandwidth limits right now (although that might change): https://help.github.com/articles/about-releases/


dwillis commented 9 years ago

@saizai: As your comment makes clear, the FEC has the authority to place usage restrictions on it - donor solicitation is a use - but more to the point, @catherinedevlin isn't making a copyright claim. Making a database dump available isn't a trivial matter, and it needs to be something the FEC can maintain and pay for.

Perhaps more important, you can already load the bulk files. Anyone who could make use of a database dump could also load the files themselves.

saizai commented 9 years ago

@dwillis Loading the files is fairly time-consuming. Yes, I can do it, but why should everyone have to duplicate the effort versus straight-up SQL? One could even have an open SQL master set up so anyone can slave the tables directly…

Anyway, since I'm already loading it in bulk, I'll look into hosting it in SQL format for others too. It's a nice thing for my non-profit to do. ;-)

(Also, usage restrictions != dissemination restrictions. Unlike, say, voter registration databases, which have all sorts of different restrictions imposed by the various states.)

@Solomon While Releases might not have file limits, GitHub itself does: https://help.github.com/articles/working-with-large-files/

dwillis commented 9 years ago

@saizai Yes, loading the files is time-consuming, although you can script it so it happens while you sleep :-) (And you made the initial point about usage restrictions, not me. I was just saying that it was a misplaced criticism, as the FEC clearly has the legal authority to do so.)

I think one good reason why I'd rather the FEC not provide SQL dumps is that it increases the amount of work the FEC has to do (and the work is essentially duplicative). Any time the FEC spends on this is at the expense of something they are already doing. The data is already being disseminated; should the FEC provide dumps in multiple systems? Do they support every db platform? NoSQL?

In other words, "straight up SQL" is a matter of opinion and perspective. I'd rather the FEC spend its time on improving its data offerings, not duplicating what it already has.

LindsayYoung commented 9 years ago

duplicate of 18F/openFEC/issues/1329

Solomon commented 8 years ago

I think providing the database dumps is still worthwhile; the benefit would outweigh the extra work involved.

In particular, opening up a dump of the data would allow people who have basic SQL skills to run a pg_restore and immediately start querying the data. There are a lot of people who know SQL, but don't have the skills to build a data pipeline.
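For readers unfamiliar with the workflow being described: restoring a published dump would be a couple of commands (the filenames, database name, and table name here are hypothetical, and this requires a local Postgres install):

```shell
# Create an empty database and restore the published dump into it.
createdb fec
pg_restore --no-owner --dbname=fec fec_cfdm.dump

# Then start querying immediately ("candidates" is a placeholder table name).
psql fec -c "SELECT COUNT(*) FROM candidates;"
```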

With CSV files, even people who have the skills to write a data pipeline could need a few months of part-time work to get the data into Postgres. It took me around 3 months of part-time work to build a pipeline for the OpenSecrets data, and it's still a big pain to keep updated. This is a prohibitive amount of work for a lot of people and institutions who would want to use this data.

By creating the dump, you also open up this data to be used in hackathons, weekend projects, and a host of other places where people will be scared away by the work of creating the pipeline.

I would be very open to implementing this with as little overhead as possible, in a way that made sense to the FEC/18F. I wouldn't expect daily releases. Even if you only released a database dump every month or every quarter, it would be incredibly useful. Even if you only ever released one database dump, people would get a lot of benefit from it.