esonderegger / fecfile

a python parser for the .fec file format
https://esonderegger.github.io/fecfile/
Apache License 2.0

Where are you getting the mappings data? #39

Open NickCrews opened 1 year ago

NickCrews commented 1 year ago

Hi!

I'm working on a port of this to rust. I'm trying to decide where to source the schema mappings. Possible options I've found are:

I found that FastFEC chose to use this repo as their upstream.

Could you explain what you see as the pros/cons of each of these?

So far, what I see are:

Are there other considerations I'm missing?

I would love for there to be one complete and accurate listing of schemas, so that the wide range of parsers out there would not each have to duplicate this effort. Any idea what would be required to make that happen?

CC @esonderegger @dwillis @freedmand

esonderegger commented 1 year ago

If I have one regret about how this library is set up, it is that I didn't use a git submodule to make the CSV files from https://github.com/dwillis/fech-sources the source of the mapping data.
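
To make that concrete, here is a minimal sketch of what consuming those CSVs from a vendored submodule could look like. The path, file name, and column layout below are assumptions for illustration, not the documented fech-sources format:

    # Hypothetical layout: the submodule lives at ./fech-sources and each form has a
    # CSV whose first column is the canonical field name and whose remaining columns
    # hold that field's position in each filing version (blank = not present).
    # Added via e.g.: git submodule add https://github.com/dwillis/fech-sources
    import csv
    from pathlib import Path

    def load_mapping(csv_path):
        """Return {version: [field names in column order]} for one form's CSV."""
        with open(csv_path, newline="") as f:
            rows = list(csv.reader(f))
        versions = rows[0][1:]  # header row after the field-name column
        positions = {v: [] for v in versions}
        for row in rows[1:]:
            field, cells = row[0], row[1:]
            for version, cell in zip(versions, cells):
                if cell.strip():
                    positions[version].append((int(cell), field))
        return {v: [f for _, f in sorted(cols)] for v, cols in positions.items()}

    # mapping = load_mapping(Path("fech-sources") / "sa.csv")  # hypothetical file name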

At the time, the Senate was still doing its filings on paper, and those Senate filings were important to my employer. So I spent a fair amount of time extending the js/json file that @chriszs had built for fec-parse (which also uses fech-sources as its upstream source of data). I found the JSON easy to hand-edit, but getting the mappings I added back into fech-sources has long been on my to-do list.

The most definitive source of data for these mappings is the xls/xlsx files the FEC hosts here. Click to expand "Electronically filed reports" and "Paper filed reports" and then click to download the "File formats, header file and metadata" files. I also highly recommend reading through those files before embarking on writing a parser, as they will be a huge help in understanding how the filings are structured.

To answer your two questions:

What I see as the pros/cons of each are:

One additional reason to use the CSVs as your data source: if/when you find issues in them, PRs into fech-sources to fix them will benefit everyone, since we are all downstream of that repo.

Good luck! Please don't hesitate to reach out if you have any questions.

chriszs commented 1 year ago

I spent some time a year ago trying to reproduce the CSVs from the original source Excel files, or really to merge the two and create JSON Schemas with typing information. This won't come as a surprise to you, but what I found is that both sources are fairly dirty, the CSVs are sometimes incorrect, there are a ton of records once you multiply the number of fields by the number of versions, and JSON Schema has a lot of depth, so it's a difficult task.

Some of that work was the basis of the draft PR you used as the starting point for your fech-sources contribution. A contractor for the FEC was slowly working on something similar in their fecfile-validate project, but with a slightly different scope (just current filings). FastFEC uses a version of the two .json files, including mappings.json, which I converted from Derek's original Ruby and which Evan and I then improved over time. That's as close to a clean source as you'll find, though it originally derives from the CSVs.
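
For anyone following along, those mappings.json files are (roughly) nested objects keyed by a form-type pattern and then a filing-version pattern, with an ordered list of column names as the value, so a lookup looks something like the sketch below. The patterns and field names here are made up for illustration, not copied from the real file:

    # Illustrative lookup over a mappings.json-style structure; the keys and field
    # names are invented for this example.
    import re

    MAPPINGS = {
        "^sa": {
            "^8": ["form_type", "filer_committee_id_number", "transaction_id"],
            "^[5-7]": ["form_type", "filer_committee_id_number"],
        },
    }

    def columns_for(form_type, version):
        """Return the column list whose form-type and version patterns both match."""
        for form_pattern, by_version in MAPPINGS.items():
            if re.match(form_pattern, form_type, re.IGNORECASE):
                for version_pattern, columns in by_version.items():
                    if re.match(version_pattern, version):
                        return columns
        raise KeyError(f"no mapping for {form_type} v{version}")

    # columns_for("SA11AI", "8.3") -> ["form_type", "filer_committee_id_number", "transaction_id"]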

chriszs commented 1 year ago

Oh, also Evan is correct about F3s. There's a PDF technical manual somewhere on the FEC site which details some of this; I'll link to it if I can find it again.

NickCrews commented 1 year ago

Thank you both so much for this. Oh man I just got overwhelmed ;)

I've been looking into JSONSchema for a while, and I've tentatively concluded that it is overkill for what we need, but here are a few thoughts I wanted to write down.

JSONSchema musings

fecfile-validate looks as canonical as you can get. They appear to source their schemas from the .xls files you mentioned above, but it seems they also don't trust those .xls files and have to hand-edit them.

@chriszs by "current filings" do you mean fecfile-validate only supports filing versions 8.3+? That wouldn't be adequate for my needs (and I bet others'). I doubt the FEC will be motivated to support older versions, so we would need to supplement this.

@chriszs mentions the combinatorial explosion, but I think we could get around this by re-using sub-schemas. Am I missing something there? Still, I'm not sure we need the full power of JSONSchema, so I'm not sure it's worth bringing in that complication. Am I right that all we need beyond the column names are the dtypes that should get parsed? I don't think we need the full

"form_type": {
            "title": "FORM TYPE",
            "description": "",
            "const": "SA11C",
            "examples": [
                "SA11C"
            ],
            "fec_spec": {
                "FIELD_DESCRIPTION": "FORM TYPE",
                "TYPE": "A/N-8",
                "REQUIRED": "X (error)",
                "SAMPLE_DATA": "SA11C",
                "VALUE_REFERENCE": "SA11C Only",
                "RULE_REFERENCE": null,
                "FIELD_FORM_ASSOCIATION": null
            }
        }

that JSONSchema provides.
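
To make that concrete, here is roughly the slimmed-down shape I have in mind: just ordered column names plus a parse type per field. This is my own sketch with made-up field names and types, not anything fecfile-validate or fech-sources currently publishes:

    # A "columns + dtypes" mapping instead of a full JSON Schema. Field names and
    # types are illustrative only.
    from datetime import datetime
    from decimal import Decimal

    SA_FIELDS = [
        ("form_type", str),
        ("filer_committee_id_number", str),
        ("contribution_date", lambda s: datetime.strptime(s, "%Y%m%d").date()),
        ("contribution_amount", Decimal),
    ]

    def parse_row(raw_values):
        """Zip a raw row against the field spec, coercing non-empty values."""
        return {
            name: (cast(value) if value else None)
            for (name, cast), value in zip(SA_FIELDS, raw_values)
        }

    # parse_row(["SA11AI", "C00123456", "20230101", "250.00"])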

Path Forward

OK, it sounds like updating fech-sources is what both of you are most supportive of, and I think that would work just fine for me. Adding types would be great, but just having functional mappings would be fine. I think the to-dos would be:

[ ] merge https://github.com/dwillis/fech-sources/pull/11
[ ] figure out https://github.com/dwillis/fech-sources/issues/12
[ ] use the migration scripts @esonderegger wrote to bring back in the .json stuff, and add types. I can help here @esonderegger if you point me in the right direction.

CC @mjtravers from fecfile-validate, if you have any thoughts on how we could team up at all.

freedmand commented 1 year ago

Hi! Firstly: sorry I haven't been able to find time to get to your PR in FastFEC. (Though I have validated that there is no perf difference in your version, I did notice some diffs in the output that I'm still going through and trying to figure out.) We will be focusing more on FEC work at The Post later this year (and I'm hoping to find time sooner). But I do want to chime in here to say:

  1. Great to have you working on this! I'm very curious to know what your goals/motivation are for developing this generally. I would love to be able to collaborate as effectively as possible. It's a small world of folks tackling these problems, and you've cc'd a good chunk of them. If you ever want to hop on a call to discuss any of this, I'd be game to find some time.

  2. FastFEC very much comes from translating fecfile to C, and it is downstream of all the work you've identified. It seems like you've already mostly uncovered this lineage, in addition to how the mapping files have been handed down. At some point, cleaning all this up and having a centralized source that's community- and/or FEC-maintained would be wonderful. I think filings <8.3 are handled decently well by the current mappings files; at least they have been for our purposes at The Post, loading many historic filings into a centralized/searchable database. But it would be worth investigating further; there are surely possible improvements.

  3. @chriszs indeed has put some time into trying to standardize the typings in a unified way. He can correct me, but I think re: the combinatorial explosion, there are just very minute differences between each version that would be hard to capture in a nice way, even with reusing sub-schemas. It's a painstaking process generally, as the source xls files mentioned above are not always perfect. And despite those files, filings themselves have various errors too, so parsers may need various layers of tolerance baked in to handle weirdness (filings can be very messy, e.g. missing columns, shifted columns, inconsistent formats, missing characters, etc.; a sketch of what I mean follows below).
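
Illustrative only, and not how FastFEC actually handles it: one common approach is to pad short rows with empty strings and drop trailing extras so that every row can be zipped against the expected column names.

    # Illustrative only: tolerate rows with missing or extra columns by padding with
    # empty strings or dropping trailing extras before zipping against column names.
    def normalize_row(raw_values, expected_columns):
        """Return {column: raw value}, padding or truncating the row as needed."""
        values = list(raw_values)
        if len(values) < len(expected_columns):
            values += [""] * (len(expected_columns) - len(values))  # missing columns
        else:
            values = values[: len(expected_columns)]  # drop extra columns
        return dict(zip(expected_columns, values))

    # normalize_row(["SA11AI", "C00123456"],
    #               ["form_type", "filer_committee_id_number", "transaction_id"])
    # -> {"form_type": "SA11AI", "filer_committee_id_number": "C00123456", "transaction_id": ""}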

Looking forward to seeing what you come up with. And thanks for organizing this discussion.

chriszs commented 1 year ago

Yes, my design heavily uses sub-schemas (see the sketch at the end of this comment).

Yes, there are a lot of edge cases.

Correct that fecfile-validate only seems interested in the current version.

I think a plan that focuses on improving the CSVs and converting from there sounds reasonable.
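
To make the sub-schema point concrete, here is a rough sketch of the reuse pattern: shared field definitions live under $defs, and each version's schema pulls them in with $ref, so only version-specific differences get spelled out. The layout, field names, and version split below are illustrative, not the actual fecfile-validate schemas:

    # Illustrative only: reusing shared field definitions across versions via $ref.
    # Requires the jsonschema package (pip install jsonschema).
    from jsonschema import validate

    COMMON_FIELDS = {
        "form_type": {"type": "string"},
        "filer_committee_id_number": {"type": "string", "pattern": "^C[0-9]{8}$"},
    }

    SCHEDULE_A_V8 = {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "$defs": {"common": {"properties": COMMON_FIELDS}},
        "allOf": [{"$ref": "#/$defs/common"}],
        "properties": {"contribution_amount": {"type": "string"}},  # v8-only field in this sketch
    }

    validate(
        {
            "form_type": "SA11AI",
            "filer_committee_id_number": "C00123456",
            "contribution_amount": "250.00",
        },
        SCHEDULE_A_V8,
    )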