Nonprofit-Open-Data-Collective / irs-efile-master-concordance-file

The Master Concordance File defines standards and provides documentation necessary to build structured databases from the IRS E-File XML files posted on AWS.
https://nonprofit-open-data-collective.github.io/irs-efile-master-concordance-file/

fix some typos in the version column of the `efiler_master_concordance.csv` file #35

Open frisson opened 4 years ago

frisson commented 4 years ago

Description

A cursory glance at the `efiler_master_concordance.csv` file showed some anomalies in the version column.

Running the pandas code below on the file this PR branches off of surfaces some of them.

import pandas as pd

# read the concordance with an explicit dtype for every column
concordance = pd.read_csv(
    "./efiler_master_concordance.csv",
    dtype=dict(
        variable_name="category",
        description="object",
        scope="category",
        location_code="category",
        form="category",
        part="category",
        data_type="category",
        required="boolean",
        cardinality="float64",
        rdb_table="float64",
        xpath="object",
        version="category",
        production_rule="float64",
        last_version_modified="object",
    ),
)
# normalize versions by splitting the string on ';' and stripping each element
# before joining them again
concordance["version"] = concordance.version.apply(
    lambda x: ";".join([v.strip().lower() for v in str(x).split(";") if v.strip()])
)
concordance.head()
# collect the distinct version strings found across all rows
versions = sorted(
    {
        ver
        for joined in concordance.version.unique()
        for ver in str(joined).split(";")
        if ver
    }
)
print(versions)
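
Once the distinct versions are listed, the odd ones can be flagged automatically. The following is only a rough sketch using the standard library: it assumes a well-formed version looks like `2013v3.0` (four-digit year, `v`, then major.minor), and both that pattern and the sample cells are assumptions made up for illustration, not values taken from the concordance file. The helper names `split_versions` and `find_anomalies` are hypothetical.

```python
import re

# Assumption: a well-formed version looks like "2013v3.0"
# (four-digit year, "v", then major.minor).
VERSION_PATTERN = re.compile(r"^\d{4}v\d+\.\d+$")


def split_versions(raw):
    """Split a ';'-delimited version cell into stripped, lowercased tokens."""
    return [tok.strip().lower() for tok in str(raw).split(";") if tok.strip()]


def find_anomalies(cells):
    """Return the sorted, de-duplicated tokens that do not match VERSION_PATTERN."""
    return sorted(
        {
            tok
            for cell in cells
            for tok in split_versions(cell)
            if not VERSION_PATTERN.match(tok)
        }
    )


# Made-up cells for illustration only; not copied from the real file.
sample_cells = [
    "2013v3.0;2013v3.1",
    "2014v5.0; 2015v2.1 ",
    "2016v3.O",            # letter O instead of a zero
    "2013v3.0;;2014V6.0",  # empty element and an uppercase V
    "nan",                 # what str() produces for a missing value
]
print(find_anomalies(sample_cells))
```

Stray whitespace, empty elements and casing are normalized away before the check, so only tokens that are still malformed after normalization are reported.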


lecy commented 4 years ago

helpful, thank you! there are some major changes to the master concordance files coming shortly - mostly much better variable names, cleaner variable mapping, and extended documentation. if useful (since you are using the concordance) i can share the draft versions of these.

frisson commented 4 years ago

hey @lecy, glad to hear this is useful. it'd be great to get a look at those drafts, especially if the changes are coming soon.

lecy commented 4 years ago

Sharing one section here for a preview:

https://github.com/Nonprofit-Open-Data-Collective/irs-efile-master-concordance-file/blob/master/emc-f990-part-01-v2.csv

And also the updated instructions that are being used to revise the concordance files:

https://github.com/Nonprofit-Open-Data-Collective/irs-efile-master-concordance-file/blob/master/Instructions%20for%20Updating%20Concordance%20v3.1.pdf

In summary:

  1. Variable names are the biggest overhaul, since the version 1.0 names were script-generated and lacked human interpretability.
  2. Refinement of all of the xpath-to-variable mappings.
  3. Moving the SCOPE flag (whether variables occur on the 990 only, 990-EZ only, or PZ for both) from the variable name to a distinct column so it's easier to select as an attribute when selecting variables for a study.
  4. Adding table names to split form sections into a relational database separating one-to-one and one-to-many fields.
  5. The aggregated master concordance file with all xpaths is being split into separate CSV files for forms + parts to make them easier to validate and maintain (for example form-990-part-01 above: emc-f990-part-01-v2.csv).
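
To illustrate points 3 and 5, here is a minimal sketch of how the split files might be recombined and then filtered on the scope column. Everything in it is an assumption for illustration: the rows, the `scope` labels other than `PZ` (`PC` and `EZ` are placeholders), the second file name, and the helper names `load_parts` and `select_scope`. The real v2 files may use different columns.

```python
import csv
import io

# Hypothetical stand-ins for two per-part files. The column names and rows are
# illustrative assumptions, not the contents of the real v2 CSVs, and the
# second file name is invented for this example.
PART_FILES = {
    "emc-f990-part-01-v2.csv": (
        "variable_name,scope,xpath\n"
        "EXAMPLE_VAR_A,PZ,/Return/ReturnData/IRS990/ExampleA\n"
        "EXAMPLE_VAR_B,PC,/Return/ReturnData/IRS990/ExampleB\n"
    ),
    "emc-f990-part-02-v2.csv": (
        "variable_name,scope,xpath\n"
        "EXAMPLE_VAR_C,EZ,/Return/ReturnData/IRS990EZ/ExampleC\n"
        "EXAMPLE_VAR_D,PZ,/Return/ReturnData/IRS990/ExampleD\n"
    ),
}


def load_parts(part_files):
    """Concatenate the rows of every per-part CSV, remembering the source file."""
    rows = []
    for filename, text in part_files.items():
        for row in csv.DictReader(io.StringIO(text)):
            row["source_file"] = filename
            rows.append(row)
    return rows


def select_scope(rows, wanted):
    """Keep only rows whose scope flag is in the wanted set."""
    return [row for row in rows if row["scope"] in wanted]


rows = load_parts(PART_FILES)
# Variables usable in a study of 990-EZ filers: EZ-only plus shared (PZ).
ez_study = select_scope(rows, {"EZ", "PZ"})
print([row["variable_name"] for row in ez_study])
```

With scope in its own column, picking variables for a study becomes a plain filter rather than a parse of the variable name.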

We have all sections of the 990 (Part I to Part XII) complete, and a handful of schedules.

Working on finishing up schedules this summer, and getting started on foundation files (the 990-PF).

Then just need to update all of the documentation as part of the release.

If you are actively using the concordance I can share these with you directly before the official release. Just let me know.

frisson commented 4 years ago

Hey @lecy,

Thanks so much for sharing the preview. It's very very useful. I've successfully used the preview to parse and extract the fields for a number of IRS990 returns from the 2018 return index. Are there any previews handy for the IRS990EZ return types?

Example mapping/transform for reference: https://gist.github.com/frisson/f9eaf2f4ea60ee5de694114c0a26e3e3
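
For readers who don't want to open the gist, the general idea can be sketched with the standard library alone. This is not the code from the gist: the XML fragment, the variable names, the xpaths and the helpers `to_relative` and `extract` are all made up for illustration, and the sketch assumes the return uses the `http://www.irs.gov/efile` namespace and that each variable may map to several candidate xpaths.

```python
import xml.etree.ElementTree as ET

# Assumption: elements in the return live in this namespace.
IRS_NS = {"irs": "http://www.irs.gov/efile"}

# Made-up fragment for illustration; real returns are far larger.
SAMPLE_RETURN = """<Return xmlns="http://www.irs.gov/efile" returnVersion="2018v3.1">
  <ReturnData>
    <IRS990>
      <ExampleA>123</ExampleA>
    </IRS990>
  </ReturnData>
</Return>"""

# Hypothetical concordance slice: variable name -> candidate xpaths,
# since the same variable can sit at different xpaths across versions.
MAPPING = {
    "EXAMPLE_VAR_A": [
        "/Return/ReturnData/IRS990/OldExampleA",
        "/Return/ReturnData/IRS990/ExampleA",
    ],
    "EXAMPLE_VAR_B": ["/Return/ReturnData/IRS990/ExampleB"],
}


def to_relative(xpath):
    """Turn an absolute xpath into a namespaced path relative to the root.

    ElementTree cannot evaluate absolute paths on an element, so the leading
    root step is dropped and each remaining step gets the "irs:" prefix.
    """
    steps = [step for step in xpath.split("/") if step]
    return "/".join("irs:" + step for step in steps[1:])


def extract(root, mapping):
    """Return {variable_name: text of the first matching xpath, or None}."""
    record = {}
    for name, xpaths in mapping.items():
        record[name] = None
        for xpath in xpaths:
            node = root.find(to_relative(xpath), IRS_NS)
            if node is not None:
                record[name] = (node.text or "").strip()
                break
    return record


root = ET.fromstring(SAMPLE_RETURN)
record = extract(root, MAPPING)
print(root.get("returnVersion"), record)
```

Variables with no matching xpath come back as `None`, which makes gaps in the mapping easy to spot when checking a batch of returns.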

frisson commented 4 years ago

hey @lecy, just pinging you here to let you know i'm back from account review purgatory.

frisson commented 3 years ago

Hey @lecy, i hope all is well. Just wanted to bump this thread.

lecy commented 3 years ago

Let me get back to you tonight - have a bunch of files to share (just wrapping up the 990, 990-ez, and schedules).