Nonprofit-Open-Data-Collective / irs-990-data-issue-tracker

A place to aggregate questions about IRS 990 data access, documentation, meta-data, and inconsistencies or errors. This is NOT a forum for questions on analyzing the data. Contributors are volunteer experts, not IRS personnel.
https://nonprofit-open-data-collective.github.io/irs-990-data-issue-tracker/

Some of the XML files uploaded in 2023 are malformed #5

Open HFAwesomeCharts opened 1 year ago

HFAwesomeCharts commented 1 year ago

We ran into at least two malformed XML files among the 990PF returns, one each in the 2020 and 2021 filing years.

The problem only surfaced when we pulled the data into our JSON files, and we had to remove the bad records from those files by hand before we could use them. These are the object_ids of the malformed files we encountered:

OBJECT ID = 202122159349101002 (for 2020 tax year)
OBJECT ID = 202211579349100411 (for 2021 tax year)
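For anyone who wants to check a filing directly, here is a minimal sketch of a well-formedness test that looks at the XML itself, before any JSON conversion. It assumes the filing has already been downloaded and read into memory as bytes; the function name `check_wellformed` and the two tiny sample documents are hypothetical stand-ins, not real filings.

```python
import xml.etree.ElementTree as ET


def check_wellformed(xml_bytes):
    """Return (True, "") if xml_bytes parses as XML, else (False, message)."""
    try:
        ET.fromstring(xml_bytes)
        return True, ""
    except ET.ParseError as err:
        # The error message includes the line and column where parsing stopped.
        return False, str(err)


# Hypothetical stand-ins for downloaded filings (not real 990PF content).
good = b"<Return><EIN>000000000</EIN></Return>"
bad = b"<Return><EIN>000000000</Return>"  # <EIN> is never closed

print(check_wellformed(good))
print(check_wellformed(bad))
```

If a flagged filing passes this check, the XML is at least well-formed, and the problem is more likely to be in a later processing step.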

HFAwesomeCharts commented 1 year ago

Update on this: we re-pulled the data from the IRS in late July, which included a number of new filings, and we ran into malformed filings in the new data set as well. Again, there was one each in 2020 and 2021. Interestingly, the OBJECT IDs were different this time, but they were on the same row numbers of our JSON files as when we last built them in June. So now we're wondering whether the problem isn't specific malformed XML filings, but something about how the filings behave when they are pulled into JSON files.

Anyway, the two bad IDs we found in our late July JSON file builds were:

OBJECT ID = 202122159349100817, EIN = 226908681 (row 52,572, for 2020 tax year)
OBJECT ID = 202211309349102006, EIN = 436069388 (row 39,389, for 2021 tax year)
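One way to test the "same row, different filing" idea is to record which row numbers fail on each build and compare them. The sketch below assumes the JSON files hold one record per line, which may not match how our files or anyone else's are actually laid out; `find_bad_rows` and the sample records are hypothetical.

```python
import json


def find_bad_rows(lines):
    """Return (row_number, error_message) for every line that is not valid JSON."""
    bad = []
    for row, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines rather than flagging them
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            bad.append((row, err.msg))
    return bad


# Hypothetical records; the second one is cut off mid-value.
sample = [
    '{"object_id": "A", "ein": "1"}',
    '{"object_id": "B", "ein": ',
    '{"object_id": "C", "ein": "3"}',
]

print(find_bad_rows(sample))
```

If the failing row numbers match across builds while the OBJECT IDs on those rows change, that would point toward the conversion step rather than toward particular XML filings.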

I think @lecy said that others may have run into malformed XML filings; maybe they have other examples of specific IDs that were problematic?