Nonprofit-Open-Data-Collective / irs-efile-master-concordance-file

The Master Concordance File defines standards and provides documentation necessary to build structured databases from the IRS E-File XML files posted on AWS.
https://nonprofit-open-data-collective.github.io/irs-efile-master-concordance-file/

MCF: Truncated variable_names in "minus year" variables. #1

Closed jsfenfen closed 6 years ago

jsfenfen commented 6 years ago

In many cases, the variable_name is identical for variables that refer to different prior years. In some cases the "minus one year" entry is singular and the "minus two years" entry is plural.

Here's an example:

SA_02_PZ_ARDPCTYMYEAR | Current tax year minus one year
SA_02_PZ_ARDPCTYMYEAR | Current tax year minus two years
SA_02_PZ_ARDPCTYMYEAR | Current tax year minus three years
SA_02_PZ_ARDPCTYMYEAR | Current tax year minus four years
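A quick way to surface collisions like this one is to group rows by variable_name and keep any name that appears more than once. This is only a sketch: the `rows` list is typed in by hand from the example above rather than read from the concordance file, the `find_duplicate_names` helper is hypothetical, and the last row is a made-up control.

```python
from collections import defaultdict

# Rows copied from the example above as (variable_name, description).
# The final row is a made-up control with a unique name.
rows = [
    ("SA_02_PZ_ARDPCTYMYEAR", "Current tax year minus one year"),
    ("SA_02_PZ_ARDPCTYMYEAR", "Current tax year minus two years"),
    ("SA_02_PZ_ARDPCTYMYEAR", "Current tax year minus three years"),
    ("SA_02_PZ_ARDPCTYMYEAR", "Current tax year minus four years"),
    ("HYPOTHETICAL_UNIQUE_VAR", "Made-up row whose name is not shared"),
]


def find_duplicate_names(rows):
    """Return {variable_name: [descriptions]} for names shared by several rows."""
    by_name = defaultdict(list)
    for name, description in rows:
        by_name[name].append(description)
    return {name: descs for name, descs in by_name.items() if len(descs) > 1}


duplicates = find_duplicate_names(rows)
for name, descs in duplicates.items():
    print(f"{name} is shared by {len(descs)} rows")
    for desc in descs:
        print(f"  {desc}")
```

In practice the rows would come from the concordance itself; the column layout assumed here (name, then description) is only for illustration.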

jsfenfen commented 6 years ago

Here's a text file of all the ones that I could find. I haven't verified that these are all problems, but at first blush it looks like they are.

Elsewhere these vars have the number of minus years as a suffix; maybe they got lopped off?

minus_year_issue.txt

jsfenfen commented 6 years ago

Actually, there are more; the list above only included those with the word 'minus' in the description.

Hey @miguelabarbosa, this looks to be a systemic issue in variable name generation, whereby differences further up the tree are ignored? When I did this, my approach was to generate all names in a given table using the last part of the xpath and then, if they weren't unique, use the last two parts, repeating until they were unique enough...
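The approach described above could look roughly like the following. It is a minimal sketch, not the code used for the concordance: the `unique_names` function, the underscore joiner, and the example xpaths are all made up for illustration. Each name starts from the last xpath segment, and only colliding names are extended one segment further up the tree per pass.

```python
from collections import Counter


def unique_names(xpaths):
    """Name each xpath by a trailing run of segments, extended until names are unique."""
    segments = [xp.strip("/").split("/") for xp in xpaths]
    depth = [1] * len(segments)  # number of trailing segments each name uses

    def name(i):
        return "_".join(segments[i][-depth[i]:])

    while True:
        counts = Counter(name(i) for i in range(len(segments)))
        # Colliding entries that still have unused segments further up the tree.
        extendable = [
            i for i in range(len(segments))
            if counts[name(i)] > 1 and depth[i] < len(segments[i])
        ]
        if not extendable:
            break  # all names unique, or nothing left to extend
        for i in extendable:
            depth[i] += 1
    return [name(i) for i in range(len(segments))]


# Made-up xpaths for one table; real IRS xpaths are longer.
example_xpaths = [
    "/Form/MinusOneYear/Amount",
    "/Form/MinusTwoYears/Amount",
    "/Form/Total",
]
names = unique_names(example_xpaths)
print(names)
```

With these made-up inputs, the two `Amount` entries pick up their parent segment while `Total` keeps its short name. Fully identical xpaths cannot be separated this way, so the loop stops once no colliding entry has segments left to add.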

more_minus_years.txt

jsfenfen commented 6 years ago

Partial fix here: https://github.com/Nonprofit-Open-Data-Collective/irs-efile-master-concordance-file/pull/11/commits/1cbb52a272b55b21ea89563e7470f810e4b05b62

jsfenfen commented 6 years ago

Closing; will handle remnants in #12.