Nonprofit-Open-Data-Collective / irs-efile-master-concordance-file

The Master Concordance File defines standards and provides documentation necessary to build structured databases from the IRS E-File XML files posted on AWS.
https://nonprofit-open-data-collective.github.io/irs-efile-master-concordance-file/

Clear definition of 'canonical' version #17

Closed by jsfenfen 6 years ago

jsfenfen commented 6 years ago

Folks have generally said that version 2016v3.0 is the 'canonical' version, but that's obviously not the case for variables that don't exist in 2016v3.0. (It sorta looks like a 2015 version to me, in part because the new header stuff is missing, but I dunno).

For a variable that existed from 2010-2012, what would the canonical version be? Would it be the most recent version or the first one it appeared in? Has that been delineated somewhere (and apologies if it has)?

If I knew that, I could attach missing location codes, but without knowing which one to use I think you'd have to eyeball it? This doesn't matter 99 of 100 times, but it's annoying to get wrong.

borenstein commented 6 years ago

I'd probably go with the most commonly attested version. You can get that from the "step 2" materials from the in-person validatathon.

jsfenfen commented 6 years ago

Thanks @borenstein! Do you have a link to that? You mean as a way to fix the location codes? I think that would make sense.

That said, I'm skeptical of usage as an algorithmic "deciding factor" because it changes, right [I think that's the sense in which you mean attested]? Moreover, it's not explicit--I have to know all sorts of external stuff rather than picking something that's available with just the data in front of me.

borenstein commented 6 years ago

Oh, I just meant the canonical version should be the most common version. The spreadsheet contains preview data, so I'll email it to you.
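
A minimal sketch of the "most common version" rule, assuming usage is available as one (variable, version) pair per filing in which the variable was seen. The variable names, version strings, and counts below are invented for illustration and do not come from the validatathon spreadsheet.

```python
from collections import Counter

# Hypothetical observations: one (variable, schema_version) pair per filing
# in which the variable was seen. All values are illustrative only.
observations = [
    ("VAR_A", "2013v3.0"),
    ("VAR_A", "2013v3.0"),
    ("VAR_A", "2016v3.0"),
    ("VAR_B", "2010v3.2"),
    ("VAR_B", "2012v2.0"),
    ("VAR_B", "2012v2.0"),
]


def most_common_version(observations):
    """Map each variable to the version it was observed in most often."""
    counts = {}
    for variable, version in observations:
        counts.setdefault(variable, Counter())[version] += 1
    # Ties go to whichever version was seen first, which is arbitrary.
    return {var: c.most_common(1)[0][0] for var, c in counts.items()}


canonical = most_common_version(observations)
```

The result depends on which filings are counted, so it can shift as more returns are added; that is the concern about usage as a deciding factor raised earlier in the thread.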

jsfenfen commented 6 years ago

My vote's going to be for the last version that a variable appears in. I end up grouping discontinued variables by vintage, in part because that's when I notice them most. But also the test for variable "currency" is if it equals the current version.
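
A rough sketch of that rule, assuming each variable comes with the list of schema versions it appears in and that versions sort chronologically by (year, major, minor). The variable names and version lists are hypothetical, and `CURRENT_VERSION` is set to 2016v3.0 only because that is the version named at the top of this thread.

```python
import re

# Assumption: the version treated as "current" in this thread.
CURRENT_VERSION = "2016v3.0"


def version_key(version):
    """Turn '2012v2.0' into (2012, 2, 0) so versions can be ordered."""
    year, major, minor = re.fullmatch(r"(\d{4})v(\d+)\.(\d+)", version).groups()
    return int(year), int(major), int(minor)


# Hypothetical mapping of variable -> versions it appears in (unordered).
appears_in = {
    "VAR_A": ["2013v3.0", "2016v3.0", "2014v5.0"],
    "VAR_B": ["2010v3.2", "2012v2.0", "2011v1.2"],
}


def canonical_version(versions):
    """Pick the last (highest-sorting) version a variable appears in."""
    return max(versions, key=version_key)


canonical = {var: canonical_version(vs) for var, vs in appears_in.items()}

# "Currency" test: a variable is current if its canonical version is the current one.
is_current = {var: v == CURRENT_VERSION for var, v in canonical.items()}
```

Unlike a usage count, this needs only the version lists already in the data, and discontinued variables fall out naturally as the ones where `is_current` is false.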

lecy commented 6 years ago

Jacob, "the last version that a variable appears in" sounds right.

The location code is a variable-level (as opposed to xpath-level) attribute to include in the data dictionary so the user can look up the field on the 990 form if necessary.

If you are referencing old forms, just note that somehow.

jsfenfen commented 6 years ago

I think that's clear enough, thanks @lecy

borenstein commented 6 years ago

Just to chime in here--I actually think we should make location code Xpath-specific, or at least have each location for each variable spelled out in some way.

lecy commented 6 years ago

If that is valuable info, we could just add the schema_location attribute as a separate column? Jacob has a nice way to extract these from the schemas.

| xpath | line_number |
| --- | --- |
| /IRS990/AccountantCompileOrReview | [AccountantCompileOrReview] Part XI Line 2a |
| /IRS990/AccountantCompileOrReview | [AccountantCompileOrReview] Part XII Line 2a |
| /IRS990/AccountantCompileOrReviewBasis/FinancialStatementBoth | [AccountantCompileOrReviewBasis] Part XII Line 2a; [FinancialStatementBoth] Part XII Lines 2a and 2b |
| /IRS990/AccountantCompileOrReviewBasis/FinancialStatementConsolidated | [AccountantCompileOrReviewBasis] Part XII Line 2a; [FinancialStatementConsolidated] Part XII Lines 2a and 2b |
| /IRS990/AccountantCompileOrReviewBasis/FinancialStatementSeparate | [AccountantCompileOrReviewBasis] Part XII Line 2a; [FinancialStatementSeparate] Part XII Lines 2a and 2b |
| /IRS990/AccountantCompileOrReviewInd | [AccountantCompileOrReviewInd] Part XII Line 2a |

borenstein commented 6 years ago

Great, yes, let's please add that. Jacob, is the logic to extract these open-source?

jsfenfen commented 6 years ago

Hey @borenstein, the logic will be open-source (am travelling till next week). The source files are line_numbers.csv and descriptions.csv here: https://github.com/jsfenfen/shared_irs_docs
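
As a sketch of the xpath-specific location column discussed above, the snippet below collects every line number attested for an xpath. It uses only rows quoted in this thread; the column names `xpath` and `line_number` are taken from that table's header, and the tab-separated layout is an assumption, since the actual format of line_numbers.csv is not shown here.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the table in this thread. The tab-separated layout is an
# assumption for illustration, not the real format of line_numbers.csv.
SAMPLE = (
    "xpath\tline_number\n"
    "/IRS990/AccountantCompileOrReview\t[AccountantCompileOrReview] Part XI Line 2a\n"
    "/IRS990/AccountantCompileOrReview\t[AccountantCompileOrReview] Part XII Line 2a\n"
    "/IRS990/AccountantCompileOrReviewInd\t[AccountantCompileOrReviewInd] Part XII Line 2a\n"
)


def locations_by_xpath(text):
    """Collect the distinct form locations listed for each xpath, in file order."""
    locations = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        if row["line_number"] not in locations[row["xpath"]]:
            locations[row["xpath"]].append(row["line_number"])
    return dict(locations)


schema_location = locations_by_xpath(SAMPLE)
```

Here `/IRS990/AccountantCompileOrReview` ends up with two locations (Part XI and Part XII), which is the case a single variable-level location code cannot express.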