Big-Life-Lab / PHES-ODM-Validation

A toolkit to assist in validating whether data conforms to the PHES-ODM dictionary.
https://validate-docs.phes-odm.org/
Creative Commons Attribution 4.0 International

docs: add spec for summary report #182

Closed zargot closed 1 year ago

zargot commented 1 year ago

fixes #74

adds spec:

zargot commented 1 year ago

I should probably include in the spec how the summary report indicates that a table is valid, unless that's something only the validation tool needs? (#149)

zargot commented 1 year ago

The module-functions.qmd file looks good but I'm looking for more from the specifications.

Looking for something similar to the specifications for the rules,

  • Text explaining why the function is needed
  • Explaining how it will work with examples that can be used as test cases.

I made an attempt at addressing this now.

Also, this was not in the task but I think we discussed allowing the user to summarize by tables, rows, and/or columns.

I'm not sure how we would implement this or how useful it would be. We currently summarize by table, which is fine and logical. Summarizing by rows sounds like it would undo the summarization, since aggregating rows is its main feature. I'm not sure about columns. I suppose we could prefix them with table names, so it becomes like 'sites.siteID: : '; otherwise I think it would be confusing. Should we postpone it?

yulric commented 1 year ago

The way I see it, the function would summarize by table by default, but the user would also have the ability to further summarize by rows or columns (but not both). For example, if the addresses table has the following 4 errors,

  1. greater_than_max_length in row 1 in column addressID
  2. missing_value_found in row 1 in column addL1
  3. duplicate_entries_found in rows [1,2] in column addressID
  4. greater_than_max_length in row 2 in column addressID

By default the summary would be,

# Addresses

4 errors

greater_than_max_length: 2
missing_value_found: 1
duplicate_entries_found: 1
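
For illustration, here is a minimal Python sketch of a default per-table summary. The record fields (`table`, `type`, `rows`, `column`) and the helper names `summarize_table` and `render` are hypothetical, not the toolkit's API. The sketch assumes each listed error counts once, even when it spans several rows.

```python
from collections import Counter

# Hypothetical error records; the field names are assumptions for this sketch.
errors = [
    {"table": "addresses", "type": "greater_than_max_length", "rows": [1], "column": "addressID"},
    {"table": "addresses", "type": "missing_value_found", "rows": [1], "column": "addL1"},
    {"table": "addresses", "type": "duplicate_entries_found", "rows": [1, 2], "column": "addressID"},
    {"table": "addresses", "type": "greater_than_max_length", "rows": [2], "column": "addressID"},
]


def summarize_table(errors, table):
    """Count errors per error type for one table, one count per error record."""
    return Counter(e["type"] for e in errors if e["table"] == table)


def render(title, counts, level=1):
    """Format counts as a heading, a total line, and one line per error type."""
    total = sum(counts.values())
    noun = "error" if total == 1 else "errors"
    lines = ["#" * level + " " + title, "", f"{total} {noun}", ""]
    lines += [f"{name}: {n}" for name, n in counts.items()]
    return "\n".join(lines)


print(render("Addresses", summarize_table(errors, "addresses")))
```

Keeping the counting step separate from the rendering step means the same counts could feed both a summary object and a printout.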

If the user also wants to summarize by rows:

# Addresses

## Row 1

3 errors

greater_than_max_length: 1
duplicate_entries_found: 1
missing_value_found: 1

## Row 2

2 errors

duplicate_entries_found: 1
greater_than_max_length: 1

and if the user wants to also summarize by column:

# Addresses

## addressID

3 errors

greater_than_max_length: 2
duplicate_entries_found: 1

## addL1

1 error

missing_value_found: 1

I think this is useful for users who want it, especially if a table is very big and they want to drill down to see whether the errors are localized in particular columns or rows.
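
The row and column breakdowns above could be computed with a grouping step like the sketch below. The tuple layout and the function name `summarize_by` are hypothetical, not part of the toolkit. The sketch assumes an error that spans several rows is counted once for each row it touches, and once for its column.

```python
from collections import Counter, defaultdict

# Hypothetical (error_type, rows, column) tuples for the addresses table.
errors = [
    ("greater_than_max_length", [1], "addressID"),
    ("missing_value_found", [1], "addL1"),
    ("duplicate_entries_found", [1, 2], "addressID"),
    ("greater_than_max_length", [2], "addressID"),
]


def summarize_by(errors, by):
    """Group error counts by 'row' or by 'column' (one stratification at a time)."""
    if by not in ("row", "column"):
        raise ValueError("by must be 'row' or 'column'")
    groups = defaultdict(Counter)
    for error_type, rows, column in errors:
        # A multi-row error contributes to every row it touches,
        # but only once to its column.
        keys = rows if by == "row" else [column]
        for key in keys:
            groups[key][error_type] += 1
    return {key: dict(counts) for key, counts in groups.items()}


by_row = summarize_by(errors, "row")
by_column = summarize_by(errors, "column")
```

With these inputs, `by_row[1]` holds three errors and `by_row[2]` holds two, while `by_column["addressID"]` holds three and `by_column["addL1"]` holds one.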

As for implementation, I'm happy to talk about it if you can't visualize it. I have more experience with these kinds of stratifications from other projects.

yulric commented 1 year ago

Also, I took a brief look at your latest commit. Can you also add examples, like those in the validation rules, so I can see what the output looks like and know what you will be testing? Specifically,

  1. What the summary object will look like. Ignore the functions in it.
  2. What the summary print out will look like.

zargot commented 1 year ago

@yulric Sounds good, I've made new changes.

yulric commented 1 year ago

@zargot I pushed a commit to the high-level spec to make things clearer. Feel free to rebase to clean up the history.