docs: add spec for summary report

zargot commented 1 year ago

fixes #74

Adds specs:

- summary-report.qmd
- summarize-report-function.qmd, with new section in module_function.qmd
- summarize-tool.qmd
- validate-tool.qmd

zargot commented 1 year ago

I should probably specify how the summary report indicates that a table is valid, unless that's something only the validation tool needs? (#149)

zargot commented 1 year ago

The module-functions.qmd file looks good but I'm looking for more from the specifications.

Looking for something similar to the specifications for the rules:

- Text explaining why the function is needed
- An explanation of how it will work, with examples that can be used as test cases

I made an attempt at addressing this now.

Also, this was not in the task, but I think we discussed allowing the user to summarize by tables, rows, and/or columns.

I'm not sure how we would implement this or how useful it would be. We currently summarize by tables, which is fine and logical. Summarizing by rows sounds like it would undo the summarization, since aggregating rows is its main feature. I'm not sure about columns. I suppose we could prefix them with table names, so it becomes like 'sites.siteID: : '; otherwise I think it would be confusing. Should we postpone it?

yulric commented 1 year ago

The way I see it, the function would summarize by table by default, but the user would also have the ability to further summarize by rows or columns (but not both). For example, if the addresses table has the following 4 errors,

greater_than_max_length in row 1 in column addressID
missing_value_found in row 1 in column addL1
duplicate_entries_found in rows [1,2] in column addressID
greater_than_max_length in row 2 in column addressID

By default the summary would be,

# Addresses

5 errors

greater_than_max_length: 2
missing_value_found: 1
duplicate_entries_found: 2

If the user also wants to summarize by rows:

# Addresses

## Row 1

3 errors

greater_than_max_length: 1
duplicate_entries_found: 1
missing_value_found: 1

## Row 2

2 errors

duplicate_entries_found: 1
greater_than_max_length: 1

And if the user also wants to summarize by columns:

# Addresses

## addressID

3 errors

greater_than_max_length: 2
duplicate_entries_found: 1

## addL1

1 error

missing_value_found: 1

I think this is useful for a user to have as an option, especially if a table is very big and they want to drill down into whether the errors are localized in particular columns or rows.

As for the implementation, I'm happy to talk about it if you can't visualize it. I have more experience with this kind of stratification from other projects.
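To make the stratification concrete, here is a minimal Python sketch of the idea, not the project's actual implementation. The record layout (`rule`, `table`, `column`, `rows`), the `summarize` function, and its `by` parameter are all hypothetical names chosen for illustration. It also assumes a counting rule inferred from the figures above rather than from the spec: an error spanning several rows counts once per affected row in the table and row views, and once per column in the column view.

```python
from collections import Counter, defaultdict

# Hypothetical error records mirroring the four addresses errors listed above.
ERRORS = [
    {"rule": "greater_than_max_length", "table": "addresses",
     "column": "addressID", "rows": [1]},
    {"rule": "missing_value_found", "table": "addresses",
     "column": "addL1", "rows": [1]},
    {"rule": "duplicate_entries_found", "table": "addresses",
     "column": "addressID", "rows": [1, 2]},
    {"rule": "greater_than_max_length", "table": "addresses",
     "column": "addressID", "rows": [2]},
]


def summarize(errors, by=None):
    """Count errors per rule, grouped by table and optionally by row or column.

    `by` is None (table only), "row", or "column"; never both at once.
    """
    if by not in (None, "row", "column"):
        raise ValueError("by must be None, 'row', or 'column'")

    counts = defaultdict(Counter)
    for err in errors:
        table = err["table"]
        if by == "column":
            # Assumption: one count per error in the column view.
            keys = [(table, err["column"])]
        elif by == "row":
            # One count for every row the error touches.
            keys = [(table, row) for row in err["rows"]]
        else:
            # Assumption: the table view also counts once per affected row.
            keys = [(table,)] * len(err["rows"])
        for key in keys:
            counts[key][err["rule"]] += 1

    return {key: dict(rules) for key, rules in counts.items()}


if __name__ == "__main__":
    for mode in (None, "row", "column"):
        print(f"by={mode}")
        for key, rules in summarize(ERRORS, by=mode).items():
            print(" ", key, sum(rules.values()), "errors", rules)
```

Under these assumptions the sketch reproduces the counts in the three example summaries above. Keeping the grouping key as a tuple means the table, row, and column views share one code path and differ only in how the key is built.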

yulric commented 1 year ago

Also, I took a brief look at your latest commit. Can you also put in examples like the ones in the validation rules, so I can see what the output looks like and know what you will be testing? Specifically:

- What the summary object will look like. Ignore the functions in it.
- What the summary printout will look like.

zargot commented 1 year ago

@yulric. Sounds good, I've made new changes.

yulric commented 1 year ago

@zargot I pushed a commit to the high-level spec to make things clearer. Feel free to rebase to clean up the history.

Big-Life-Lab / PHES-ODM-Validation

docs: add spec for summary report #182