Princeton-CDH / pemm-scripts

scripts & tools for the Princeton Ethiopian Miracles of Mary project
Apache License 2.0

As a data manager, I want to see alerts when the data does not pass validation checks so that I can fix validation problems quickly. #72

Closed: rlskoeser closed this issue 4 years ago

rlskoeser commented 4 years ago

dev notes

kmcelwee commented 4 years ago

Addressed in https://github.com/Princeton-CDH/pemm-data/pull/6

kmcelwee commented 4 years ago

@elambrinaki This one is tricky to test. Don't feel limited by the checklist I've provided below.

Currently test messages are going to the ~~#slack-webhook-test~~ #pemm channel in the CDH slack.

elambrinaki commented 4 years ago

@kmcelwee To check that the format of the error summary matches the Slack message, I want to view the goodtables reports with the new setup of 20 error messages shown. How do I run the test? (In the #pemm slack channel, there are links to reports with 5 error messages.)

kmcelwee commented 4 years ago

@elambrinaki I proposed the change to 20 on a branch, but just waiting on approval from @rlskoeser. Rebecca, for something small like this, should I merge w/o your approval? Or is it always safe to just wait?

rlskoeser commented 4 years ago

@kmcelwee I apologize, I didn't see a review request. For trivial changes, I'm not even sure they need a PR. Changing a config value like this without changing any code logic doesn't require review.

kmcelwee commented 4 years ago

When goodtables is fed improperly formatted JSON, it exits as if everything passed, creates the output file saying it has 0 errors, and only surfaces a warning in its output JSON:

```json
"tables": [],
    "warnings": [
        "Data Package \"datapackage.json\" has a loading error \"Unable to parse JSON at \"datapackage.json\". Expecting ',' delimiter: line 596 column 21 (char 23272)\""
    ],
```

Option 1: Precede everything with the following command:

```shell
python -m json.tool < datapackage.json
```

If the file is invalid, this would stop the validation from executing. But no messages would be sent to Slack unless it was a pull request, and Evgeniia will probably only make direct commits.

The errors are output to the shell, and while they're viewable, you'd have to know where to look.
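A minimal stdlib sketch of this fail-fast check (the filename default and the `(ok, message)` return convention are my assumptions, not project code):

```python
import json


def check_json(path="datapackage.json"):
    """Return (ok, message), failing fast like `python -m json.tool` would."""
    try:
        with open(path) as f:
            json.load(f)
    except json.JSONDecodeError as err:
        # err includes the location, e.g. "Expecting ',' delimiter: line 596 column 21"
        return False, f"{path} is not valid JSON: {err}"
    return True, ""
```

A wrapper script could call this before invoking goodtables and exit nonzero (or post the message somewhere visible) when `ok` is false.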

Option 2: We could look for the warning under `"warnings"` in the output JSON and send a Slack message saying the JSON wasn't formatted correctly.

But I even went into goodtables's documentation and couldn't find where or how they list their warnings. It's going to be something messy like:

```python
warnings = error_json.get('warnings', [])
if any(warning.startswith('Data Package "datapackage.json" has a loading error')
       for warning in warnings):
    slack_message = '\n'.join(warnings)
```

@rlskoeser thoughts?

rlskoeser commented 4 years ago

Wow, their warning output really isn't that great, is it. And why is that a warning and not an outright error?

@kmcelwee could you revise your python script to detect nonstandard json output (absence of any validation information, maybe?) and then check the datapackage.json validity and send a slack alert with the json parsing error? (Probably any case where we don't get validation output should be reported as an error / failure.)
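One possible shape for that check, assuming the goodtables report has already been loaded as a dict and that some separate helper posts the returned message to Slack (the key names are taken from the output shown above; everything else is an assumption, not the project's actual script):

```python
import json


def missing_validation_alert(report, package_path="datapackage.json"):
    """Return an alert message when goodtables produced no validation output.

    Returns None when the report contains real per-table results.
    """
    if report.get("tables"):
        return None  # normal validation output; handle row errors elsewhere
    # No tables were validated: re-parse the datapackage for a precise error.
    try:
        with open(package_path) as f:
            json.load(f)
    except json.JSONDecodeError as err:
        return f"Could not parse {package_path}: {err}"
    # Valid JSON but still no tables: surface whatever goodtables warned about.
    warnings = report.get("warnings", [])
    return "goodtables produced no validation output:\n" + "\n".join(warnings)
```

This treats any report with an empty `"tables"` list as a failure, which matches the suggestion that missing validation output should always be reported.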

elambrinaki commented 4 years ago

@kmcelwee Thank you for finding my error!

elambrinaki commented 4 years ago

@kmcelwee Thank you very much for setting up the goodtables reports! With their help, I corrected several errors on the Manuscript and Canonical Story sheets. For these two sheets, I confirm the error-summary is clear and matches the Slack message.

For the Story Instance sheet with 765 errors, having 20 errors shown in the report is not enough. Is it possible to increase the number of errors shown for a specific sheet only? If so, I would like to view 1,000 errors on the Story Instance sheet and 20 errors on other sheets. Alternatively, is it possible to change the order in which the errors are reported from extra-header, duplicate-row, type-or-format-error, pattern-constraint to type-or-format-error, pattern-constraint, extra-header, duplicate-row?

A clarification regarding the large number of duplicate rows on the Story Instance sheet: for the mss that are currently being cataloged, I added identical rows with the ms name and formulas, but without unique information (ID, folios, etc.) -- that will be filled in by catalogers. So duplicate rows are expected at the current stage of work.

rlskoeser commented 4 years ago

@elambrinaki I noticed this when I was starting to set up the validation. Two options that have occurred to me so far:

  1. Is there a simple way for you to make these identical rows unique? e.g., could you fill in preliminary miracle numbers now that can be checked/refined when they are cataloged?
  2. We can turn off the duplicate row check in the goodtables validation — but as far as I can tell there's no way to do this for a single sheet, we would be turning it off for all sheets. (If we go this route, we'll want to make sure it's configurable so that we can easily turn it off later once those manuscripts have been cataloged.)
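If option 2 were chosen, one way to keep it configurable is to drive goodtables's `skip_checks` option from an environment variable. This is only a sketch under assumptions: the env var name is hypothetical, and it assumes the script calls goodtables-py's `validate()`, which accepts a `skip_checks` list.

```python
import os


def skip_checks():
    """Checks to disable, driven by a (hypothetical) env var so the
    duplicate-row check can be re-enabled once cataloging is done."""
    if os.environ.get("PEMM_SKIP_DUPLICATE_ROW") == "1":
        return ["duplicate-row"]
    return []


# Hypothetical usage in the validation script:
# report = goodtables.validate("datapackage.json", skip_checks=skip_checks())
```

Flipping the variable off later restores the duplicate-row check without touching code.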

@kmcelwee have you thought of any other ways to handle this?

FWIW, I prefer option 1.

elambrinaki commented 4 years ago

@rlskoeser @kmcelwee Is increasing the number of reported errors to 1000 a possible solution?

rlskoeser commented 4 years ago

@elambrinaki oh, I didn't comment on that part of your request but Kevin and I discussed it on Slack! Sorry.  I don't think the number of errors reported in the build should be capped. Kevin's going to work on making that config optional and then turn it off for you, so you'll see all the errors. (FWIW, I don't want him spending additional dev time on sorting the errors.)

kmcelwee commented 4 years ago

@elambrinaki @rlskoeser I've just addressed ERROR_MAX in my latest commits to https://github.com/Princeton-CDH/pemm-data/pull/12

elambrinaki commented 4 years ago

@rlskoeser @kmcelwee Perfect! Thank you very much!

kmcelwee commented 4 years ago

Closed by...