Closed: rlskoeser closed this issue 4 years ago
Addressed in https://github.com/Princeton-CDH/pemm-data/pull/6
@elambrinaki This one is tricky to test. Don't feel limited by the checklist I've provided below.
Currently test messages are going to the ~~#slack-webhook-test~~ #pemm channel in the CDH slack.
- `error-summary` is viewable under "Artifacts" after following the link
- `error-summary` is clear and properly matches the output in the Slack message

@kmcelwee To check that the format of `error-summary` matches the Slack message, I want to view the goodtables reports with the new setup of 20 error messages shown. How do I run the test? (In the #pemm slack channel, there are links to reports with 5 error messages.)
@elambrinaki I proposed the change to 20 on a branch, but just waiting on approval from @rlskoeser. Rebecca, for something small like this, should I merge w/o your approval? Or is it always safe to just wait?
@kmcelwee I apologize, I didn't see a review request. For trivial changes, I'm not even sure they need a PR. Changing a config value like this without changing any code logic doesn't require review.
When goodtables is fed improperly formatted JSON, it exits as if everything passed, creates an output file saying there are 0 errors, and only emits a warning in its output JSON:
"tables": [],
"warnings": [
"Data Package \"datapackage.json\" has a loading error \"Unable to parse JSON at \"datapackage.json\". Expecting ',' delimiter: line 596 column 21 (char 23272)\""
],
**Option 1**: Precede everything with the following command:

```sh
python -m json.tool < datapackage.json
```
If the JSON is invalid, this would stop the validation from executing. But no messages would be sent to Slack unless it was a pull request, and Evgeniia will probably only make direct commits. The errors are output to the shell, and while they're viewable, you'd have to know where to look.
**Option 2**: We could look for the warning under `"warnings"` in the output JSON and send a Slack message that the JSON wasn't formatted correctly.
But I even went into goodtables's documentation and couldn't find where or how they list their warnings. It's going to be something messy like:
```python
if error_json.get('warnings'):
    if any(warning.startswith('Data Package "datapackage.json" has a loading error')
           for warning in error_json['warnings']):
        slack_message = '\n'.join(error_json['warnings'])
```
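For reference, here's how that check would behave on the sample output above (a sketch only; the exact warning text is whatever goodtables happens to emit):

```python
# The parsed goodtables report from the malformed-datapackage run above.
error_json = {
    "tables": [],
    "warnings": [
        "Data Package \"datapackage.json\" has a loading error \"Unable to "
        "parse JSON at \"datapackage.json\". Expecting ',' delimiter: "
        "line 596 column 21 (char 23272)\""
    ],
}

# Collect only the loading-error warnings, not any other warning type.
loading_errors = [
    warning for warning in error_json.get("warnings", [])
    if warning.startswith('Data Package "datapackage.json" has a loading error')
]
# This string would become the Slack message body.
slack_message = "\n".join(loading_errors)
assert len(loading_errors) == 1
```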
@rlskoeser thoughts?
Wow, their warning output really isn't that great, is it? And why is that a warning and not an outright error?
@kmcelwee could you revise your python script to detect nonstandard JSON output (absence of any validation information, maybe?) and then check the datapackage.json validity and send a Slack alert with the JSON parsing error? (Probably any case where we don't get validation output should be reported as an error/failure.)
@kmcelwee Thank you for finding my error!
@kmcelwee Thank you very much for setting up the goodtables reports! With their help, I corrected several errors on the Manuscript and Canonical Story sheets. For these two sheets, I confirm the `error-summary` is clear and matches the Slack message.
For the Story Instance sheet with 765 errors, having 20 errors shown in the report is not enough. Is it possible to increase the number of errors shown for a specific sheet only? If so, I would like to view 1,000 errors on the Story Instance sheet and 20 errors on the other sheets. Alternatively, is it possible to change the order in which the errors are reported from extra-header, duplicate-row, type-or-format-error, pattern-constraint to type-or-format-error, pattern-constraint, extra-header, duplicate-row?
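If reordering were implemented, a minimal sketch might look like this (hedged: it assumes errors appear as dicts with a `code` field, as in the goodtables JSON report; `sort_errors` is a hypothetical helper):

```python
# Preferred reporting order requested for the Story Instance sheet.
PREFERRED_ORDER = [
    "type-or-format-error",
    "pattern-constraint",
    "extra-header",
    "duplicate-row",
]

def sort_errors(errors):
    """Sort goodtables error dicts by the preferred code order;
    unknown codes sort last, keeping their relative order (sorted is stable)."""
    rank = {code: i for i, code in enumerate(PREFERRED_ORDER)}
    return sorted(errors, key=lambda e: rank.get(e.get("code"), len(rank)))

sample = [
    {"code": "extra-header"},
    {"code": "duplicate-row"},
    {"code": "type-or-format-error"},
]
assert [e["code"] for e in sort_errors(sample)][0] == "type-or-format-error"
```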
A clarification regarding the large number of duplicate rows on the Story Instance sheet: for the mss that are currently being cataloged, I added identical rows with ms name and formulas, but without unique information (ID, folios, etc.) -- this will be done by catalogers. So duplicate rows are expected at the current stage of work.
@elambrinaki I noticed this when I was starting to set up the validation. Two options have occurred to me so far:
@kmcelwee have you thought of any other ways to handle this?
FWIW, I prefer option 1.
@rlskoeser @kmcelwee Is increasing the number of reported errors to 1000 a possible solution?
@elambrinaki oh, I didn't comment on that part of your request but Kevin and I discussed it on Slack! Sorry. I don't think the number of errors reported in the build should be capped. Kevin's going to work on making that config optional and then turn it off for you, so you'll see all the errors. (FWIW, I don't want him spending additional dev time on sorting the errors.)
@elambrinaki @rlskoeser I've just addressed ERROR_MAX in my latest commits to https://github.com/Princeton-CDH/pemm-data/pull/12 ✅
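For the record, making the cap optional could be as simple as this (a sketch: the `ERROR_MAX` name comes from the PR, but the actual config mechanism in pemm-data may differ):

```python
import os

def get_error_limit(env=os.environ):
    """Return the configured error cap as an int, or None for unlimited,
    so a falsy/unset ERROR_MAX means "report all errors"."""
    value = env.get("ERROR_MAX", "").strip()
    return int(value) if value else None

assert get_error_limit({"ERROR_MAX": "20"}) == 20
assert get_error_limit({}) is None  # unset -> report all errors
```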
@rlskoeser @kmcelwee Perfect! Thank you very much!