DistributedProofreaders / ppwb

Post Processor's Workbench
GNU General Public License v3.0
5 stars 6 forks source link

report parsing errors and exit #13

Closed asylumcs closed 3 years ago

asylumcs commented 3 years ago

I didn't write comp_pp.py, nor could I have written it. But there is an open issue "ppcomp fails to parse html file" and I've put together code to close it out. I'm guessing if the comp_pp.py author had made this patch it would be more elegant. What I do is report anything that can't be parsed and which previously caused the program to error out. I report it and then exit cleanly. It now looks like this:

comp_pp parsing error report: ERR_TAG_NAME_MISMATCH near line 68: Unexpected end tag : hr ERR_TAG_NAME_MISMATCH near line 91: Unexpected end tag : hr exiting

My source file had <hr></hr> on lines 68 and 91 to cause this report. As I said, I'm a out of my league with lxml and this code, so if anyone has a better way, please fix it: you can't hurt my feelings. Otherwise, here is my pull request.

cpeel commented 3 years ago

If the parser fails, erroring out is the correct thing. If we want to include the error log in the output, it would be better to do that at line 222 before the SyntaxError is raised.

asylumcs commented 3 years ago

I don't think the syntax error will be raised if I exit after showing what would cause the parser to fail. I want to show that to the user in the simplest way: source line number, description of error, failing tag. I guess I'd know more about Casey's comment if I could see how the code would change.

cpeel commented 3 years ago

I don't think the syntax error will be raised if I exit after showing what would cause the parser to fail. I want to show that to the user in the simplest way: source line number, description of error, failing tag. I guess I'd know more about Casey's comment if I could see how the code would change.

Right. So we should augment the existing SyntaxError exception with the detailed information you want to include:

diff --git a/bin/comp_pp.py b/bin/comp_pp.py
index 8f29ac3..26cf997 100755
--- a/bin/comp_pp.py
+++ b/bin/comp_pp.py
@@ -219,8 +219,11 @@ class SourceFile(object):
                                       if parser.error_log[0].type != 513]

         if len(self.parser_errlog):
-            raise SyntaxError("Parsing errors in document: " +
-                              os.path.basename(name))
+            errors = []
+            for entry in parser.error_log:
+                errors.append(f"{entry.type_name} near line {entry.line}: {entry.message}")
+            raise SyntaxError(f"Parsing errors in document {os.path.basename(name)}: \n"
+                + "\n".join(errors))

         self.tree = tree.getroottree()
         self.text = text.splitlines()

This provides the more detailed information that you are looking for but still raises the SyntaxError (which we want, because it is an error). Example output:

$ ./comp_pp.py --simple-html ~/The_City_of_Comrades.html  test.html
Traceback (most recent call last):
  File "./comp_pp.py", line 1624, in <module>
    main()
  File "./comp_pp.py", line 1617, in main
    x.simple_html()
  File "./comp_pp.py", line 1435, in simple_html
    f.load(fname)
  File "./comp_pp.py", line 697, in load
    self.myfile.load_xhtml(filename, relax=True)
  File "./comp_pp.py", line 225, in load_xhtml
    raise SyntaxError(f"Parsing errors in document {os.path.basename(name)}: \n"
SyntaxError: Parsing errors in document The_City_of_Comrades.html:
ERR_TAG_NAME_MISMATCH near line 186: Unexpected end tag : hr

Note that we don't include the <br>s because this is a command line tool, not a web tool. If we want to ensure that the newlines are converted to <br>s in the ppwb, we should modify the output of the error thusly:

diff --git a/ppcomp-action.php b/ppcomp-action.php
index f065dd4..5f2ff49 100644
--- a/ppcomp-action.php
+++ b/ppcomp-action.php
@@ -84,6 +84,7 @@ if (file_exists("$workdir/result.html")) {
 if ($reportok) {
     echo "Left click to view. Right click to download.</p>";
 } else {
+    $output = nl2br($output);
     echo "<p>Whoops! Something went wrong and no output was generated.
     The error message was<br/><br/>
     <tt>${output}</tt></p>
asylumcs commented 3 years ago

I understand everything you wrote, I think. With my post-processor hat on, especially a newer one who is likely to to have made this kind of mistake, seeing the stack trace will be confusing and scary. They will think they have broken something in the program for sure. If you don't have a stack trace, they will see "oh, there's a problem in my code on these two lines." That's the message I would want to see.

My thinking was this: I didn't want to raise the SyntaxError exception because I didn't want to emit the stack trace. That said, this should be really rare and if your fixes to the PR will allow us to close this issue, lets do that.

cpeel commented 3 years ago

My thinking was this: I didn't want to raise the SyntaxError exception because I didn't want to emit the stack trace. That said, this should be really rare and if your fixes to the PR will allow us to close this issue, lets do that.

If you only want to show the error and not the stacktrace, the right way to do that is to catch the exception in main() and just output the error and exit with a non-zero exit number. This ensures that the underlying functions and classes correctly raise exceptions on error and relies on the caller to handle them. Having a lower-level function exit on error is surprising and unexpected and is not a best practice.

diff --git a/bin/comp_pp.py b/bin/comp_pp.py
index 8f29ac3..4d0b559 100755
--- a/bin/comp_pp.py
+++ b/bin/comp_pp.py
@@ -1610,12 +1613,16 @@ def main():
     args = parser.parse_args()

     x = CompPP(args)
-    if args.simple_html:
-        x.simple_html()
-    else:
-        _, html_content, fn1, fn2 = x.do_process()
+    try:
+        if args.simple_html:
+            x.simple_html()
+        else:
+            _, html_content, fn1, fn2 = x.do_process()

-        output_html(args, html_content, fn1, fn2)
+            output_html(args, html_content, fn1, fn2)
+    except SyntaxError as exception:
+        print(exception)
+        sys.exit(1)

 if __name__ == '__main__':
     main()
asylumcs commented 3 years ago

"Having a lower-level function exit on error is surprising and unexpected and is not a best practice." Yeah, that felt very amateurish. I like your approach better. Line 222 becomes: raise SyntaxError("Parsing errors in document: " + os.path.basename(name) + f" {parser.error_log[0].type_name} near line {parser.error_log[0].line}: {parser.error_log[0].message}") but the rub there is it only reports the first error. What if I want to accumulate a list before raising the Syntax error?

Anyway with your two changes (after justifiably backing out my attempt), The report file has this: Parsing errors in document: tat5.htm ERR_TAG_NAME_MISMATCH near line 68: Unexpected end tag : hr That's all on one line. I am learning the right way to do things but doing them the wrong way first. But it's all progress. So we are closer to what we need but not there yet.

cpeel commented 3 years ago

"Having a lower-level function exit on error is surprising and unexpected and is not a best practice." Yeah, that felt very amateurish. I like your approach better. Line 222 becomes: raise SyntaxError("Parsing errors in document: " + os.path.basename(name) + f" {parser.error_log[0].type_name} near line {parser.error_log[0].line}: {parser.error_log[0].message}") but the rub there is it only reports the first error. What if I want to accumulate a list before raising the Syntax error?

If you'll look closer at my diff again you'll notice that I build a list of all the errors in a for loop and all of them are included in the exception message.

Anyway with your two changes (after justifiably backing out my attempt), The report file has this: Parsing errors in document: tat5.htm ERR_TAG_NAME_MISMATCH near line 68: Unexpected end tag : hr That's all on one line. I am learning the right way to do things but doing them the wrong way first. But it's all progress. So we are closer to what we need but not there yet.

Look again at my diff and note how the lines are being joined via \n which are included in the error message.

asylumcs commented 3 years ago

Ok I studied what you suggested Casey very carefully. I'm slow to learn sometimes but what a rush when it all made sense. It is clearly what should be done. Thanks for making me smarter. Now what? New pull request? Somehow edit this one? New git user ready to take copious notes. Walk me through on Slack or, if you already have a branch where you've updated the two files, do you just kill this one and merge yours instead. I'm glad we at least have what looks like a proper solution.

cpeel commented 3 years ago

Now what? New pull request? Somehow edit this one?

My suggestion is that we continue to iterate on this branch / pull request. Consider making changes to this branch with additional commits with the changes that you want. I will squash all the commits together into a single one when I merge it in so we're free to work on it without worrying about a single commit being "the one".

So that's some flavor of:

git checkout pp_comp_xml
# make changes to files
git add <some set of files>
git commit
git push origin pp_como_xml

After pushing up, ensure that the "Files changed" view of the MR only contains the changes that you intend (this shows just a diff of your branch vs what's in master already -- should be the same as git diff origin/master..pp_comp_xml).

I'm happy to help via Slack, just note that I have a full day of meetings so my responses may be erratic.

asylumcs commented 3 years ago

I did another commit of what I think is the final solution for this. Two files were changed (I hope). I'm not super confident with my coding skills on this one or my git skills to get it ready to merge, but it does seem to work when I test it locally.

cpeel commented 3 years ago

Your changes were spot-on Roger. I, however, made an incorrect assumption about how the error would surface from comp_pp.py. We were combing stdout and stderr together, so the error was being output to the result file and not shown to the user as an error.

I've added a second commit that breaks out the stderr from the stdout and if there is an error we now correctly show it to the user rather than writing it to the results file. My changes are in ppcomp-action.php and base.inc if you want to look at them.

This can be tested in my sandbox at https://www.pgdp.org/~cpeel/ppwb/ppcomp.php