DistributedProofreaders / guiguts

Perl/Tk text editor designed for editing and formatting public domain material for inclusion at Project Gutenberg
GNU General Public License v2.0

The HTML Generator is omitting some pagenums #1272

Closed charliehoward4dp closed 11 months ago

charliehoward4dp commented 1 year ago

The HTML Generator omitted several <span class="pagenum"....> elements. All of the omitted page numbers seem to be present in the .bin of the text file the Generator used, and I think every one of them follows "no count" pages that originally contained mid-paragraph illustrations.

To recreate the problem, open "h01-pregen.html" (actually a .txt file, of course) in GG and run the HTML Generator.

Before doing so, you may want to verify that the soon-to-be-missing page numbers are present in the status bar for "h01-pregen.html" and, of course, in its .bin.

Here is a list of the omitted page numbers. The list was created by running "HTML Link Check" after running the HTML Generator:

Beginning check: Link Check
+#Page_105: Internal link without anchor
+#Page_11: Internal link without anchor
+#Page_123: Internal link without anchor
+#Page_127: Internal link without anchor
+#Page_129: Internal link without anchor
+#Page_135: Internal link without anchor
+#Page_159: Internal link without anchor
+#Page_181: Internal link without anchor
+#Page_193: Internal link without anchor
+#Page_213: Internal link without anchor
+#Page_243: Internal link without anchor
+#Page_247: Internal link without anchor
+#Page_251: Internal link without anchor
+#Page_257: Internal link without anchor
+#Page_263: Internal link without anchor
+#Page_265: Internal link without anchor
+#Page_39: Internal link without anchor
+#Page_59: Internal link without anchor
+#Page_63: Internal link without anchor
+#Page_89: Internal link without anchor
Check is complete: Link Check
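
The Link Check output above is in string order (105 before 11), which makes it awkward to compare against the page sequence in the .bin. A minimal sketch for pulling the page numbers out of such a log and sorting them numerically follows; the function name `missing_pages` is made up for illustration, and only three of the reported lines are included to keep the sample short.

```python
import re

# A few lines copied from the Link Check output above (abbreviated).
log = """\
Beginning check: Link Check
+#Page_105: Internal link without anchor
+#Page_11: Internal link without anchor
+#Page_39: Internal link without anchor
Check is complete: Link Check
"""

def missing_pages(text):
    """Return page numbers named in 'Internal link without anchor' lines,
    sorted numerically rather than as strings."""
    pattern = re.compile(r"^\+#Page_(\d+): Internal link without anchor$", re.M)
    return sorted(int(n) for n in pattern.findall(text))

print(missing_pages(log))
```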

The project is "The Macedonian Campaign."

This happens in versions going back to 1.4.0; I didn't check further back than that. The input file to the regression tests was this same "h01-pregen.html," so issues in the .bin likely affected all versions of GG that used it.

Looking in the .bin, the missing page numbers have the same offsets as the two preceding "no count" pages. I believe that, in all of them, the illustrations on those "no count" pages were mid-paragraph, and moved (usually up, towards the beginning of the file) to paragraph breaks.

Looking at the sequence of early steps in preparing the .txt file, I moved the illustrations (using "Illustration Fixup") before running "Configure Page Labels." Could that be the culprit? Should "Configure Page Labels" be run before "Illustration Fixup"?

autogen-pagenum-bug.zip

charliehoward4dp commented 1 year ago

I re-did several "early" steps and found that the sequence in which "Illustration Fixup" and "Configure Page Labels" are done is not the culprit. The seed of what will cause the HTML Generator error is planted in the .bin when the Page Separators are removed. In the step before they are removed, each page descriptor in the .bin immediately following a pair of "no counts" has a larger offset than the two no-count pages. Once the Page Separators are removed, all three have the same offset. The page following the two "no counts" does have a page number in the .bin (as you can see in the supplied .bin file, above), and that page number is displayed in the Status Bar, but it is not used by the HTML Generator.
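
The condition described above (a numbered page sharing its offset with the "no count" pages right before it) can be checked mechanically. The sketch below is only an illustration: it works on hand-made records, not on the real .bin file, and the `Page` type, its field names, the sample offsets, and the function `shadowed_pages` are all assumptions invented for the example.

```python
from typing import NamedTuple

class Page(NamedTuple):
    name: str    # hypothetical page identifier
    offset: str  # position in the text, written here as "line.column"
    label: str   # page label, or "" for a "no count" page

def shadowed_pages(pages):
    """Labels of numbered pages whose offset equals that of the
    'no count' page immediately before them."""
    hits = []
    for prev, cur in zip(pages, pages[1:]):
        if cur.label and not prev.label and cur.offset == prev.offset:
            hits.append(cur.label)
    return hits

# Made-up data imitating the state after the Page Separators are removed:
# two "no count" pages and the following numbered page share one offset.
pages = [
    Page("p050", "900.0", "Pg 38"),
    Page("p051", "931.0", ""),
    Page("p052", "931.0", ""),
    Page("p053", "931.0", "Pg 39"),
    Page("p054", "960.0", "Pg 40"),
]

print(shadowed_pages(pages))  # ['Pg 39']
```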

charliehoward4dp commented 12 months ago

This problem began with version 1.4.0. It does not occur in 1.3.3 or earlier.

Speculation: Perhaps, beginning in 1.4.0, the routine that creates the <span class="pagenum"....> sees "No count" in the .bin and (correctly) does not create a span. But, maybe it also uses the "offset" value and looks for a higher (or different) "offset" value to examine as a candidate for creating another <span>, rather than looking at each subsequent line independently. If that's the logic, it will skip the line following the "No counts", because it has the same offset as the "No counts."
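
To make the speculation concrete, here is a small model of the two behaviours. It is not Guiguts code and makes no claim about how the HTML Generator is actually written; the function names and the sample (offset, label) pairs are invented, with "" standing for a "no count" page.

```python
def spans_skip_same_offset(pages):
    """Speculated behaviour: once an entry at some offset has been examined,
    later entries with the same offset are ignored."""
    out, last_offset = [], None
    for offset, label in pages:
        if offset == last_offset:
            continue  # treated as coincident with the entry already examined
        last_offset = offset
        if label:  # "no count" pages produce no span
            out.append(label)
    return out

def spans_independent(pages):
    """Alternative: look at every entry on its own."""
    return [label for _, label in pages if label]

# Made-up entries: two "no count" pages and page 39 share offset 931.0.
pages = [("900.0", "38"), ("931.0", ""), ("931.0", ""),
         ("931.0", "39"), ("960.0", "40")]

print(spans_skip_same_offset(pages))  # ['38', '40']
print(spans_independent(pages))       # ['38', '39', '40']
```

In this model the first loop drops page 39 exactly as in the report, because the first "no count" entry claims the shared offset before the numbered page is reached.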

charliehoward4dp commented 12 months ago

Did two more tests, both based on the above speculation:

  1. In the HTML Generator dialog, turned off "skip coincident pg#s". The generated HTML had no link errors, but the pagenums that had been missing were generated in the wrong places, at the start of the next page. For example, the pagenums for page 39 (which had been missing) and page 40 were generated adjacent to each other, at the start of page 40.

  2. Before running the HTML Generator, edited the .bin file (h01-pregen.html.bin) with Notepad++ and removed all "No count" lines whose offsets were identical to the offsets on the next line. Then ran the HTML Generator, and the results were correct: no "internal link without anchor" messages, and the pagenums were in the correct places.
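
The manual edit in test 2 follows a simple rule that can be sketched in code. The example below applies the rule to in-memory (offset, label) pairs only; reading and writing the actual .bin file is left out because its format is not reproduced in this thread. The function name and the sample data are invented, with "" standing for a "No count" entry.

```python
def drop_coincident_no_counts(pages):
    """Remove every 'No count' entry whose offset is identical to the offset
    of the entry that follows it in the original list."""
    kept = []
    for i, (offset, label) in enumerate(pages):
        nxt = pages[i + 1] if i + 1 < len(pages) else None
        if not label and nxt is not None and nxt[0] == offset:
            continue  # coincident "No count" entry: drop it
        kept.append((offset, label))
    return kept

# Made-up entries: the "No count" pages at 931.0 coincide with page 39;
# the one at 970.0 has its own offset and is left alone.
pages = [("900.0", "38"), ("931.0", ""), ("931.0", ""), ("931.0", "39"),
         ("960.0", "40"), ("970.0", ""), ("990.0", "41")]

for entry in drop_coincident_no_counts(pages):
    print(entry)
```

Anyone trying this against a real project would want to work on a copy of the .bin, as was done by hand here.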