DeepBlueCLtd / LegacyMan

Legacy content for Field Service Manual
https://deepbluecltd.github.io/LegacyMan/index.html
Apache License 2.0
2 stars 0 forks source link

Move floating elements (such as tables and images) to the correct chunk of empty space in the page #637

Closed robintw closed 7 months ago

robintw commented 7 months ago

Currently in-progress and uploaded for testing.

Fixes #636.

IanMayo commented 7 months ago

Have introduced model pattern of data that gives this error when run from run_defloat_conversion.ppy:

no suitable floating elements for blank space at line 418 of data/Fr_Legacy/FR_100_unit_charlie.html

IanMayo commented 7 months ago

Note: I get this pattern hundreds of times:

Warning: no suitable floating elements for blank space at line 123 of /Users/ianmayo/git/LegacyMan/data/Spain.Legacy/sp_charlie_pics.html
Warning: no suitable floating elements for blank space at line 201 of /Users/ianmayo/git/LegacyMan/data/Spain.Legacy/sp_charlie_pics.html
Warning: no suitable floating elements for blank space at line 217 of /Users/ianmayo/git/LegacyMan/data/Spain.Legacy/sp_charlie_pics.html
Warning: no suitable floating elements for blank space at line 233 of /Users/ianmayo/git/LegacyMan/data/Spain.Legacy/sp_charlie_pics.html

It's for pages of images, and a lot of the content sits inside a top level PicLayer element. The SP_Charlie_Pics.dita and the HTML file generated from it are both valid. So, I guess our subsequent logic handles it correctly, but it would seem to be useful if the defloat logic handled it correctly, since that may help in other places.

IanMayo commented 7 months ago

One last item:

[gen-list] file:/Users/ianmayo/git/LegacyMan/target/dita/Spain.Legacy/Bravo_Transducer.dita Line 119:The content of element type "ph" must match "(boolean|cite|keyword|markupname|apiname|option|parmname|cmdname|msgnum|varname|wintitle|numcharref|parameterentity|textentity|xmlatt|xmlelement|xmlnsname|xmlpi|ph|b|i|line-through|overline|sup|sub|tt|u|codeph|synph|filepath|msgph|systemoutput|userinput|menucascade|uicontrol|equation-inline|q|term|abbreviated-form|text|tm|xref|state|data|sort-as|data-about|foreign|svg-container|mathml|unknown|image|draft-comment|fn|indextermref|indexterm|required-cleanup)".

This is related to our extra div elements. The parent one is converted to a ph, but a ph cannot contain a p as in this generated DITA code: image

I expect our fix extra divs/spans will resolve this.

robintw commented 7 months ago

Is this issue in your first comment (Have introduced model pattern of data that gives this error when run from) the same as the issue in the second comment (Note: I get this pattern hundreds of times:)?

The second one occurs because of the way the pics files are written: everything is within a picLayer, and so every bit of repeated blank space is counted as one that we might want to fill (as they're inside a div not just inside the base body tag). Should I stop reporting those errors when they appear within a picLayer div? The same logic is probably producing useful warnings for non-picture files.

IanMayo commented 7 months ago

Is this issue in your first comment (Have introduced model pattern of data that gives this error when run from) the same as the issue in the second comment (Note: I get this pattern hundreds of times:)?

No, they are separate observations. The new instance of mock data throws a single no suitable floating element error. But, it's a different pattern that throws the multiple no suitable floating elements for one file issue. I think they happen in the picLayer content pattern.

Would it be possible for our code to handle the pattern where lots of content is in a top level picLayer element? There are lots of files that match that pattern, and it would be reassuring to know we're handling it via deliberate defloat logic. Aaah no. Maybe our existing functionality is acceptable - where we match floating image elements with the number2 HTML anchor they relate to (according to the shopping list) . If that's the case - then we shouldn't raise a no floating element warning for those files. I'm pretty sure we successfully remove the whitespace from those files too (since it's between PageLayer elements, not inside them.

robintw commented 7 months ago

I've pushed quite a few updates this evening - sorry for my lack of work on this over the last week (other work got quite busy).

I think I've fixed the issues you raised here with the new example, and stopped reporting errors for pics files. I then spent quite a while battling BeautifulSoup while dealing with unit_banjo.html. I think I've solved it now, but it isn't solved in a very nice way.

So, I have a question:

At line 908 of data/Wales_Cmplx/unit_banjo.html there is a BottomLayer3, which doesn't have a PageLayer child. Instead, there is a layer1 child. How common is it for pages to not be in a PageLayer child? And if there isn't a PageLayer child, is it always called something like layer1?

The problem we're running into here (which I sort of described in a comment by the yucky code, but not very well) is that initially we weren't finding the table in that BottomLayer as a floating element, as it wasn't in a PageLayer. But, once we changed our logic to deal with floating elements that aren't inside PageLayers, we picked up this layer1 as the floating element that we need to place. Unfortunately the blank elements that we want to replace with the floating element are inside that floating element - so we end up with circular references and chaos ensues. The yucky code currently searches for these circular references and tries to remove them - which seems to work, but is probably quite slow (particularly when run on the real data). If we could just ignore things like layer1 inside a BottomLayer then that'd be a lot easier (by ignore here, I mean not return layer1 as a potential floating element to be moved, but just any children of layer1 that are floating elements).

I hope that vaguely makes sense, it's late and I'm tired.

Let me know what you think.

IanMayo commented 7 months ago

At line 908 of data/Wales_Cmplx/unit_banjo.html there is a BottomLayer3, which doesn't have a PageLayer child. Instead, there is a layer1 child. How common is it for pages to not be in a PageLayer child? And if there isn't a PageLayer child, is it always called something like layer1?

Now I've gone to inspect that file in the real content, I see I rename layer to PageLayer3, since it was holding me back. I'll look out for any other instances of this pattern.