DeepBlueCLtd / LegacyMan

Legacy content for Field Service Manual
https://deepbluecltd.github.io/LegacyMan/index.html
Apache License 2.0

Image/Tables floating in whitespace #636

Closed IanMayo closed 7 months ago

IanMayo commented 7 months ago

The first pattern is a simple one. (Update: oops, no it isn't).

An image is "floated above" the content in a div or span. But in the HTML the image is defined at the correct logical location, followed by a block of whitespace.

We should identify this as follows:

Hmm, actually I don't think the above is a satisfactory solution. There's a fair chance that the image is actually positioned somewhere else on the page, over a different block of whitespace.

So, sadly, I think we have to handle the generic instance:

  1. find all floating divs/spans (where they are not blacklisted either by the div name or image filename)
  2. walk back up parent tree to generate top coord in the whole-page coord space
  3. order floating div/span contents by top
  4. create list of whitespace blocks within page elements of the file
  5. create list of top level elements that contain whitespace, generate the top for these parent elements
  6. for each top level element, find images with top value >= that element top, but < the next element top
  7. sequentially insert grouped images/tables into whitespace for top level blocks, ensuring the calculated top of the image is greater than or equal to the top of the closest absolute-positioned parent of the whitespace.
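
Steps 2, 3 and 6 above could be sketched roughly as below. This is only an illustration under assumptions: `Node`, `page_top` and `group_by_block` are hypothetical names, the element tree is a hand-built stand-in rather than the real parsed HTML, and only `top: Npx` on absolutely positioned elements is considered.

```python
import bisect
import re
from dataclasses import dataclass
from typing import Optional

# Matches "top: 61px" but not "margin-top: 61px".
TOP_RE = re.compile(r"(?<![-\w])top:\s*(-?\d+)px")
ABSOLUTE_RE = re.compile(r"position:\s*absolute")


@dataclass
class Node:
    """Hypothetical stand-in for a parsed HTML element."""
    name: str
    style: str = ""
    parent: Optional["Node"] = None


def own_top(node: Node) -> int:
    """The element's own top offset, or 0 if it is not absolutely positioned."""
    match = TOP_RE.search(node.style)
    if match and ABSOLUTE_RE.search(node.style):
        return int(match.group(1))
    return 0


def page_top(node: Node) -> int:
    """Step 2: walk back up the parent tree, summing top offsets."""
    total = 0
    current: Optional[Node] = node
    while current is not None:
        total += own_top(current)
        current = current.parent
    return total


def group_by_block(floats, blocks):
    """Steps 3 and 6: order floats by top, then bucket each one into the
    block whose top is <= the float's top and whose successor's top is greater."""
    blocks = sorted(blocks, key=lambda b: b[1])
    tops = [top for _, top in blocks]
    groups = {name: [] for name, _ in blocks}
    for item in sorted(floats, key=page_top):
        index = bisect.bisect_right(tops, page_top(item)) - 1
        if index >= 0:
            groups[blocks[index][0]].append(item.name)
    return groups


layer = Node("PageLayer2", "position: absolute; left: 0px; top: 200px")
img_span = Node("img-span", "position: absolute; top: 50px", parent=layer)
blocks = [("PageLayer1", 0), ("PageLayer2", 200), ("PageLayer3", 900)]
print(page_top(img_span))
print(group_by_block([img_span], blocks))
```

Blacklisting (step 1) and the actual insertion into whitespace (step 7) are left out of this sketch.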

Note: once images have been inserted into whitespace, other blocks of whitespace should be removed.

A common pattern for the above is demonstrated below. Note how step 7 above ensures this image is associated with this whitespace. The coords of the whitespace parent (1516px + 0px) are <= the calculated top of the image (1516px + 0px + 73px) https://github.com/DeepBlueCLtd/LegacyMan/blob/9a5790474ba5493d1253397123211213010c8904/data/Britain_Cmplx/unit_a28.html#L302
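
As a worked check of that comparison, using the pixel values quoted above (the function name `whitespace_can_host` is hypothetical, not taken from the codebase):

```python
def whitespace_can_host(whitespace_parent_top: int, image_top: int) -> bool:
    """The image's calculated top must be greater than or equal to the top
    of the closest absolute-positioned parent of the whitespace."""
    return whitespace_parent_top <= image_top


whitespace_parent_top = 1516 + 0   # parent offsets, in px
image_top = 1516 + 0 + 73          # same parents plus the image's own 73px

print(whitespace_can_host(whitespace_parent_top, image_top))  # True
```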

Oops, I've got to take this back. We do almost handle the above correctly. We remove the floating parent of the image, and put it inline. In the above case, the image is left in the correct place, and we just have the trailing whitespace to remove.

Will carry on looking for instances tomorrow (Tues) that do require positioning to be fixed. I do know there are lots of them - there's a pattern where the images are all declared right at the end of the file.

Aah, actually - the above pattern (image is floated, but inserted into content at the correct logical place) does represent a high proportion of the issues. We have code to remove floating parent for images. Maybe we modify that to check it's roughly in the correct place (a low top coord: under 800px seems to work so far) - and if that's the case we just remove any trailing whitespace. If we defer warn-blank-lines until after the above processing is complete, we should have a lot fewer instances to track. File unit_a28 does represent a valid case for the above.
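
A minimal sketch of that check, assuming the "low top coord" is the floating element's own `top` value and that the content following it is available as a list of markup strings. `handle_float` and `strip_trailing_blanks` are hypothetical names, and 800px is only the working threshold mentioned above.

```python
ROUGHLY_IN_PLACE_MAX_TOP = 800  # px; "under 800px seems to work so far"
BLANK_PARAGRAPH = "<p>&nbsp;</p>"


def strip_trailing_blanks(following):
    """Drop the run of blank paragraphs that directly follows the image."""
    index = 0
    while index < len(following) and following[index].strip() == BLANK_PARAGRAPH:
        index += 1
    return following[index:]


def handle_float(own_top, following):
    """Return the chosen action plus the (possibly trimmed) following content."""
    if own_top < ROUGHLY_IN_PLACE_MAX_TOP:
        # Image is already at roughly the right logical place: keep it
        # inline and only remove the whitespace reserved for it.
        return "inline", strip_trailing_blanks(following)
    # Otherwise it still needs the generic repositioning steps.
    return "reposition", following


content = ["<p>&nbsp;</p>", "<p>&nbsp;</p>", "<p>Next paragraph</p>"]
print(handle_float(73, content))
print(handle_float(1200, content))
```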

Ok, France1\FR_A27_Unit.html includes a floating table and an image, where the image is (slightly) in the wrong place. Also Britain_Complex\unit_28.html includes an image that is in the correct place. Oh, and there is a floating propulsion table in Britain.Legacy/unit_delta(13).html

robintw commented 7 months ago

I've done some work on this today. I've taken the "hard" approach of dealing with the generic instance (following the numbered list of steps you gave).

I've tried it on the first two files you mentioned: France1\FR_A27_Unit.html and Britain_Complex\unit_28.html. I've attached the converted output for those two files to this message. The output in these files is just the original file with the floating elements moved into the right place (and stopped from being floating). There has been no other processing done to these files - they haven't been converted to DITA or anything - it's just the floating->non-floating conversion.

unit_a28_converted.txt FR_A27_Unit_converted.txt

(They're named with .txt as Github won't let me upload files with a .html extension)

Can you have a look and check they look sensible? There shouldn't be any strings of blank elements anywhere inside a PageLayer (there will still be runs of blank elements just within the body element as they give space for the PageLayers to be placed in the right place), and the images/tables should be in the right place. I've looked at them and think they're correct, but it'd be good to get another set of eyes on it.

You mentioned there was a floating table in Britain.Legacy/unit_delta(13).html. My code doesn't find it, and I can't find it manually either - can you check whether this is correct?

Assuming this is all correct, I will try integrating it into the main processing of files.

IanMayo commented 7 months ago

That's great. Thanks Robin.

  1. Conversions. The images are now no longer in divs/spans that are absolutely positioned, and the trailing whitespace has been removed. We will eventually wish to remove the div/span surrounding the image, but I suspect our existing logic may handle that. One last item though: while we remove the trailing <p>&nbsp;</p> entries, the newlines are left in the file (see below). Is it possible to remove the newlines too?
  2. Missing floating table. The floating table in Britain.Legacy\unit_delta(13).html is at line 987.

Newline chars left in file: image

robintw commented 7 months ago

I'll look into removing the newlines - I think those probably come up as NavigableString instances. I'll see what I can do.
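
For illustration only, here is one way the leftover newlines could be avoided if the cleanup were done on the raw text with a regular expression instead of on the parsed tree. This is an assumption for the sake of the example, not how the existing code works; `remove_blank_paragraphs` is a hypothetical helper, and note that it removes every blank paragraph it sees rather than only the trailing ones.

```python
import re

# A blank paragraph, any indentation before it, and the newline after it.
BLANK_P_WITH_NEWLINE = re.compile(r"[ \t]*<p>&nbsp;</p>[ \t]*\r?\n?")


def remove_blank_paragraphs(html: str) -> str:
    """Delete each blank paragraph together with its trailing newline."""
    return BLANK_P_WITH_NEWLINE.sub("", html)


html = "<img src='a.png'>\n<p>&nbsp;</p>\n<p>&nbsp;</p>\n<p>Text</p>\n"
print(repr(remove_blank_paragraphs(html)))
```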

The reason that the floating table in unit_delta(13).html wasn't being moved was because it wasn't finding a set of blank <p> tags to move it to. The HTML looks like this:

      <h1>PROPULSION</h1>
      <p>&nbsp;</p>
      <p>&nbsp;</p>
      <div style="position: absolute; width: 781px; left: 84px; top: 61px">
      </div>
      <p>&nbsp;</p>
      <p>&nbsp;</p><br>
      <p>&nbsp;</p>
      <p>&nbsp;</p>

with the table in the floating div. We look for at least 5 empty <p> tags next to each other, and here there are 2 before it and 4 after it. What do you think? Should I reduce the number of <p> tags to look for to 4? Or 3?

I'll get on to integrating this into the rest of the code as soon as I can.

IanMayo commented 7 months ago

Hello - yes - let's drop the number of consecutive newlines to 4, please. We may have to drop it to 3 at a later date, but at least we'll have got rid of a large block of floating tables.
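
To show how the threshold plays out on the PROPULSION snippet quoted earlier, here is a rough, line-based sketch. `blank_runs` is a hypothetical helper, and treating a trailing `<br>` as not breaking a run is an assumption made for this example.

```python
import re

# A line holding only a blank paragraph, optionally followed by <br>.
BLANK_P = re.compile(r"^\s*<p>&nbsp;</p>(?:<br>)?\s*$")


def blank_runs(lines, min_run):
    """Return (start_index, length) for each run of at least
    min_run consecutive blank paragraphs."""
    runs = []
    start = None
    # The empty sentinel line closes a run that reaches the end of the input.
    for index, line in enumerate(list(lines) + [""]):
        if BLANK_P.match(line):
            if start is None:
                start = index
        else:
            if start is not None and index - start >= min_run:
                runs.append((start, index - start))
            start = None
    return runs


snippet = """
<h1>PROPULSION</h1>
<p>&nbsp;</p>
<p>&nbsp;</p>
<div style="position: absolute; width: 781px; left: 84px; top: 61px">
</div>
<p>&nbsp;</p>
<p>&nbsp;</p><br>
<p>&nbsp;</p>
<p>&nbsp;</p>
""".strip().splitlines()

print(blank_runs(snippet, 5))  # no run is long enough
print(blank_runs(snippet, 4))  # the run of 4 after the div qualifies
```

With a threshold of 5 nothing is found, which matches the table not being moved; with 4 the run after the floating div becomes a candidate.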