DeepBlueCLtd / LegacyMan

Legacy content for Field Service Manual
https://deepbluecltd.github.io/LegacyMan/index.html
Apache License 2.0

Parsing pages - process first element first #577

Closed IanMayo closed 9 months ago

IanMayo commented 11 months ago

I'm looking at britain_complx\unit_banjo.html.

In the HTML file, the first BottomLayer is the signatures-table, but in unit_banjo.dita the Remarks section comes first.

I think this is because we are processing the named elements from the shopping-list before the ## first page layer: image

I guess this is because number4 is linked to before any other targets on the page.

When we're collating the shopping lists, can we put the '### first page layer' entry at the start? For the unit documents, the first element in the document is always the first one that document viewers see.

robintw commented 10 months ago

I've investigated this, and there's actually a different cause.

We currently take the 'pages' (normally BottomLayer divs, but sometimes other divs) that we find, and sort them by their top value, and put them in the output page in ascending order of the top value.

However, in this case, the top values we've got are:

[('BottomLayer2', 111), ('BottomLayer', 125), ('BottomLayer3', 5353)]

They should be in the order BL, BL2 and BL3. However, BottomLayer2 is a child of GrayLayer3, and so the top value it has is relative to other things in the GrayLayer, not to the page as a whole. In this example, BottomLayer is a direct child of the body element, as is BottomLayer3 - so we've got two top measurements on one scale (the whole page) and one on a different scale (within a GrayLayer), so the sorting isn't giving us the result we want.
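As a small illustration of the problem described above, sorting the three quoted values by their raw top value reproduces the unwanted order. This is only a sketch using the numbers from this comment, not the project's actual sorting code.

```python
# (div id, top value) pairs exactly as quoted in the comment above
pages = [("BottomLayer2", 111), ("BottomLayer", 125), ("BottomLayer3", 5353)]

# Sort in ascending order of the raw top value, ignoring nesting
by_top = sorted(pages, key=lambda page: page[1])
order = [name for name, _ in by_top]

print(order)  # ['BottomLayer2', 'BottomLayer', 'BottomLayer3']
```

BottomLayer2 sorts first only because its 111 is measured inside the GrayLayer, not from the top of the page.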

Do you have any idea how often this situation occurs? That is, a BottomLayer within some other element, not directly within the body element? I checked a few other files at random and didn't find it, but that doesn't mean it's not present within the real data. The comment in the source file for this example says:

    <!-- this is an example of a BottomLayer appearing inside a GrayLayer, as observed in file A13 -->

We could probably fix this by finding the ancestor of the BottomLayer that is a direct child of the body element, and using the top value of that ancestor. However, I'm wondering whether that might cause other problems with separate layers for images etc. For BottomLayers that are direct children of the body, it will behave as it does currently.
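A minimal sketch of that idea, assuming a simplified in-memory model of the divs. The `Div` class, the `sort_key` helper and the GrayLayer3 top of 200 are all hypothetical; the thread does not give the GrayLayer's real top value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Div:
    name: str
    top: int
    parent: Optional["Div"] = None  # None means a direct child of <body>


def sort_key(div: Div) -> int:
    """Return the top value of the ancestor that sits directly under <body>."""
    while div.parent is not None:
        div = div.parent
    return div.top


gray = Div("GrayLayer3", 200)  # 200 is a made-up value for illustration
pages = [
    Div("BottomLayer2", 111, parent=gray),
    Div("BottomLayer", 125),
    Div("BottomLayer3", 5353),
]

order = [d.name for d in sorted(pages, key=sort_key)]
print(order)  # ['BottomLayer', 'BottomLayer2', 'BottomLayer3']
```

One limitation of this approach: two pages that share the same top-level ancestor get the same key, so they would simply keep their input order (Python's sort is stable).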

What do you think?

IanMayo commented 10 months ago

I think it's sound logic for the top value we store in the dictionary to be the arithmetic sum of the top value of the element we find and the top values of all parent divs that have one, because that is effectively how far down the rendered content the element appears.

When we have a BottomLayer inside a GrayLayer, I'm pretty sure all of the parent divs have a top, but it would be good if the logic allowed for an immediate parent div without a top, but where the ultimate parent does have one.
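The summing logic could look roughly like the sketch below. It uses only the standard library's `html.parser`, not the project's real parser, and it assumes that top values appear as `top:NNNpx` in an inline `style` attribute. The sample HTML, the GrayLayer3 top of 200 and the `CumulativeTopParser` name are hypothetical. A div without a top contributes 0, which covers the case of an immediate parent with no top.

```python
import re
from html.parser import HTMLParser

# Matches a standalone "top" declaration, but not e.g. "margin-top"
TOP_RE = re.compile(r"(?:^|;)\s*top\s*:\s*(-?\d+)px", re.IGNORECASE)

# Hypothetical sample: BottomLayer2 sits inside a div with no top,
# which in turn sits inside GrayLayer3
SAMPLE_HTML = """
<body>
<div id="BottomLayer" style="position:absolute; top:125px"></div>
<div id="GrayLayer3" style="position:absolute; top:200px">
  <div>
    <div id="BottomLayer2" style="position:absolute; top:111px"></div>
  </div>
</div>
<div id="BottomLayer3" style="position:absolute; top:5353px"></div>
</body>
"""


class CumulativeTopParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []  # cumulative top of each currently open div
        self.tops = {}   # div id -> sum of its own and its ancestors' tops

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attrs = dict(attrs)
        match = TOP_RE.search(attrs.get("style") or "")
        own_top = int(match.group(1)) if match else 0  # no top counts as 0
        total = (self.stack[-1] if self.stack else 0) + own_top
        self.stack.append(total)
        if attrs.get("id"):
            self.tops[attrs["id"]] = total

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            self.stack.pop()


parser = CumulativeTopParser()
parser.feed(SAMPLE_HTML)

# Keep only the page divs and sort them by their summed top value
pages = {k: v for k, v in parser.tops.items() if k.startswith("BottomLayer")}
order = sorted(pages, key=pages.get)
print(order)  # ['BottomLayer', 'BottomLayer2', 'BottomLayer3']
```

With the made-up GrayLayer3 top of 200, BottomLayer2 ends up at 200 + 0 + 111 = 311, which places it between BottomLayer (125) and BottomLayer3 (5353).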

robintw commented 10 months ago

Ah yes, great idea to sum the top values (why didn't I think of that!). I'll get on that later.

IanMayo commented 9 months ago

Ian to check onsite whether this is still an issue, by looking at file A13.

IanMayo commented 9 months ago

This is still an issue. I have fixed it in file A13 by moving the block out of the parent block and incrementing the top value by the parent top. It has parsed and published correctly (though I had to run with no-skip-first-run).

We have two choices:

  1. fix the parser so it correctly locates pages vertically when they are nested inside other pages
  2. find a way of spotting this pattern, and report the instances (so I can manually fix them).

Option 2 is probably easiest, but I don't think it's as simple as looking at the order of items in link_tracker.json, since I think they are in order of being encountered, not vertical sequence.
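For option 2, one possible way to spot the pattern is to scan each HTML file for page divs that are opened while another div is still open. The sketch below uses the standard library's `html.parser`. The `NestedPageFinder` class, the `find_nested_pages` helper, the sample HTML and the assumption that page ids start with "BottomLayer" are all hypothetical, and pages held in other kinds of div would need a different test.

```python
from html.parser import HTMLParser

# Hypothetical sample with one BottomLayer nested inside a GrayLayer
SAMPLE_HTML = """
<body>
<div id="BottomLayer" style="top:125px"></div>
<div id="GrayLayer3" style="top:200px">
  <div>
    <div id="BottomLayer2" style="top:111px"></div>
  </div>
</div>
<div id="BottomLayer3" style="top:5353px"></div>
</body>
"""


class NestedPageFinder(HTMLParser):
    def __init__(self, page_prefix="BottomLayer"):
        super().__init__()
        self.page_prefix = page_prefix
        self.open_divs = []  # ids (or None) of the currently open divs
        self.nested = []     # (page id, nearest enclosing div id or None)

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        div_id = dict(attrs).get("id")
        # A page div opened while another div is still open is nested
        if div_id and div_id.startswith(self.page_prefix) and self.open_divs:
            named = [d for d in self.open_divs if d]
            self.nested.append((div_id, named[-1] if named else None))
        self.open_divs.append(div_id)

    def handle_endtag(self, tag):
        if tag == "div" and self.open_divs:
            self.open_divs.pop()


def find_nested_pages(html):
    """Return (page id, enclosing div id) pairs for nested page divs."""
    finder = NestedPageFinder()
    finder.feed(html)
    return finder.nested


report = find_nested_pages(SAMPLE_HTML)
print(report)  # [('BottomLayer2', 'GrayLayer3')]
```

Running this over every source file would give a list of files to fix by hand, without relying on the order of items in link_tracker.json.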

Unit_Banjo in Britain_Complx remains a valid instance of this pattern.

robintw commented 9 months ago

This should be sortable automatically, by using the sum of the top values up the tree. We now have code to do that (for the defloating stuff), so integrating it shouldn't be particularly difficult. I'll do it ASAP, but it may not be until the end of the week.

IanMayo commented 9 months ago

Thanks - that sounds fine. I'll make a note to come back to the issue.