Closed IanMayo closed 1 year ago
I've been working on this a bit today. I've got a partial fix, but there are still some issues. The fix I've got ensures that all the content is parsed, but it might not end up in the right order.
Specifically: if we have a div (either a PageLayer or a BottomLayer) which has some raw content in it (h1 tags, p tags etc) and some divs with top values, then we don't have a way to make sure it all appears in the correct order.
For example, in data/France1/FR_A27_Unit.html we have something that looks a bit like this (shortened for simplicity):
<div id="PageLayer3" style="position: absolute; width: 952px; height: 527px; left: 30px; top: 0px">
<h1>PROPULSION</h1>
<p> </p>
<p> </p>
<p> </p>
<p>Content here</p>
<div style="position: absolute; left: 91px; top: 27px; width: 781px;">
Image or Table Here
</div>
</div>
The internal div with a top value has a table or an image, which is placed (by the top value) where the p tags with the nbsp values are. That is, the table or image appears above the Content here
bit. However, as we only have top values for one bit of this (the internal div), we don't know where to place it relative to the other content. We have no way of knowing that it is meant to appear below the h1 but above the Content here
bit.
Do you have any suggestions as to what we should do here?
We need to understand how often this pattern appears in the real data. Specifically, we are looking for PageLayer or BottomLayer divs that have multiple tags in them, where not all of them have a top value. If it appears relatively infrequently it might be best to either fix at source (either by adding top values to everything, or just moving the table/image to the right place) or fix in the converted DITA files manually after conversion. If it appears a lot then we'll have to try and come up with some way of dealing with this. For simple cases like this with just one div with a top value we could probably look for the nbsp values and place it in there - but there may well be far more complicated cases with multiple divs with top values, with multiple pieces of content without top values spread around it.
Yes, I agree - out will be a real challenge to solve this programmatically. I'll be happy if our parser reports instances of this behaviour (i.e. content away from the start of a top level div that contains 'top' values), and I correct them by physically moving the content to the correct location. I can determine the destination by looking at the rendered page.
I've added a warning, and tightened up the checks for when we produce the warning so we don't get too many false positives.
We get the following errors for our mock data:
WARNING: Elements with no top value found inside element with ID PageLayer1 in file /home/LegacyMan/data/France1/FR_Transducers.html
WARNING: Elements with no top value found inside element with ID PageLayer2 in file /home/LegacyMan/data/France1/FR_A27_Unit.html
WARNING: Elements with no top value found inside element with ID PageLayer1 in file /home/LegacyMan/data/Wales_Cmplx/WA_transducers.html
WARNING: Elements with no top value found inside element with ID None in file /home/LegacyMan/data/Wales.Legacy/Wales_Legacy.html
WARNING: Elements with no top value found inside element with ID PageLayer1 in file /home/LegacyMan/data/Wales.Legacy/WA_transducers.html
WARNING: Elements with no top value found inside element with ID None in file /home/LegacyMan/data/Wales_Cmplx/Wales_Cmplx1.html
WARNING: Elements with no top value found inside element with ID PageLayer1 in file /home/LegacyMan/data/Britain.Legacy/BR_transducers.html
WARNING: Elements with no top value found inside element with ID PageLayer1 in file /home/LegacyMan/data/Spain.Legacy/SP_transducers.html
WARNING: Elements with no top value found inside element with ID PageLayer1 in file /home/LegacyMan/data/Spain_Cmplx/SP_transducers.html
Most of them are from files with the same basic structure - the XX_transducers.html
file. If you look at those files you'll find they do have mixed content with and without top
values. Eg in data/Britain.Legacy/BR_transducers.html
there is a div called Table
with a top value, and then a blockquote
element with no top value.
Are you happy to look through these and check that a) they're valid warnings, and b) decide how we want to deal with them?
Yes, I agree - it will be a real challenge to solve this programmatically. I'll be happy if our parser reports instances of this behaviour (i.e. content away from the start of a top level div that contains 'top' values), and I correct them by physically moving the content to the correct location. I can determine the destination by looking at the rendered page.
I'm happy to check these instances in our mock data
Hello @robintw - I'm playing with the update. I think our problem is for child elements that do
have top value. For the ones that don't have a top value - we just handle them inline, like we do now. It's the ones with
a top
that I'm expecting to have to handle manually.
Aah, one exception:
Where an outer-level
div has a top (like all 3 of those above) then I can ignore them - since we just make sure the content within that block is consistent.
But if there is another element that is a sister of div id="table"
then I do need to investigate it, since the sister could be located higher up than this one.
Yes, I think that's right - although our code generally expects to find divs with a top value, so it's alerting us when there is something which doesn't have a top value. I think it's warning us about the right places at the moment - do you agree? The problem is when there is a mix of things with top values and things without.
The best way to solve this will depend on the exact situation. A couple of examples:
In general, the vast majority of the mock data (and, I assume, the real data) has top values for everything - so we don't run into this problem.
Yes, I'm handling example 2.
I'm less clear on example 1. Are we re-sequencing children with top
values? I didn't think we were.
I thought we were just using top
of top-level elements to ensure we provided the correct anchor ids, and to correctly sequence floating content at very top level.
Yes, we're re-sequencing children of BottomLayer/PageLayer divs that have top values.
For example, for a structure like:
PageLayer1
div1 - top = 50
div2 - top = 10
div3 - top = 100
We will sort these by top values to put them into the DITA file in the order: div2, div1, div3. If we don't do this then we get images and text in the wrong order in the output.
Oh that's great. Do we do it down through the child elements too? There are two layers of children in the above screenshot.
And - yes I see how I can use example1
to fix top-less child elements (children
just sounded wrong there 🤣 ).
No, we only do it at the level below either PageLayer or BottomLayer
So we will re-order top-level divs, then any children of a PageLayer or BottomLayer div - but not recursively for either of those.
Sure, ok. Let me look out for pattern(s) in the real data.
I've reviewing the top missing
warnings, and the first few I've looked at are for very simple files that just contain a single table:
https://github.com/DeepBlueCLtd/LegacyMan/blob/develop/data/Britain.Legacy/Bravo_Transducer.html#L49
In your description above you explained how the warning could be overcome by adding a top
value. Would I add that to the table
element? Wouldn't the logic carry on descending the tree until found the next child without a top
value?
I think I've got the logic for items that need attention. We descend the PageLayer
tree until we get to a level with multiple children. We throw the warning if these children have a mix of with-top
and without-top
style attributes, but if they are consistently one or the other.
Hello @robintw - I've reproduced a file where the first element isn't being processed. The new release does resolve a number of instances where the first element isn't identified, but this instance still breaks: https://github.com/DeepBlueCLtd/LegacyMan/blob/develop/data/Britain_Cmplx/unit_a28.html
I've fixed that in #531. It was actually a case of the 'shopping list' not having the #### First Page Layer
marker in it, because it hadn't detected a link to that page without a #anchor
in the link. This was because links from category pages weren't being added to the shopping list properly.
We should probably discuss whether we should add a #### First Page Layer
entry to all links that we find. At the moment we don't, we have to have a link to the page without any anchors specified (eg. as #anchor
). This was to deal with bits of pages that aren't linked from anywhere and therefore shouldn't be converted, but I don't know whether this ever applies for the first part of any page. What do you think?
So, with hindsight I should have tested the new PR rather than spent the time mocking up a breaking piece of content 🤣
I look forward to testing this tomorrow. I think "always handle first pagelayer" is a good strategy.. But, I guess we could defer that until I'm able to run more content tests. Each time we pick up more content I have a batch of content fixes to make...
Ah no, I meant I've just fixed it in that PR. It wasn't fixed this morning, don't worry!
Closing this issue. I still have lots of relative/absolute layout issues to resolve, but we're picking up lots of content that was being missed before.
This file has a paragraph of content that starts with an image, then has some paragraphs of content: https://github.com/DeepBlueCLtd/LegacyMan/blob/develop/data/France1/FR_Phase_G_Size.html#L47
But, once parsed to DITA, only the image is present.
Aah, it looks like the issue is here: https://github.com/DeepBlueCLtd/LegacyMan/blob/develop/parser/html_to_dita.py#L109
We're dropping the rest of the content. I guess we should be checking if there is just the single image in the div.
(I know I could (or even should) fix this, but I value having two pairs of eyes looking at the issue)
Update Ian to closely reproduce A27.