Content not parsed - Githubissues

IanMayo commented 1 year ago

This file has a paragraph of content that starts with an image, then has some paragraphs of content: https://github.com/DeepBlueCLtd/LegacyMan/blob/develop/data/France1/FR_Phase_G_Size.html#L47

But, once parsed to DITA, only the image is present.

Aah, it looks like the issue is here: https://github.com/DeepBlueCLtd/LegacyMan/blob/develop/parser/html_to_dita.py#L109

We're dropping the rest of the content. I guess we should be checking if there is just the single image in the div.

(I know I could (or even should) fix this, but I value having two pairs of eyes looking at the issue)

Update Ian to closely reproduce A27.

robintw commented 1 year ago

I've been working on this a bit today. I've got a partial fix, but there are still some issues. The fix I've got ensures that all the content is parsed, but it might not end up in the right order.

Specifically: if we have a div (either a PageLayer or a BottomLayer) which has some raw content in it (h1 tags, p tags etc) and some divs with top values, then we don't have a way to make sure it all appears in the correct order.

For example, in data/France1/FR_A27_Unit.html we have something that looks a bit like this (shortened for simplicity):

<div id="PageLayer3" style="position: absolute; width: 952px; height: 527px; left: 30px; top: 0px">
    <h1>PROPULSION</h1>
    <p>&nbsp;</p>
    <p>&nbsp;</p>
    <p>&nbsp;</p>
    <p>Content here</p>
    <div style="position: absolute; left: 91px; top: 27px; width: 781px;">
        Image or Table Here
    </div>
</div>

The internal div with a top value has a table or an image, which is placed (by the top value) where the p tags with the nbsp values are. That is, the table or image appears above the Content here bit. However, as we only have top values for one bit of this (the internal div), we don't know where to place it relative to the other content. We have no way of knowing that it is meant to appear below the h1 but above the Content here bit.

Do you have any suggestions as to what we should do here?

We need to understand how often this pattern appears in the real data. Specifically, we are looking for PageLayer or BottomLayer divs that have multiple tags in them, where not all of them have a top value. If it appears relatively infrequently it might be best to either fix at source (either by adding top values to everything, or just moving the table/image to the right place) or fix in the converted DITA files manually after conversion. If it appears a lot then we'll have to try and come up with some way of dealing with this. For simple cases like this with just one div with a top value we could probably look for the nbsp values and place it in there - but there may well be far more complicated cases with multiple divs with top values, with multiple pieces of content without top values spread around it.

IanMayo commented 1 year ago

Yes, I agree - out will be a real challenge to solve this programmatically. I'll be happy if our parser reports instances of this behaviour (i.e. content away from the start of a top level div that contains 'top' values), and I correct them by physically moving the content to the correct location. I can determine the destination by looking at the rendered page.

robintw commented 1 year ago

I've added a warning, and tightened up the checks for when we produce the warning so we don't get too many false positives.

We get the following errors for our mock data:

WARNING:  Elements with no top value found inside element with ID PageLayer1 in file /home/LegacyMan/data/France1/FR_Transducers.html
WARNING:  Elements with no top value found inside element with ID PageLayer2 in file /home/LegacyMan/data/France1/FR_A27_Unit.html
WARNING:  Elements with no top value found inside element with ID PageLayer1 in file /home/LegacyMan/data/Wales_Cmplx/WA_transducers.html
WARNING:  Elements with no top value found inside element with ID None in file /home/LegacyMan/data/Wales.Legacy/Wales_Legacy.html
WARNING:  Elements with no top value found inside element with ID PageLayer1 in file /home/LegacyMan/data/Wales.Legacy/WA_transducers.html
WARNING:  Elements with no top value found inside element with ID None in file /home/LegacyMan/data/Wales_Cmplx/Wales_Cmplx1.html
WARNING:  Elements with no top value found inside element with ID PageLayer1 in file /home/LegacyMan/data/Britain.Legacy/BR_transducers.html
WARNING:  Elements with no top value found inside element with ID PageLayer1 in file /home/LegacyMan/data/Spain.Legacy/SP_transducers.html
WARNING:  Elements with no top value found inside element with ID PageLayer1 in file /home/LegacyMan/data/Spain_Cmplx/SP_transducers.html

Most of them are from files with the same basic structure - the XX_transducers.html file. If you look at those files you'll find they do have mixed content with and without top values. Eg in data/Britain.Legacy/BR_transducers.html there is a div called Table with a top value, and then a blockquote element with no top value.

Are you happy to look through these and check that a) they're valid warnings, and b) decide how we want to deal with them?

IanMayo commented 1 year ago

Yes, I agree - it will be a real challenge to solve this programmatically. I'll be happy if our parser reports instances of this behaviour (i.e. content away from the start of a top level div that contains 'top' values), and I correct them by physically moving the content to the correct location. I can determine the destination by looking at the rendered page.

IanMayo commented 1 year ago

I'm happy to check these instances in our mock data

IanMayo commented 1 year ago

Hello @robintw - I'm playing with the update. I think our problem is for child elements that do have top value. For the ones that don't have a top value - we just handle them inline, like we do now. It's the ones with a top that I'm expecting to have to handle manually.

Aah, one exception:

Where an outer-level div has a top (like all 3 of those above) then I can ignore them - since we just make sure the content within that block is consistent.

But if there is another element that is a sister of div id="table" then I do need to investigate it, since the sister could be located higher up than this one.

robintw commented 1 year ago

Yes, I think that's right - although our code generally expects to find divs with a top value, so it's alerting us when there is something which doesn't have a top value. I think it's warning us about the right places at the moment - do you agree? The problem is when there is a mix of things with top values and things without.

The best way to solve this will depend on the exact situation. A couple of examples:

You find a div which has 5 children. 4 of them have top values, one of them doesn't. The easiest thing may be to give the thing that doesn't have a top value a sensible top value so it slots in around the others.
You find a div which has 5 children. One of them has a top value, the other 4 don't. The easiest thing would be to remove the top value from the one that has it, and move it to the right place in the content.

In general, the vast majority of the mock data (and, I assume, the real data) has top values for everything - so we don't run into this problem.

IanMayo commented 1 year ago

Yes, I'm handling example 2.

I'm less clear on example 1. Are we re-sequencing children with top values? I didn't think we were.

I thought we were just using top of top-level elements to ensure we provided the correct anchor ids, and to correctly sequence floating content at very top level.

robintw commented 1 year ago

Yes, we're re-sequencing children of BottomLayer/PageLayer divs that have top values.

For example, for a structure like:

PageLayer1
    div1 - top = 50
    div2 - top = 10
    div3 - top = 100

We will sort these by top values to put them into the DITA file in the order: div2, div1, div3. If we don't do this then we get images and text in the wrong order in the output.

IanMayo commented 1 year ago

Oh that's great. Do we do it down through the child elements too? There are two layers of children in the above screenshot.

And - yes I see how I can use example1 to fix top-less child elements (children just sounded wrong there 🤣 ).

robintw commented 1 year ago

No, we only do it at the level below either PageLayer or BottomLayer

So we will re-order top-level divs, then any children of a PageLayer or BottomLayer div - but not recursively for either of those.

IanMayo commented 1 year ago

Sure, ok. Let me look out for pattern(s) in the real data.

IanMayo commented 1 year ago

I've reviewing the top missing warnings, and the first few I've looked at are for very simple files that just contain a single table:

https://github.com/DeepBlueCLtd/LegacyMan/blob/develop/data/Britain.Legacy/Bravo_Transducer.html#L49

In your description above you explained how the warning could be overcome by adding a top value. Would I add that to the table element? Wouldn't the logic carry on descending the tree until found the next child without a top value?

I think I've got the logic for items that need attention. We descend the PageLayer tree until we get to a level with multiple children. We throw the warning if these children have a mix of with-top and without-top style attributes, but if they are consistently one or the other.

IanMayo commented 1 year ago

Hello @robintw - I've reproduced a file where the first element isn't being processed. The new release does resolve a number of instances where the first element isn't identified, but this instance still breaks: https://github.com/DeepBlueCLtd/LegacyMan/blob/develop/data/Britain_Cmplx/unit_a28.html

robintw commented 1 year ago

I've fixed that in #531. It was actually a case of the 'shopping list' not having the #### First Page Layer marker in it, because it hadn't detected a link to that page without a #anchor in the link. This was because links from category pages weren't being added to the shopping list properly.

We should probably discuss whether we should add a #### First Page Layer entry to all links that we find. At the moment we don't, we have to have a link to the page without any anchors specified (eg. as #anchor). This was to deal with bits of pages that aren't linked from anywhere and therefore shouldn't be converted, but I don't know whether this ever applies for the first part of any page. What do you think?

IanMayo commented 1 year ago

So, with hindsight I should have tested the new PR rather than spent the time mocking up a breaking piece of content 🤣

I look forward to testing this tomorrow. I think "always handle first pagelayer" is a good strategy.. But, I guess we could defer that until I'm able to run more content tests. Each time we pick up more content I have a batch of content fixes to make...

robintw commented 1 year ago

Ah no, I meant I've just fixed it in that PR. It wasn't fixed this morning, don't worry!

IanMayo commented 1 year ago

Closing this issue. I still have lots of relative/absolute layout issues to resolve, but we're picking up lots of content that was being missed before.

DeepBlueCLtd / LegacyMan

Content not parsed #518