OfficeDev / Open-Xml-PowerTools

MIT License
692 stars 26 forks source link

RevisionAccepter.AcceptRevisions may strip non-content nodes from document (bookmarks & others) #257

Closed DoctorVanGogh closed 5 years ago

DoctorVanGogh commented 6 years ago

It seems RevisionAccepter.AcceptRevisions will 'eat' certain markup fragments. I'd bet this is the same root cause as #227.

Our document fragment was

...
        <w:p w:rsidR="00DB00F9" w:rsidRPr="00DB00F9" w:rsidRDefault="00DB00F9" w:rsidP="00DB00F9">
            <w:pPr>
                <w:rPr>
                    <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
                    <w:sz w:val="20"/>
                </w:rPr>
            </w:pPr>
        </w:p>
        <w:bookmarkEnd w:id="0"/>
        <w:bookmarkEnd w:id="7"/>
        <w:p w:rsidR="00DB00F9" w:rsidRPr="00DB00F9" w:rsidRDefault="00DB00F9" w:rsidP="00DB00F9">
            <w:pPr>
                <w:rPr>
                    <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
                    <w:sz w:val="20"/>
                </w:rPr>
            </w:pPr>
        </w:p>
...

(Note the two bookmark pointers in between the paragraphs) This will get 'accepted' as:

...
        <w:p>
            <w:pPr>
                <w:rPr>
                    <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial" />
                    <w:sz w:val="20" />
                </w:rPr>
            </w:pPr>
        </w:p>
        <w:p>
            <w:pPr>
                <w:rPr>
                    <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial" />
                    <w:sz w:val="20" />
                </w:rPr>
            </w:pPr>
        </w:p>
...

Note the bookmark references are now missing

Some extended breakpoint stepping points to RevisionProcessor.AcceptDeletedAndMoveFromParagraphMarksTransform as the culprit, specifically the loop/logic at https://github.com/OfficeDev/Open-Xml-PowerTools/blob/f7d36ead3345c255f2b48030944009a9f35850ce/OpenXmlPowerTools/RevisionProcessor.cs#L2109 and the grouping logic inside IterateBlockContentElements. Seems this code completely ignores (and skips!) any interleaved non p or tbl tags between elements.

My gut feeling says this requires "chunking" into contiguous block content sequences (broken at "other" element occurrences) before zipping elements up (and I admittedly do not follow the whole paragraph/table combination logic).

ajan1966 commented 6 years ago

Had the same problem. I did some investigation, imho it is sufficent to change line 1861 in RevisionProcessor .FirstOrDefault(e => e.Name == W.p || e.Name == W.tbl); to .FirstOrDefault(); I changed it and my tests did not show any other Problems.

ThomasBarnekow commented 6 years ago

@DoctorVanGogh and @ajan1966, in many cases, the markup is simplified before doing the actual processing, which then does not consider those types of elements that were removed during the markup simplification. You should generally look at whether or not the MarkupSimplifier class is used and, if it is used, check the settings. You might see something like:

var settings = new SimplifyMarkupSettings 
{ 
    RemoveBookmarks = true,
    [...]
};

Next to bookmarks, you might also find that other potentially relevant elements (e.g., content controls, fields, hyperlinks) are removed to make processing easier. This will lead to further issues. For example, removing fields (or field codes) means that all automatic cross-references will be removed, which makes the resulting document useless in many use cases.

DoctorVanGogh commented 6 years ago

@ThomasBarnekow Thanks for the hint, but that's not it. The RevisionAccepter and MarkupSimplifier are distinct pieces of code which do not call into each other. (Although there is some functional overlap insofar as the MarkupSimplifier can strip revision meta information).

Funnily enough our code uses both of those clases. Alas, the issue is in the RevisionAccepter.

ajan1966 commented 6 years ago

RevisionAcceptor calls Line 1322 AcceptRevisionsForPart(OpenXmlPart part), which calls AcceptDeletedAndMoveFromParagraphMarks. AcceptDeletedAndMoveFromParagraphMarks itself calls AcceptDeletedAndMoveFromParagraphMarksTransform. AcceptDeletedAndMoveFromParagraphMarksTransform builds and returns newBlockLevelContainer based on the Collection returned from IterateBlockContentElements. IterateBlockContentElements only respects elements, which are previously annotated via AnnotateBlockContentElements(element); And there it happens: because in Line 1837 and 1861 the processing of annotations is restricted via .FirstOrDefault(e => e.Name == W.p || e.Name == W.tbl); only these two elements get processed further. As far as I understand the code and as far as my testing with various real-life files goes, there are no other side-effects in changing it to .FirstOrDefault(); than the desired effect to keep any element in the document.

tomjebo commented 5 years ago

Closing all issues as this repo is being archived and will no longer be maintained by Microsoft. The project is licensed for continued use and development by forking to your own repo.