PHPOffice / PHPWord

A pure PHP library for reading and writing word processing documents
https://phpoffice.github.io/PHPWord/
Other
7.27k stars 2.7k forks source link

Block clone suddenly stopped working after applying formatting #867

Open judgej opened 8 years ago

judgej commented 8 years ago

I had a simple block that covered an entire page, and was cloned N times right at the start before substitution variables were filled in with data:

${PAGE}
...content with ${variables}...
<page break>
${/PAGE}

Opening as a template, cloning, putting in data and saving working great.

Then I tried adding some formatting to the source document, starting by setting the entire document to Arial, and the ${PAGE} tags are no longer recognised at all.

Looking in the template source, the lines with the tags had font formatting applied, so I removed that (using the ctrl-space technique that has been around since MS Word 2.0). This gives me a source now without the formatting, that looks like this (saved from MS Word):

<?xml ...?>
    <w:document ...>
        <w:body>
            <w:p ...>
                <w:r>
                    <w:t>${PAGE}</w:t>
                </w:r>
            </w:p>
            ...content...
            <w:r w:rsidR="00D212DB">
                <w:lastRenderedPageBreak/>
                <w:t>${/PAGE}</w:t>
            </w:r>
    ...
--- Want to back this issue? **[Post a bounty on it!](https://www.bountysource.com/issues/37065697-block-clone-suddenly-stopped-working-after-applying-formatting?utm_campaign=plugin&utm_content=tracker%2F323108&utm_medium=issues&utm_source=github)** We accept bounties via [Bountysource](https://www.bountysource.com/?utm_campaign=plugin&utm_content=tracker%2F323108&utm_medium=issues&utm_source=github).
judgej commented 8 years ago

The cloneBlock() method of the template processor is looking for this:

'/(<\?xml.*)(<w:p.*>\${' . $blockname . '}<\/w:.*?p>)(.*)(<w:p.*\${\/' . $blockname . '}<\/w:.*?p>)/is'

That's crudely looking for a <w:p> element wrapping around the block name tag. It does not work because MS Word wraps it in a pair instead. Except it did match, so it seems removing formatting from a line leaves additional layers of elements in.

judgej commented 8 years ago

So I guess what needs to happen is a search for the clone tags in the XML structure, then walking up the tree to the <w:p> elements, then duplicating those elements and everything within them assuming they have a common parent (which should be <w:body>).

Of course NONE of that can be done with a REGEX. The document source is structured XML and needs to be treated as such.

Here is the source of my document that is failing to match the ${PAGE} tags:

clipboard01

judgej commented 8 years ago

Trying this preg match offline to see if I can see the problem, it seems that it is this at the start that is causing it to not match:

(<\?xml.*)

That has me totally baffled, because that does match everything when used on its own '/(<\?xml.*)/is'. I guess it is being too "greedy" and eating up the whole XML and not leaving anything for the remaining rules to match. Replacing that part of the REGEX with (>) fixes it for me temporarily (I'm a NASTY DIRTY HACK!) but I'm going to have to revisit this fix.

judgej commented 8 years ago

Also finding the regex replacements that setValue() uses can be extremely slow. For a 100 page document, it is taking about 0.25 seconds to replace about 20 substitution variables on each page in the document, so that 100 page document is taking a good 15 seconds to generate. I do think there has to be something faster - perhaps just a plain string replace?

Looking at it, a REGEX replace is performed if only replacing some of the fields, otherwise a string replace is done. Ouch

Edit: I just tried a string replace, using strpos() and substr(), which worked, but took about the same amount of time as the preg_replace(). I guess it involves less complex processing, but then more data to move around (the XML string for the 100 pages is about 8Mbyte, so chopping that into smaller strings around the insert text to replacement it, is not fast, and probably uses more memory). This is what I was trying, just for information:

    protected function setValueForPart($search, $replace, $documentPartXML, $limit)
    {
        // Note: we can't use the same function for both cases here, because of performance considerations.
        if (self::MAXIMUM_REPLACEMENTS_DEFAULT === $limit) {
            return str_replace($search, $replace, $documentPartXML);
        } else {
            if (is_string($documentPartXML)) {
                for($i = 1; $i <= $limit; $i++) {
                    $pos = strpos($documentPartXML, $search);

                    if ($pos === false) {
                        return $documentPartXML;
                    }

                    // Chop the string before and after the search text, then join it back
                    // together around the replace text. Still slow for 8M documents!
                    $documentPartXML = substr($documentPartXML, 0, $pos)
                        . $replace .
                        substr($documentPartXML, $pos + strlen($search));
                }

                return $documentPartXML;
            }

            $regExpEscaper = new RegExp();
            return preg_replace($regExpEscaper->escape($search), $replace, $documentPartXML, $limit);
        }
    }
FBnil commented 7 years ago

@judgej About the <xml matching, it hooks there to match only once (preg_match() otherwise would match multiple times). The xml line is in a separate \n line, thus . does not match that (/s for single line fixes that). So Write '/(<\?xml[\s\S]*) instead. I'm planning to remove the regexp for blocks out all together.

But the solution for your block cloning problem is to delete the closing tag, press enter, and write it again. This will ensure it being again inside 1 <w:p>.

JackEllis commented 7 years ago

@FBnil Big fan of you right now, your code fixed the clone block for me!

heroyan commented 5 years ago

I enconter this problem