PHPOffice / PHPWord

A pure PHP library for reading and writing word processing documents
https://phpoffice.github.io/PHPWord/

HTML to OOXML for TemplateProcessor #1366

Open rkorebrits opened 6 years ago

rkorebrits commented 6 years ago

I've been trying to figure out how I can get OOXML from HTML input, to paste this into the TemplateProcessor. So far I haven't found a "direct" method (e.g. htmlToOOXML), but have been trying to parse the HTML first:

$cv = new \PhpOffice\PhpWord\PhpWord();
$section = $cv->addSection();

\PhpOffice\PhpWord\Shared\Html::addHtml(
    $section, $html
);

and then trying to get the OOXML from the section.

With print_r($section->getPhpWord()); I do seem to be getting my HTML in some kind of PHPWord object, but is there a way to get just the XML for this part?

troosan commented 6 years ago

You'll have to write your PhpWord document to a file. If you then want to retrieve the XML, you'll have to unzip the file, parse the XML and take the part you need. It might be easier to build your document from scratch instead of trying to put everything into a TemplateProcessor.
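The "unzip it, parse the XML and take the part you need" route can be sketched in plain PHP. This is only a sketch under assumptions, not part of PHPWord: `extractBodyXml` is a hypothetical helper name, and it assumes the usual layout in which `word/document.xml` contains a `w:body` whose last child is `w:sectPr`. The XML string itself would typically come from the saved .docx via `ZipArchive::getFromName('word/document.xml')`.

```php
<?php
// Namespace URI behind the "w:" prefix used in word/document.xml.
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Hypothetical helper: returns the serialized children of <w:body>,
// skipping the section properties (<w:sectPr>) that belong to the document.
function extractBodyXml(string $documentXml): string
{
    $dom = new DOMDocument();
    if (!$dom->loadXML($documentXml)) {
        throw new InvalidArgumentException('document.xml is not well-formed');
    }

    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', W_NS);

    $out = '';
    foreach ($xpath->query('/w:document/w:body/*[not(self::w:sectPr)]') as $node) {
        // Serialize each block-level element (paragraphs, tables, ...) in order.
        $out .= $dom->saveXML($node);
    }

    return $out;
}
```

Whether the resulting fragment can be pasted into a template as-is depends on where the placeholder sits, and on whether the fragment refers to styles, numbering or relationships that the template does not define.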

rkorebrits commented 6 years ago

Unfortunately I can't get around the TemplateProcessor: the documents I'm working with are very custom and building them from scratch is not an option.

When documents are compiled, the object is written to XML. Can you give me directions on how I could use that mechanism to write a single section and get the XML from it? Would that be possible? I think a method that parses HTML to OOXML and returns it would be very handy in general. I'm currently using https://github.com/rkorebrits/HTMLtoOpenXML, which works to some extent but isn't great: I still need to pre-process the HTML to remove things like attributes, as they break the output. The HTML support in your library is much better and I would much prefer to use it.

jeffsrepoaccount commented 6 years ago

@rkorebrits I was able to come up with a subclass of the template processor (gist here) that can replace a placeholder in your template with an OOXML AltChunk (see here) with provided markdown. AltChunks initially source their content from a separate file in the archive, but implementing consumers (e.g. MS Word) will pull the content in, convert it to OOXML and replace the content after the document is first opened. Also see this YouTube video by Eric White. HTH.

rkorebrits commented 6 years ago

Thanks @jeffsrepoaccount. However, I'm not sure I'm able to use that. We have a bunch of templates, each with multiple repeating blocks containing HTML from TinyMCE (only bold, italic and lists). This HTML is entered by users and is not markdown, so I really need an HTML to OOXML processor. Not sure if I'm missing something, but I don't think your gist provides a solution for this? Thanks anyway!

jeffsrepoaccount commented 6 years ago

@rkorebrits In the gist the markdown gets converted to HTML (AltChunks support text/html content types, but not text/markdown) and written to a file stored inside of the zip archive. I think all you would need to do is remove the markdown conversion and just inject your HTML markup.

You should be aware of my comment underneath the gist. Simply typing the placeholder search value (like ${replaceMe}) in your template and saving it through Word probably won't be sufficient, since it will in all likelihood wind up inside of a text run element and replacing it with an alt chunk there violates the OOXML schema. In my templates I had to manually edit the document.xml inside the archive to ensure the alt chunks would be placed where they would be valid (which is tedious and far from ideal, but works).
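One way to avoid hand-editing document.xml is to replace the whole paragraph that contains the placeholder, rather than the text inside its run. The following is a minimal sketch under assumptions, not the gist's code and not PHPWord API: `replaceContainingParagraph` is a hypothetical helper, it works on the raw XML string, and it assumes the placeholder has not been split across several runs by Word.

```php
<?php
// Hypothetical helper: swaps the <w:p>...</w:p> that contains $macro
// for $replacement (for example an alt chunk or a set of paragraphs).
function replaceContainingParagraph(string $xml, string $macro, string $replacement): string
{
    $pos = strpos($xml, $macro);
    if ($pos === false) {
        return $xml; // placeholder not present, nothing to do
    }

    // Find the closest paragraph start before the placeholder.
    // '<w:p>' and '<w:p ' are matched explicitly so that '<w:pPr>' is ignored.
    $before = substr($xml, 0, $pos);
    $plain = strrpos($before, '<w:p>');
    $withAttrs = strrpos($before, '<w:p ');
    $start = max($plain === false ? -1 : $plain, $withAttrs === false ? -1 : $withAttrs);

    // Find the first paragraph end after the placeholder.
    $endTag = '</w:p>';
    $end = strpos($xml, $endTag, $pos);

    if ($start < 0 || $end === false) {
        return $xml; // placeholder is not inside a paragraph
    }

    return substr($xml, 0, $start) . $replacement . substr($xml, $end + strlen($endTag));
}
```

Any other text in the same paragraph is lost with this approach, so the placeholder needs a paragraph of its own in the template.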

keepthinking commented 6 years ago

Hi @rkorebrits,

I have been trying to use your script https://github.com/rkorebrits/HTMLtoOpenXML with the TemplateProcessor, but when I use the fromHTML() method with HTML content and then send the content to the template using setValue(), I don't get the formatted text - just the OpenXML text, like:

<w:p><w:r><w:t xml:space='preserve'>Bernd </w:t></w:r><w:r><w:rPr><w:i/></w:rPr><w:t xml:space='preserve'>and</w:t></w:r><w:r><w:rPr></w:rPr><w:t xml:space='preserve'> Hilla </w:t></w:r><w:r><w:rPr><w:b/></w:rPr><w:t xml:space='preserve'>Becher and more</w:t></w:r><w:r><w:rPr></w:rPr><w:t xml:space='preserve'></w:t></w:r></w:p>

This happens for both table cells as well as regular fields/variables.

Am I missing something?

Thanks Cristiano

rkorebrits commented 6 years ago

Hi @keepthinking

You will need to do something along the lines of:

\PhpOffice\PhpWord\Settings::setOutputEscapingEnabled(false);
$this->_template->setValue($field, $html, $limit);
\PhpOffice\PhpWord\Settings::setOutputEscapingEnabled(true);

That library just converts HTML to OOXML, it's not built only for PhpWord integration, so you will need to disable escaping before inserting it.

keepthinking commented 6 years ago

Hi @rkorebrits thank you for your input. I tried doing what you suggested:

\PhpOffice\PhpWord\Settings::setOutputEscapingEnabled(false);
$phpWord->setValue($variable, "<p><strong>Test</strong></p>");
\PhpOffice\PhpWord\Settings::setOutputEscapingEnabled(true);

No errors, but when I open the resulting Word document in Office, Word cannot open it because of 'invalid characters'.

Did you use the library successfully with PHPWord?

Thanks again, Cristiano

(screenshot of the Word error dialog)

rkorebrits commented 6 years ago

@keepthinking

Yeah you need to combine it with that library of mine, they are both separate tools.

$parser = new \HTMLtoOpenXML\Parser();
\PhpOffice\PhpWord\Settings::setOutputEscapingEnabled(false);
$ooXml = $parser->fromHTML('<p><strong>Test</strong></p>');
$phpWord->setValue($variable, $ooXml);
\PhpOffice\PhpWord\Settings::setOutputEscapingEnabled(true); 

Did you use the library successfully with PHPWord?

Very, loads of files were created with this library and it can do quite a bit of basic stuff. Especially nested lists was a lot of work, but works good now.

keepthinking commented 6 years ago

Thanks. Sorry if I didn't make it clear, but I am trying to use your library with PHPWord, specifically with the template processor.

The code I pasted is from a more complex script that generates a document based on a template. But when I use your parser, switch escaping off, send the HTML through your tool and then immediately switch escaping on again, the generated document is corrupted, as per the screenshot.

I am wondering if there’s something obvious I’m missing?

Best. Cristiano

rkorebrits commented 6 years ago

Send some of your code and the HTML you are trying to send in? Must be missing something obvious.. are you sure it doesn't break without adding in the HTML?

keepthinking commented 6 years ago

Thanks Richard. I really appreciate your help. To answer your question, the templates are perfectly filled if I do not try to insert HTML using your library (i.e. using strip_tags).

My code is quite complex and it queries a database to get data. I have greatly simplified it here (removed 95% of it) and just included what is needed to test the behaviour.

The file is meant to run from the Samples directory of PHPWord, so it uses its header and footer. If I send just two variables ($title in HTML and $exhibition in plain text), the resulting document is broken.

Any help as to what I am doing wrong would be greatly welcome.

Best, Cristiano

Attachment: Archive.zip

rkorebrits commented 6 years ago

Okay, so I downloaded your script. It seems you can't combine injected HTML with plain text on the same line in Word. Try it with this file:

Sample_00_3_html-template.docx

I know from experience that you can put HTML in a table just fine, and on a single row, but it seems you can't combine HTML with plain text on the same line of the Word document.

It makes sense actually, especially when you are inserting a paragraph but expect more copy on the same line: when my library generates the OOXML output from HTML, it creates a new Word paragraph.

rkorebrits commented 6 years ago

Did that work @keepthinking ?

beard7 commented 6 years ago

@rkorebrits

That works for me! Brilliant.

Now I just need to start looking through the HTMLtoOpenXML source to see how I can support more HTML (such as unordered lists, colours, font sizes etc.).

keepthinking commented 6 years ago

Dear @rkorebrits ,

Apologies for the delay, my attention was diverted elsewhere. Thank you, and yes, your example makes sense: on any one line it's either ALL plain text or ALL HTML, and the same applies to table cells (tested).

It would be great to have a way to combine the two, for flexibility, but for now we can work around it.

Did you notice that <ul> elements are not supported and get converted to numbered lists, and with continuous numbering? Not sure if that's intentional.

Best Cristiano

rkorebrits commented 6 years ago

@beard7 @keepthinking Numbering is a whole new story. The styling for the numbering is set in numbering.xml; it is not possible to set it anywhere else. What I used to do is first create a document with one list style, unzip the document and make a copy of numbering.xml, then duplicate the style block that you want in that file and copy the XML file back into your template later. A lot of work :-)
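Because list paragraphs point at numbering.xml through `w:numId`, one thing worth checking before injecting generated OOXML is that every referenced id is actually defined in the template. The following is a minimal sketch, assuming you already have both XML strings in hand; `missingNumIds` is a hypothetical helper, not part of PHPWord or HTMLtoOpenXML, and it uses simple pattern matching rather than a full XML parse.

```php
<?php
// Hypothetical helper: returns the numId values used by list paragraphs in
// $fragmentXml that have no matching <w:num w:numId="..."> in $numberingXml.
function missingNumIds(string $fragmentXml, string $numberingXml): array
{
    // References inside paragraphs look like <w:numId w:val='1'/>.
    preg_match_all('/<w:numId\s+w:val=["\'](\d+)["\']/', $fragmentXml, $used);

    // Definitions look like <w:num w:numId="1">; \b keeps <w:numId> and
    // <w:numPr> from matching here.
    preg_match_all('/<w:num\b[^>]*\bw:numId=["\'](\d+)["\']/', $numberingXml, $defined);

    return array_values(array_diff(array_unique($used[1]), $defined[1]));
}
```

An empty result only means the ids resolve; whether they resolve to a bulleted or a numbered definition still depends on the `w:abstractNum` that each `w:num` points to.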

beard7 commented 5 years ago

@rkorebrits I've been experimenting with inserting HTML into template using your OpenXML parser and it's generally working really well.

However, I've now hit a bit of a snag. The documents containing the HTML -> OpenXML content open perfectly well in Word, but the parsed content is missing when the same document is opened in LibreOffice (and OpenOffice).

This wouldn't normally be an issue, but I'm trying to develop a system to convert the documents to PDF on-the-fly using a headless LibreOffice. This mostly works really well, but the resulting PDFs are missing the same content.

I've noticed that if I re-save the documents using Word in Strict Open XML format, they are then perfectly formed in LibreOffice. So I tried saving the template in Strict Open XML format, but that doesn't help.

I guess this is somewhat beyond the scope of this issue, but I'm just looking for pointers.

Thanks

rkorebrits commented 5 years ago

@beard7 Yeah, printing to PDF doesn't work well, but that's just because LibreOffice and OpenOffice don't support a lot of stuff. We dropped print-to-PDF support quite quickly, as our users were all on Windows and could print to PDF from MS Word.

sebgam commented 5 years ago

@rkorebrits Thank you very much, your tool worked perfectly and is a lot of help. Sending a giant greeting from Colombia!

rkorebrits commented 5 years ago

@sebgam I'm glad it helped!

rahal commented 5 years ago

I could convert the html to ooxml using this function

use PhpOffice\PhpWord\Settings as WordSettings;
use PhpOffice\Common\XMLWriter;
use PhpOffice\PhpWord\Writer\Word2007\Element\Container;

// Writes the elements of a section to an in-memory XMLWriter and returns the OOXML string.
function getSectionContent($section)
{
    $xmlWriter = new XMLWriter(XMLWriter::STORAGE_MEMORY, './', WordSettings::hasCompatibility());
    $containerWriter = new Container($xmlWriter, $section);
    $containerWriter->write();
    return $xmlWriter->getData();
}

But I have the same issue as @beard7: the document doesn't work in LibreOffice.

I imported it into office.live.com and it was weird. I could see my content in the preview but not when I opened the file. I could also share the document (read-only share link) and it worked great (I had the headings and all the elements supported by PHPWord) ... Crazy

I don't have MS Word so I couldn't test it.

harbvxz commented 4 years ago

Hello @rkorebrits,

I think this one can solve your problem.

https://blog.mayflower.de/6699-phpword-create-documents.html

barsproger commented 4 years ago

@keepthinking

Yeah you need to combine it with that library of mine, they are both separate tools.

$parser = new \HTMLtoOpenXML\Parser();
\PhpOffice\PhpWord\Settings::setOutputEscapingEnabled(false);
$ooXml = $parser->fromHTML('<p><strong>Test</strong></p>');
$phpWord->setValue($variable, $ooXml);
\PhpOffice\PhpWord\Settings::setOutputEscapingEnabled(true); 

Did you use the library successfully with PHPWord?

Very, loads of files were created with this library and it can do quite a bit of basic stuff. Especially nested lists was a lot of work, but works good now.

I tackled a similar task using your solution and at first it didn't work. I struggled for two hours debugging what was happening. I was sending the generated file directly to the browser, and the official recipe https://phpword.readthedocs.io/en/latest/recipes.html#download-the-produced-file-automatically cuts out my OOXML!

$templateProcessor->save(); // the saved file is good

// Later
$xmlWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'Word2007');
$xmlWriter->save("php://output"); // fails: the OOXML is missing

I solved it simply: echo file_get_contents() on the saved file works fine.

P.S. I'm a beginner in English.
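The workaround above (send the file the TemplateProcessor saved, instead of re-serializing a PhpWord object through a writer) can be sketched as follows. This is a minimal sketch, not the official recipe: `streamDocx` is a hypothetical helper name, and the path is assumed to be whatever `$templateProcessor->save()` returned or the name passed to `saveAs()`.

```php
<?php
// Hypothetical helper: streams an already saved .docx file to the client.
function streamDocx(string $path, string $downloadName): void
{
    if (!is_file($path)) {
        throw new RuntimeException("File not found: $path");
    }

    // Headers can only be set before any output has been sent.
    if (!headers_sent()) {
        header('Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document');
        header('Content-Disposition: attachment; filename="' . $downloadName . '"');
        header('Content-Length: ' . filesize($path));
    }

    // Send the file bytes exactly as they were written to disk.
    readfile($path);
}

// Usage sketch (assumes $templateProcessor is a filled TemplateProcessor):
//   $tempFile = $templateProcessor->save();
//   streamDocx($tempFile, 'report.docx');
//   unlink($tempFile);
```

The likely reason the writer route loses the injected OOXML is that the writer serializes the `$phpWord` object, which does not contain the TemplateProcessor's replacements; treat that as an assumption about the code shown, since the full script is not in the thread.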

strtob commented 2 years ago

Hi everybody,

I've tried to insert a simple stupid html list (<ul><li>) with this code:

$parser = new \HTMLtoOpenXML\Parser();

\PhpOffice\PhpWord\Settings::setOutputEscapingEnabled(false);
$ooXml = $parser->fromHTML($value);
$this->t->setValue($key, $ooXml);
\PhpOffice\PhpWord\Settings::setOutputEscapingEnabled(true);

the $ooXml output is this:

<w:p><w:r><w:t xml:space='preserve'></w:t></w:r></w:p><w:p><w:pPr><w:pStyle w:val='ListParagraph'/><w:numPr><w:ilvl w:val='0'/><w:numId w:val='1'/></w:numPr><w:rPr></w:rPr></w:pPr><w:r><w:rPr></w:rPr><w:t xml:space='preserve'>first text</w:t></w:r></w:p><w:p><w:pPr><w:pStyle w:val='ListParagraph'/><w:numPr><w:ilvl w:val='0'/><w:numId w:val='1'/></w:numPr><w:rPr></w:rPr></w:pPr><w:r><w:rPr></w:rPr><w:t xml:space='preserve'>secone text</w:t></w:r></w:p><w:p><w:r><w:t xml:space='preserve'></w:t></w:r></w:p>

The text is not shown in word (MS & Libre).

I've tried for hours - has anybody an idea what's the problem??? :-(((

Thanks, Toby

fhumanes commented 1 year ago

I have partially resolved this in PHPWord 1.0.0 in the following way:


// ------------------------------HTML "expose" -------------------------------------
$phpWord = new \PhpOffice\PhpWord\PhpWord();
$section = $phpWord->addSection();
\PhpOffice\PhpWord\Shared\Html::addHtml($section, $data['expose'], false, false);
$elements_ar = $section->getElements();
$count = count($elements_ar); // Number of elements generated from the HTML
$templateProcessor->cloneBlock('BEXPOSE',$count, true, true);

for ($i = 1; $i <= $count; $i++) {
    $tag = 'expose#'.$i;
    $templateProcessor->setComplexBlock($tag , $elements_ar[$i-1]);
}

Each paragraph of the HTML produces one "element" object, so you have to clone the block containing the placeholder once per element and then fill each clone with its element.

Template: (image)

veewee commented 11 months ago

@fhumanes Thanks for the example. It's a bit cumbersome, but it does the trick!

(I've changed the implementation to setComplexValue on my end for better results.)

It would be nice though, if it could somehow use a template value directly instead of wrapping a template block. Maybe with a nice little shortcut function in the processor - like $templateProcessor->setHtmlValue($search, $html);

Do you think something like this could be possible?

danielinacio commented 10 months ago

$parser = new \HTMLtoOpenXML\Parser();
\PhpOffice\PhpWord\Settings::setOutputEscapingEnabled(false);
$ooXml = $parser->fromHTML('<p><strong>Test</strong></p>');
$phpWord->setValue($variable, $ooXml);
\PhpOffice\PhpWord\Settings::setOutputEscapingEnabled(true);

Without a doubt, this is the best alternative without having to use those tables and sessions! You are great