EvotecIT / OfficeIMO

Fast and easy to use cross-platform .NET library that creates or modifies Microsoft Word (DocX) and later also Excel (XLSX) files without installing any software. Library is based on Open XML SDK

Any way to insert rich text HTML content? #203

Closed: tmpmachine closed this 5 months ago

tmpmachine commented 5 months ago

I have a requirement to insert rich text content, and I'm currently trying Html Agility Pack, parsing and traversing the DOM.

Just to confirm: OfficeIMO still doesn't have a feature to insert HTML as Word document elements, right? I've checked embedding, but it seems to insert only the plain text of the .html file.

If so, am I on the right path? Is there any way other than traversing the DOM manually? Maybe some existing work or solution that's compatible with OfficeIMO?

PrzemyslawKlys commented 5 months ago

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using OfficeIMO.Word;

namespace OfficeIMO.Examples.Word {
    internal static partial class Embed {

        public static void Example_EmbedFileHTML(string folderPath, string templateFolder, bool openWord) {
            Console.WriteLine("[*] Creating standard document with embedded HTML file");
            string filePath = System.IO.Path.Combine(folderPath, "EmbeddedFileHTML.docx");
            string htmlFilePath = System.IO.Path.Combine(templateFolder, "SampleFileHTML.html");
            using (WordDocument document = WordDocument.Create(filePath)) {
                Console.WriteLine("Embedded documents in word: " + document.EmbeddedDocuments.Count);
                Console.WriteLine("Embedded documents in Section 0: " + document.Sections[0].EmbeddedDocuments.Count);

                document.AddParagraph("Add HTML document in DOCX");

                document.AddSection();

                Console.WriteLine("Embedded documents in Section 1: " + document.Sections[1].EmbeddedDocuments.Count);

                document.AddEmbeddedDocument(htmlFilePath);

                document.EmbeddedDocuments[0].Save("C:\\TEMP\\EmbeddedFileHTML.html");

                Console.WriteLine("Embedded documents in word: " + document.EmbeddedDocuments.Count);
                Console.WriteLine("Embedded documents in Section 0: " + document.Sections[0].EmbeddedDocuments.Count);
                Console.WriteLine("Embedded documents in Section 1: " + document.Sections[1].EmbeddedDocuments.Count);
                Console.WriteLine("Content type: " + document.EmbeddedDocuments[0].ContentType);

                document.Save(openWord);
            }
        }
    }
}

This worked for me when I tried it.

tmpmachine commented 5 months ago

Well, yeah, but I was expecting the content to be rendered like the second line here: [screenshot]

Or is it just me? I'm using WPS Office; I don't have MS Word.

PrzemyslawKlys commented 5 months ago

I tested it on Word and had this working.

When I run it:

[screenshot]

This is the HTML it's embedding:

[screenshot]

And this is Word:

[two screenshots]

So it clearly works in Word. Keep in mind that embedding just puts the file into a special structure in the XML, and the whole "hard work" is done by Word when displaying it. Maybe the problem is that WPS Office requires some changes to "trigger" that embedding.

PrzemyslawKlys commented 5 months ago

For example, notice that some things require special fixes to open properly.

Maybe you could create a Word document in WPS Office (whatever that is) and compare the differences.

tmpmachine commented 5 months ago

> So it clearly works in Word. Keep in mind that embedding just puts the file into a special structure in the XML, and the whole "hard work" is done by Word when displaying it. Maybe the problem is that WPS Office requires some changes to "trigger" that embedding.

That must be it. I asked a friend to open the file, and it required installing some plugins or something, so I decided not to go with embedding.

There's this library that can convert HTML to Open XML: html2openxml.

The collection of elements that html2openxml returns from parsing is somewhat connected to DocumentFormat.OpenXml.Wordprocessing.Paragraph.

[screenshot]

using System;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using HtmlToOpenXml;

// htmlText is the HTML string to convert (defined elsewhere)
string filepen = $"C:\\Users\\tmp7\\Desktop\\penman-{Guid.NewGuid()}.docx";
using (var package = WordprocessingDocument.Create(filepen, WordprocessingDocumentType.Document))
{
    package.AddMainDocumentPart();

    var mainPart = package.MainDocumentPart;
    mainPart.Document = new Document();
    var body = new Body();

    var sectionProp = new SectionProperties();
    var pageSetup = new PageMargin() { Top = 1701, Left = 1134, Right = 1134, Bottom = 850 };
    sectionProp.Append(pageSetup);

    var converter = new HtmlConverter(mainPart);

    var para = converter.Parse(htmlText);

    // created here but not applied to anything in this snippet
    var runProp = new RunProperties();
    runProp.Append(new Bold(), new FontSize() { Val = "32" });

    var paragProp = new ParagraphProperties();
    var justif = new Justification() { Val = JustificationValues.Center };
    paragProp.Append(justif);

    foreach (var item in para)
    {
        // <-- item is somewhat connected to DocumentFormat.OpenXml.Wordprocessing.Paragraph
        body.Append(item);
    }

    // section properties go last: sectPr has to be the final child of the body
    body.Append(sectionProp);
    mainPart.Document.Append(body);
}

Then, I found that in WordParagraph.cs, Paragraph is DocumentFormat.OpenXml.Wordprocessing.Paragraph.

...
using OfficeMath = DocumentFormat.OpenXml.Math.OfficeMath;
using Paragraph = DocumentFormat.OpenXml.Wordprocessing.Paragraph;
using ParagraphProperties = DocumentFormat.OpenXml.Wordprocessing.ParagraphProperties;
...

I still can't quite figure out how to get these converted elements into a WordParagraph. Is there a way to take the XML and append it directly to the .docx?

PrzemyslawKlys commented 5 months ago

I know this library and use it in PSWriteOffice in PowerShell.

We do expose the WordprocessingDocument on the document, so you can append things directly to the body if you wish.

[screenshot]

In this case Copilot shows a way, but you can just append whatever you create to it.

tmpmachine commented 5 months ago

Awesome! Got it working now, thanks!

I'm still using html2openxml, though link elements aren't working in my case. That could be a WPS Office-only issue; I'll check later.

[screenshot]

using (WordDocument doc = WordDocument.Create(outputPath))
{

    ...

    foreach (var item in para)
    {
        doc._document.MainDocumentPart.Document.Body.Append(item);
    }

    ...

}

tmpmachine commented 4 months ago

For future reference: if anyone is looking for a way to append HTML under a list, you can create a table, set the table indent, and put the parsed result into the table.

using HtmlToOpenXml;

// ....

// # create a single cell table
Table table = new Table();
var tableProperties = new TableProperties(new TableBorders(new TopBorder(), new BottomBorder(), new LeftBorder(), new RightBorder(), new InsideHorizontalBorder(), new InsideVerticalBorder())) { 
    TableIndentation = new TableIndentation() {
        Width = (int)CentimetersToTwips(1.24), // adjust to the list item indent
    }
};
table.AppendChild(tableProperties);

var row = new TableRow();
var cell = new TableCell();
row.Append(cell);
table.Append(row);

// # append table to document body
doc._document.MainDocumentPart.Document.Body.Append(table);

// # parse html
var converter = new HtmlConverter(doc._document.MainDocumentPart);

// # the table has a single row and a single cell, so append each parsed element to that cell
foreach (OpenXmlCompositeElement item in converter.Parse(htmlText))
{
    cell.Append(item);
}
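
The snippet above calls CentimetersToTwips without showing its definition. Here is a minimal sketch of what such a helper might look like, assuming it only converts units (1 inch = 1440 twips = 2.54 cm). The class name Units and the rounding to the nearest twip are assumptions made for illustration, not something taken from this thread or from OfficeIMO.

```csharp
using System;

public static class Units
{
    // 1 inch is 1440 twips and 2.54 centimeters.
    private const double TwipsPerInch = 1440.0;
    private const double CentimetersPerInch = 2.54;

    // Hypothetical helper: converts centimeters to twips, rounded to the
    // nearest whole twip so that a later (int) cast does not truncate downward.
    public static double CentimetersToTwips(double centimeters)
    {
        return Math.Round(centimeters * TwipsPerInch / CentimetersPerInch);
    }
}
```

With `using static Units;` in scope, `(int)CentimetersToTwips(1.24)` evaluates to 703, which is the value that would end up in the table indentation width.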