How to access all of document contents in correct order?

EvotecIT / OfficeIMO

Fast and easy to use cross-platform .NET library that creates or modifies Microsoft Word (DocX) and later also Excel (XLSX) files without installing any software. Library is based on Open XML SDK

MIT License


How to access all of document contents in correct order? #239

Closed ChrisBellBO closed 2 months ago

ChrisBellBO commented 4 months ago

Hi, I'm looking to use OfficeIMO to import Word documents to another format. I've fairly quickly been able to pull in the text content of a document with the correct structure but I'm now looking at importing tables and images (and any other content). I can see WordParagraph has an IsImage property which I think will give me what I want for images. For tables, I can see there is a Tables property on WordSection but nothing on WordParagraph. Is it possible to figure out where the table is within the document? I've looked at using the OpenXML API directly but that has just made me aware of why you developed this!

PrzemyslawKlys commented 4 months ago

It's hard in its current state because of how it was implemented: Tables and a few other types are stored completely separately from Paragraphs. A Table exists in the Document directly, like a Paragraph, whereas Images, HyperLinks and others are bound to a Paragraph. So I basically bound Paragraphs to the Section, and did the same with Tables and so on, but this causes issues when trying to establish the order.

I believe we need to create another property that stores all Paragraphs, all Tables, etc. in a single List, in the order they are read from the Word document. That should help with maintaining the order without touching the current properties.

It shouldn't be too hard to add. Most of the code is ready in:

https://github.com/EvotecIT/OfficeIMO/blob/master/OfficeIMO.Word/WordSection.PrivateMethods.cs

It just requires merging Lists/SdtBlocks/Paragraphs etc. into one big ordered list of things.

Especially since I wanted to create a DocX-to-PDF library, but without order it's hard ;)

ChrisBellBO commented 4 months ago

This does what I'm after (for the 2 docs I've tested), although it's almost certainly missing some types and it doesn't seem to do a lot of what the methods in WordSection are doing:

public List<object> AllElements()
{
    var list = new List<object>();
    foreach (var element in _wordprocessingDocument.MainDocumentPart.Document.Body.ChildElements)
    {
        if (element is Paragraph)
            list.Add(new WordParagraph(this, element as Paragraph));
        else if (element is Table)
            list.Add(new WordTable(this, element as Table));
        else if (element is SectionProperties)
        {
            // ignore?
        }
        else if (element is SdtBlock)
        {
            // ignore?
        }
        else
            throw new Exception("Unrecognised type - " + element.GetType().Name);
    }
    return list;
}

PrzemyslawKlys commented 4 months ago

I think we should start somewhere, and improve over time. The methods in WordSection simply convert OpenXml objects into OfficeIMO objects for easier management. So they convert a Paragraph to a WordParagraph, which you can then check for an image, hyperlink or whatever else is assigned to it.
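
For anyone consuming such an ordered list, here is a minimal sketch of walking it and dispatching on the element type. It makes several assumptions: `ParagraphItem` and `TableItem` are hypothetical stand-ins for `WordParagraph` and `WordTable` so that the sketch compiles without OfficeIMO, `IsImage` mirrors the property mentioned earlier in the thread, and `RowCount` and `OrderedContentDemo.Describe` are made up for illustration. It is not the code from the linked pull request.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for WordParagraph and WordTable; these are NOT
// OfficeIMO types, only placeholders so the sketch is self-contained.
public sealed class ParagraphItem
{
    public string Text { get; set; } = "";
    public bool IsImage { get; set; }
}

public sealed class TableItem
{
    public int RowCount { get; set; }
}

public static class OrderedContentDemo
{
    // Walks a mixed list in document order and describes each element.
    public static List<string> Describe(IEnumerable<object> elements)
    {
        var lines = new List<string>();
        foreach (var element in elements)
        {
            switch (element)
            {
                // Paragraphs that carry an image are handled first.
                case ParagraphItem { IsImage: true }:
                    lines.Add("image");
                    break;
                case ParagraphItem p:
                    lines.Add("paragraph: " + p.Text);
                    break;
                case TableItem t:
                    lines.Add("table with " + t.RowCount + " rows");
                    break;
                default:
                    // Record unknown types instead of throwing, so an
                    // unsupported element does not abort the whole import.
                    lines.Add("skipped: " + element.GetType().Name);
                    break;
            }
        }
        return lines;
    }
}
```

Passing a text paragraph, a table and an image paragraph to `OrderedContentDemo.Describe` gives three descriptions in that same order, which is the property an importer to another format depends on. Whether unknown types should be skipped or should throw, as in the snippet earlier in the thread, is a design choice: skipping is more forgiving, while throwing surfaces missing type support sooner.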

PrzemyslawKlys commented 2 months ago

Solved by: https://github.com/EvotecIT/OfficeIMO/pull/240