EvotecIT / OfficeIMO

Fast and easy to use cross-platform .NET library that creates or modifies Microsoft Word (DocX) and later also Excel (XLSX) files without installing any software. Library is based on Open XML SDK
MIT License

Replacing text in the whole document #120

Closed MarianSWA closed 1 year ago

MarianSWA commented 1 year ago

I'm trying to figure out if there is a way to replace some text in the entire document. So far I managed to successfully use this solution:

string xml = document._document.InnerXml;
document._document.InnerXml = xml.Replace("a", "b");
document.Save(false);

But this only replaces text in the document part (not in headers and footers). Is there a more "official" way to do this in the library? If not, what would be the way to do it? Thank you!

PrzemyslawKlys commented 1 year ago

There is:

It's not released yet, but I have added find-and-replace functionality, along with CleanupDocument. The cleanup may be required because Word tends to split text into separate runs even when the formatting is identical.
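To illustrate why run splitting matters for searching, here is a minimal sketch of merging adjacent runs that carry identical formatting. It is not OfficeIMO's CleanupDocument implementation; it works on raw WordprocessingML through System.Xml.Linq, and the names RunMergeSketch and MergeAdjacentRuns are hypothetical. It assumes a run is only worth merging when it holds an optional w:rPr and exactly one w:t.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of OfficeIMO or the Open XML SDK.
public static class RunMergeSketch
{
    public static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // A run is mergeable here only if it holds an optional <w:rPr> and exactly one <w:t>.
    private static bool IsPlainTextRun(XElement e) =>
        e.Name == W + "r"
        && e.Elements().All(c => c.Name == W + "rPr" || c.Name == W + "t")
        && e.Elements(W + "t").Count() == 1;

    // Two runs match when their <w:rPr> blocks are both absent or serialize identically.
    private static bool SameFormatting(XElement a, XElement b)
    {
        string fa = a.Element(W + "rPr")?.ToString(SaveOptions.DisableFormatting) ?? "";
        string fb = b.Element(W + "rPr")?.ToString(SaveOptions.DisableFormatting) ?? "";
        return fa == fb;
    }

    // Folds each plain-text run into the directly preceding one when formatting matches.
    public static void MergeAdjacentRuns(XElement paragraph)
    {
        List<XElement> children = paragraph.Elements().ToList();
        int target = -1; // index of the run currently absorbing text, -1 if none
        for (int i = 0; i < children.Count; i++)
        {
            XElement current = children[i];
            if (!IsPlainTextRun(current)) { target = -1; continue; }

            if (target >= 0 && SameFormatting(children[target], current))
            {
                XElement text = children[target].Element(W + "t");
                text.Value += current.Element(W + "t").Value;
                // Keep leading/trailing spaces of the merged text significant.
                text.SetAttributeValue(XNamespace.Xml + "space", "preserve");
                current.Remove();
            }
            else
            {
                target = i;
            }
        }
    }
}
```

With a paragraph whose runs are "{{pa" and "ram}}" (same formatting) followed by a bold run, the first two collapse into one run containing "{{param}}", while the bold run stays separate.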

MarianSWA commented 1 year ago

Coool, exactly what I was looking for 😄. Any idea when it will be released? I'm also curious: is there an internal InnerXml that gives us access to the entire document XML, including headers, footers and everything? If so, maybe I'm missing something, but in terms of implementation, wouldn't it be easier to call CleanupDocument and then just do a replace in the InnerXml?

PrzemyslawKlys commented 1 year ago

You could. The headers and footers (all of them) are stored in the document, just in their own place.


It probably would be even faster.

My goal, in the end, is for Find/FindAndReplace to be available on each level, so you could target specific tables and search only there. I just needed a starting point and some motivation. Also, I am not sure whether InnerXml won't give you false positives by finding stuff that's not really text.
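The false-positive concern can be shown with a small sketch. It assumes raw WordprocessingML handled through System.Xml.Linq rather than any OfficeIMO API, and the names TextNodeReplaceSketch and ReplaceInTextNodes are hypothetical. A plain string replace of "b" on the XML of a bold run would also rewrite the w:b formatting element; restricting the replace to w:t elements avoids that.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of OfficeIMO or the Open XML SDK.
public static class TextNodeReplaceSketch
{
    public static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Replaces `find` only inside <w:t> elements and returns the number of replacements.
    public static int ReplaceInTextNodes(XElement root, string find, string replacement)
    {
        if (string.IsNullOrEmpty(find))
            throw new ArgumentException("Search text must not be empty.", nameof(find));

        int replaced = 0;
        // ToList() so the tree is not modified while it is being enumerated.
        foreach (XElement text in root.Descendants(W + "t").ToList())
        {
            string value = text.Value;

            // Count non-overlapping ordinal matches, mirroring string.Replace.
            int hits = 0;
            for (int i = value.IndexOf(find, StringComparison.Ordinal);
                 i >= 0;
                 i = value.IndexOf(find, i + find.Length, StringComparison.Ordinal))
            {
                hits++;
            }
            if (hits == 0) continue;

            // Assigning Value stores plain text; markup characters are escaped on save.
            text.Value = value.Replace(find, replacement);
            replaced += hits;
        }
        return replaced;
    }
}
```

Element and attribute names are never touched this way, at the cost of parsing the XML instead of doing one string operation.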

MarianSWA commented 1 year ago

Nice! I also think replacing at any level would be useful. You're right about false positives. For my use case it's not a problem, because we replace only specific texts in a specific format (like {{parameter}}), but the InnerXml approach wouldn't be suitable for every scenario.

My final solution for my use-case (a fast replace of parameters in the entire document + headers/footers) is this:

// Requires: using System.Linq; using DocumentFormat.OpenXml.Packaging; using OfficeIMO.Word;
using (WordDocument document = WordDocument.Load(filePath))
{
    // Merge runs with identical formatting so a placeholder is not split across runs.
    document.CleanupDocument();
    MainDocumentPart mainDocument = document._wordprocessingDocument.MainDocumentPart;
    if (mainDocument is not null)
    {
        // Raw string replace on the XML of the body, then of every header and footer part.
        mainDocument.Document.InnerXml = mainDocument.Document.InnerXml.Replace("{{param}}", "value");
        mainDocument.HeaderParts.ToList().ForEach(h => h.Header.InnerXml = h.Header.InnerXml.Replace("{{param}}", "value"));
        mainDocument.FooterParts.ToList().ForEach(f => f.Footer.InnerXml = f.Footer.InnerXml.Replace("{{param}}", "value"));
    }

    document.Save(false);
}

Maybe it helps someone :)

Thanks @PrzemyslawKlys for this great initiative with this library!
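One caveat with replacing inside InnerXml: the replacement value lands in raw XML, so a value containing & or < would produce invalid markup. A minimal sketch of escaping the value first, assuming the same plain string-replace approach; XmlValueSketch and ReplaceParameter are hypothetical names, while SecurityElement.Escape is a standard .NET method.

```csharp
using System.Security;

// Hypothetical helper, not part of OfficeIMO or the Open XML SDK.
public static class XmlValueSketch
{
    // Replaces a placeholder in raw XML with a value that is XML-escaped first,
    // so characters such as & and < cannot break the surrounding markup.
    public static string ReplaceParameter(string innerXml, string placeholder, string value)
    {
        string escaped = SecurityElement.Escape(value) ?? string.Empty;
        return innerXml.Replace(placeholder, escaped);
    }
}
```

For example, replacing {{param}} with Tom & Jerry in <w:t>{{param}}</w:t> gives <w:t>Tom &amp;amp; Jerry</w:t>, which Word displays as Tom & Jerry.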

PrzemyslawKlys commented 1 year ago

I'll add an example that's "native" to OfficeIMO, just in case someone comes looking. Of course, it's not set in stone that it will stay like this, but I like how it shows whether it found something and how many replacements it made.

using System;
using System.Linq;
using DocumentFormat.OpenXml.Wordprocessing;
using OfficeIMO.Word;

namespace OfficeIMO.Examples.Word {
    internal static partial class FindAndReplace {
        internal static void Example_FindAndReplace01(string folderPath, bool openWord) {
            Console.WriteLine("[*] Creating standard document - Find & Replace");
            string filePath = System.IO.Path.Combine(folderPath, "Basic Document to replace text.docx");
            using (WordDocument document = WordDocument.Create(filePath)) {
                document.AddParagraph("Test Section");

                document.Paragraphs[0].AddComment("Przemysław", "PK", "This is my comment");

                document.AddParagraph("Test Section - another line");

                document.Paragraphs[1].AddComment("Przemysław", "PK", "More comments");

                document.AddParagraph("This is a text ").AddText("more text").AddText(" even longer text").AddText(" and even longer right?");

                document.AddParagraph("This is a text ").AddText("more text 1").AddText(" even longer text 1").AddText(" and even longer right?");
                // we now ensure that we add bold to complicate the search
                document.Paragraphs[9].Bold = true;
                document.Paragraphs[10].Bold = true;

                document.Save(false);
            }

            using (WordDocument document = WordDocument.Load(filePath)) {
                var replacedCount = document.FindAndReplace("Test Section", "Production Section");
                Console.WriteLine("Replaced: " + replacedCount);

                // should be 0 because it stretches over 2 paragraphs
                var replacedCount1 = document.FindAndReplace("This is a text more text", "Shorter text");
                Console.WriteLine("Replaced (should be 0): " + replacedCount1);

                document.CleanupDocument();

                // cleanup should merge paragraphs making it easier to find and replace text
                // this only works for same formatting though
                // may require improvement in the future to ignore formatting completely, but then it's a bit tricky which formatting to apply
                var replacedCount2 = document.FindAndReplace("This is a text more text", "Shorter text");
                Console.WriteLine("Replaced (should be 1): " + replacedCount2);

                document.Save(false);
            }

            using (WordDocument document = WordDocument.Load(filePath)) {

                Console.WriteLine(document.Paragraphs[0].Text == "Production Section" ? "OK" : "FAIL");

                document.Save(openWord);
            }
        }
    }
}

I also wanted to show that if you do Find() it would show you in which areas it found something and how many times. So not only would it return Paragraphs, but also show some statistics or so.

MarianSWA commented 1 year ago

This is everything one could ask for! I think that when two different formattings are applied to the text being replaced, it's OK to skip the replacement. In 99% of cases, what most people replace is a parameter or some text enclosed in other text, so it should have the same formatting. Maybe for Find it could be useful to search in text with multiple formattings, but for replace, not so much.

PrzemyslawKlys commented 1 year ago

Yeah, one thing I am having a hard time with (and this is where CleanupDocument is very handy, but with different formatting it's just not used) is when someone searches for text that spreads across paragraphs. I am not quite sure my brain can handle searching for text that starts in one .Text property and then continues through multiple .Text properties. I will most likely need some help with this :)
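One common way to approach a search that continues through several .Text values is to concatenate them and map each match offset back to the piece it came from. The sketch below assumes the texts are available as a plain list of strings in document order; CrossRunSearchSketch and FindAcrossRuns are hypothetical names, not OfficeIMO API.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of OfficeIMO or the Open XML SDK.
public static class CrossRunSearchSketch
{
    // Returns, for every non-overlapping match of `find` in the concatenated texts,
    // the index of the run where the match starts and the run where it ends.
    public static List<(int FirstRun, int LastRun)> FindAcrossRuns(
        IReadOnlyList<string> runTexts, string find)
    {
        if (string.IsNullOrEmpty(find))
            throw new ArgumentException("Search text must not be empty.", nameof(find));

        var joined = new StringBuilder();
        var starts = new int[runTexts.Count]; // starts[i] = offset of run i in the joined text
        for (int i = 0; i < runTexts.Count; i++)
        {
            starts[i] = joined.Length;
            joined.Append(runTexts[i]);
        }

        string all = joined.ToString();
        var matches = new List<(int FirstRun, int LastRun)>();
        int pos = all.IndexOf(find, StringComparison.Ordinal);
        while (pos >= 0)
        {
            int last = pos + find.Length - 1; // offset of the final matched character
            matches.Add((RunAt(starts, pos), RunAt(starts, last)));
            pos = all.IndexOf(find, pos + find.Length, StringComparison.Ordinal);
        }
        return matches;
    }

    // The run containing an offset is the last run that starts at or before it;
    // empty runs are skipped naturally because the next run shares their start offset.
    private static int RunAt(int[] starts, int offset)
    {
        int run = 0;
        for (int i = 0; i < starts.Length; i++)
        {
            if (starts[i] <= offset) run = i;
        }
        return run;
    }
}
```

For the texts "This is a text ", "more text" and " even longer text", searching for "This is a text more text" yields one match that starts in run 0 and ends in run 1.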

MarianSWA commented 1 year ago

I don't think it's worth implementing right now. It's a very niche thing, and when someone actually needs it, it will be easier to do with real-world examples and expectations of how it should behave. For now, the FindAndReplace you implemented is more than enough for the vast majority of cases, I think 😄

PrzemyslawKlys commented 1 year ago

Yes and no. I would expect someone to search for something like Your company value is 100,000$ in a document where the user, playing with text formatting, has made 100,000 bold and left the rest of the text unformatted. You then have to find 100k and replace it, but only if the full text is Your company value is 100,000$. In this case the text would span two WordParagraphs, or even more if they play a lot with how it's formatted. It's a niche thing, but I can see it happening, so I'm just leaving it as something to think about.
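A rough sketch of how such a replacement could behave, under one possible design choice: the replacement text goes into the first run touched by the match (so it inherits that run's formatting), and the matched characters are removed from the remaining runs. The texts are modeled as a plain list of strings; CrossRunReplaceSketch and ReplaceAcrossRuns are hypothetical names, not OfficeIMO API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of OfficeIMO or the Open XML SDK.
public static class CrossRunReplaceSketch
{
    // Replaces every match of `find` in the concatenated run texts, even when a match
    // spans several runs. Returns the number of replacements made.
    public static int ReplaceAcrossRuns(IList<string> runTexts, string find, string replacement)
    {
        if (string.IsNullOrEmpty(find))
            throw new ArgumentException("Search text must not be empty.", nameof(find));

        int count = 0;
        int searchFrom = 0;
        while (true)
        {
            string all = string.Concat(runTexts);
            int pos = all.IndexOf(find, searchFrom, StringComparison.Ordinal);
            if (pos < 0) return count;

            int matchEnd = pos + find.Length; // exclusive end offset of the match
            int runStart = 0;
            bool replacementWritten = false;
            for (int i = 0; i < runTexts.Count; i++)
            {
                string text = runTexts[i];
                int runEnd = runStart + text.Length;
                int from = Math.Max(pos, runStart);
                int to = Math.Min(matchEnd, runEnd);
                if (from < to) // this run overlaps the match
                {
                    // Only the first overlapping run receives the replacement text.
                    string insert = replacementWritten ? "" : replacement;
                    runTexts[i] = text.Substring(0, from - runStart)
                                  + insert
                                  + text.Substring(to - runStart);
                    replacementWritten = true;
                }
                runStart = runEnd;
            }

            // Continue after the inserted text so the replacement is never re-scanned.
            searchFrom = pos + replacement.Length;
            count++;
        }
    }
}
```

With the runs "Your company value is ", "100,000" and "$", replacing "100,000$" with "100k" leaves "Your company value is ", "100k" and an empty run, so the new value keeps the bold formatting of the run where the match began. Whether that is the right formatting to keep is exactly the open question raised above.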