dotnet / Open-XML-SDK

Open XML SDK by Microsoft
https://www.nuget.org/packages/DocumentFormat.OpenXml/
MIT License
3.99k stars 544 forks source link

Excel workbook generated by Open-XML-SDK are converted to XLSM by Google spread sheet. #1093

Open Yanal-Yves opened 2 years ago

Yanal-Yves commented 2 years ago

Description First I would like to thank you for this library.

I am posting here a wired behavior when generating an Excel file with Open XML SDK and uploading the file to Google spreadsheet. I believe this is an oddity caused by a bug in Open XML SDK. This is why I am posting here. I don't think this is a question about using Open XML SDK. But in case I'am wrong, I have also posted to Stack overflow :

So I am using the Open XML SDK 2.14.0 to create Excel files. If I upload the XLSX file produced by Open XML SDK to Google Spreadsheet, it opens the file as an XLSX file but if I make any change in the spreadsheet then Google Spreadsheet changes the file type to XLSM. If I generate an empty workbook from MS Excel (Office 360) or Libre Office Calc I don't have this issue : Google Spreadsheet will not change the file type if I change the data.

Information

Repro

The minimalistic code bellow (found in the official documentation) reproduces the issue. It generate a foo.xlsx file in C:\Temp.

var filepath = @"C:\temp\foo.xlsx";

// Create a spreadsheet document by supplying the filepath.
// By default, AutoSave = true, Editable = true, and Type = xlsx.
SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Create(filepath, SpreadsheetDocumentType.Workbook);

// Add a WorkbookPart to the document.
WorkbookPart workbookpart = spreadsheetDocument.AddWorkbookPart();
workbookpart.Workbook = new Workbook();

// Add a WorksheetPart to the WorkbookPart.
WorksheetPart worksheetPart = workbookpart.AddNewPart<WorksheetPart>();
worksheetPart.Worksheet = new Worksheet(new SheetData());

// Add Sheets to the Workbook.
Sheets sheets = spreadsheetDocument.WorkbookPart.Workbook.AppendChild<Sheets>(new Sheets());

// Append a new worksheet and associate it with the workbook.
Sheet sheet = new Sheet() { Id = spreadsheetDocument.WorkbookPart.GetIdOfPart(worksheetPart), SheetId = 1, Name = "mySheet" };
sheets.Append(sheet);

workbookpart.Workbook.Save();

// Close the document.
spreadsheetDocument.Close();

Observed

If I upload the generated file to Google Spreadsheet, it shows this is an XLSX file.

enter image description here

But if I change anything in the workbook then Google Spreadsheet changes the file type to XLSM.

enter image description here

Expected

Open XML SDK generates the correct XLSX file so that Google spreadsheet does not change to XLSM when saving the first time.

twsouthwick commented 2 years ago

@tomjebo Any thoughts?

tomjebo commented 2 years ago

@Yanal-Yves did this happen with previous versions of the SDK? I'll test it out and see if I can tell what's causing the change to a macro type workbook.

tomjebo commented 2 years ago

I just created workbook (.xlsx) using your code. OneDrive doesn't do this conversion to .xlsm. Google Drive does. They must have some security constraint that kicks in. I'll see if I can tell what that is but it may be something as simple as inconsistent attributes or some other factor.

Yanal-Yves commented 2 years ago

Hi, thank you for your answer. I have juste tested with version 2.13.1 of the SDK and I can observer the same behavior.

Yanal-Yves commented 2 years ago

I have also tested with version 2.5.0 and I can observer the same behavior too.

Yanal-Yves commented 2 years ago

I have also tested with version 1.0.0 and I can observer the same behavior too.

tomjebo commented 2 years ago

Yeah, looks like the workbook generated is fine but definitely that code would not include several parts that Excel does when creating a new, fresh workbook. The parts not included by the SDK genned file are: docprop\app.xml, docprop\core.xml, xl\sharedstrings.xml, xl\styles.xml, and xl\theme\theme1.xml. There are some differences in the workbook.xml, workbook.xml.rels, [content_types].xml and sheet1.xml but they all look reasonable and should not cause anything like a filetype change AFAIK. We might have to ask Google about this. Otherwise, we can just start adding in the other parts and see when Sheets likes the file as .xlsx.

tomjebo commented 2 years ago

@Yanal-Yves I just added this line:

CoreFilePropertiesPart coreDocProps = spreadsheetDocument.AddCoreFilePropertiesPart();

And it seemed to fix the behavior. Can you try it?

Yanal-Yves commented 2 years ago

I've just tested with

var filepath = @"C:\temp\foo.xlsx";

// Create a spreadsheet document by supplying the filepath.
// By default, AutoSave = true, Editable = true, and Type = xlsx.
SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Create(filepath, SpreadsheetDocumentType.Workbook);
spreadsheetDocument.AddCoreFilePropertiesPart();

// Add a WorkbookPart to the document.
WorkbookPart workbookpart = spreadsheetDocument.AddWorkbookPart();
workbookpart.Workbook = new Workbook();

// Add a WorksheetPart to the WorkbookPart.
WorksheetPart worksheetPart = workbookpart.AddNewPart<WorksheetPart>();
worksheetPart.Worksheet = new Worksheet(new SheetData());

// Add Sheets to the Workbook.
Sheets sheets = spreadsheetDocument.WorkbookPart.Workbook.AppendChild<Sheets>(new Sheets());

// Append a new worksheet and associate it with the workbook.
Sheet sheet = new Sheet() { Id = spreadsheetDocument.WorkbookPart.GetIdOfPart(worksheetPart), SheetId = 1, Name = "mySheet" };
sheets.Append(sheet);

workbookpart.Workbook.Save();

// Close the document.
spreadsheetDocument.Close();

And I can confirm I don't have the issue anymore.

Please note that the call to spreadsheetDocument.AddCoreFilePropertiesPart(); should be done early. If done just before workbookpart.Workbook.Save();, the issue is still there.

@tomjebo, I'd like to thank you for your help.

For information I have submitted an issue to Google : https://issuetracker.google.com/issues/210875597

Yanal-Yves commented 2 years ago

BTW I have also answered the Stack Overflow question : https://stackoverflow.com/questions/70319867/avoid-google-spreadsheet-to-convert-an-xlsx-file-created-by-open-xml-sdk-to-xlsm/70371638#70371638

tomjebo commented 2 years ago

@Yanal-Yves thanks for confirming and yes, I added the core part right after creating the document. I think we will consider adding this (and possibly the other standard parts) to the creation process. I'll create a PR for this and we can review and discuss but it looks like it should be straightforward.

tomjebo commented 2 years ago

Although it makes some sense for us to create workbooks with these common parts, I think this Google Sheets behavior is a little strange. I don't know what the rationale for it is and it should probably be reported to Google.

Yanal-Yves commented 2 years ago

Hi @tomjebo,

Calling AddCoreFilePropertiesPart() results in an empty core.xml file. And Excel throw an error.

image

The open XML SDK Productivity Tool for Microsoft Office also cannot not open the file generated:

image

Thanks to this doc , I was able to work around the issue.

spreadsheetDocument.AddCoreFilePropertiesPart();
using (XmlTextWriter writer = new XmlTextWriter(coreFilePropPart.GetStream(FileMode.Create), System.Text.Encoding.UTF8))
{
  writer.WriteRaw("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n<cp:coreProperties xmlns:cp=\"https://schemas.openxmlformats.org/package/2006/metadata/core-properties\"></cp:coreProperties>");
  writer.Flush();
}

IMHO:

  1. This is a workaround, is there a proper way to avoid having an empty core.xml ?
  2. I believe that AddCoreFilePropertiesPart() should not produce an inconsistent XLSX state and should create the root element.
tomjebo commented 2 years ago

Interesting, I'll have to test this again. It didn't throw an error for me. Are you using an older version of Excel?

tomjebo commented 2 years ago

Ok, tested and yes, you're right, it's empty. It satisfied the Google Sheets test but didn't satisfy Excel webapp or desktop. What you're doing might be needed for now because I don't see that we have the root element defined in the SDK. The part is defined in a bare bones manner. I'll add this to our "enhancement" project list which will be to build out "empty" documents for Word, PowerPoint and Excel to mimic what the Office apps do to some better degree.

tomjebo commented 2 years ago

@Yanal-Yves I created PR #1107 to test this out in the Create function. If it's accepted we can use this for now and add other parts if we think they're needed later.

ThomasBarnekow commented 2 years ago

@Yanal-Yves, there is no bug in the Open XML SDK. What everybody needs to understand is that the Open XML SDK provides a super-low level API for creating Open XML packages and markup. Therefore, by design, developers will have to:

If you omit any of the above steps, it will not work because the SDK has not been used as designed. Thus, looking at your code, the error is not to add the Open XML markup to the CoreFilePropertiesPart.

I would also recommend leveraging using statements when creating or opening documents, in which case you don't have to save and close those documents explicitly. You might see that in examples, but that's another issue ;-).

Unfortunately, the Open XML SDK does not include strongly-typed classes for the CoreFilePropertiesPart. For that reason, there is also no generated code in the DocumentFormat.OpenXml.Linq project and namespace. We should either add the required schema to generate the code or add the following to the project (with some documentation, of course):

public static class CP
{
    public static readonly XNamespace cp =
        "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";

    public static readonly XName category = cp + "category";
    public static readonly XName contentStatus = cp + "contentStatus";
    public static readonly XName contentType = cp + "contentType";
    public static readonly XName coreProperties = cp + "coreProperties";
    public static readonly XName keywords = cp + "keywords";
    public static readonly XName lastModifiedBy = cp + "lastModifiedBy";
    public static readonly XName lastPrinted = cp + "lastPrinted";
    public static readonly XName revision = cp + "revision";
}

Using the Linq-to-XML way, you could then do the following:

var filepath = @"C:\temp\foo.xlsx";

// Create an empty SpreadsheetDocument with no OpenXmlParts.
// With the using statement, the document will be saved automatically once you leave the scope.
using var spreadsheetDocument = SpreadsheetDocument.Create(filepath, SpreadsheetDocumentType.Workbook);

// Add a CoreFilePropertiesPart without any content.
CoreFilePropertiesPart part = spreadsheetDocument.AddCoreFilePropertiesPart();

// Set the part contents, i.e., an empty cp:coreProperties element with no actual properties.
// This specific call will also save the contents to the part.
part.SetXElement(new XElement(CP.coreProperties, new XAttribute(XNamespace.Xmlns + "cp", CP.cp)));

Whether or not the above should be simpler is a different question. For example, you might want to have something like the following:

using var spreadsheetDocument = SpreadsheetDocument.CreateMinimalPackage(filePath);

This would "magically" create a package that fulfills the minimum requirements defined by the standard or is close to what Excel might produce. The same method would have to exist for other subtypes of OpenXmlPackage.

We could also create a feature that might offer the following:

using var spreadsheetDocument = SpreadsheetDocument.Create(filepath, SpreadsheetDocumentType.Workbook);

// Call an extension method provided by a feature.
spreadsheetDocument.AddMinimumContent();
Yanal-Yves commented 2 years ago

@ThomasBarnekow,

Thank you for taking time to answer me, for your explanation about the philosophy of this library and for pointing out how to use Linq to XML to add the Open XML markup to the OpenXmlPart.

I am an Open XML SDK newbie and you probably know much more than I do about this SDK but I still feel wired to have methods (that seems high level) that produces inconsistent documents. I feel it's so error prone that even the official documentation does not provide the right code snippet to create a minimal empty document correctly. The code I used in this issue is from that documentation. What's the purpose of having a Create method that's not self sufficient ? My guess is that every developer that creates a document wants to start with a consistent minimal document. Again I don't know all your use case and I may be wrong.

Thank you for mentioning the missing using. The code I provided is a copy / past from the official documentation ;-) Thankfully in my production code I double checked and I've already noticed that SpreadsheetDocument is implementing IDisposable and didn't forget the using keyword.

Sorry to ask but I couldn't find the SetXElement method in the DocumentFormat.OpenXml package. I can see the OpenXmlPartRootXElementExtensions class in the Github but couldn't find that class when I decompile the DocumentFormat.OpenXml package with JetBrains dotPeek. Did I miss something. What NuGet Package / namespace import should I use to get the SetXElement extension method ?

tomjebo commented 2 years ago

@Yanal-Yves

The code I used in this issue is from that documentation. What's the purpose of having a Create method that's not self sufficient ?

Just a reminder that the behavior Google Sheets is showing is not completely understood. Otherwise, this has not been a problem with Microsoft Office (web or desktop) or any other apps AFAIK. I don't see that those parts are required by the standard, looking at OPC (29500-2) or markup standard (29500-1) or our extensions and implementer notes. We'll investigate this but I think this is ultimately an enhancement for now and also we should follow up with Google to find out what the real story is with their Sheets app behavior.

Yanal-Yves commented 2 years ago

@tomjebo,

OK understood, thank you for pointing that out.

As I mentioned earlier a call to AddCoreFilePropertiesPart is creating an inconsistent XLSX that causes Excel to ask to repair the file because it creates an emtpy core.xml. Isn't this a bug ?

I read the code of #1107 and maybe instead of calling AddCoreFilePropertiesPart in the Create method, you can add the root element to core.xml when AddCoreFilePropertiesPart is called ? So Create behavior stays unchanged and AddCoreFilePropertiesPart stops producing inconsistent documents.

ThomasBarnekow commented 2 years ago

@Yanal-Yves, the SetXElement method is one of the extension methods provided by the DocumentFormat.OpenXml.Linq project in this repository. It should have been published as a NuGet package together with DocumentFormat.OpenXml v2.15.0. Publishing somehow failed due to a permission issue, but it will be published shortly. In the meantime, you can get it through the repo's custom feed as described under Daily Builds on the main page.

To use the SetXElement and corresponding extension methods, you'd have to add a package reference to the DocumentFormat.OpenXml.Linq package (version 2.15.0). The methods are provided by a public static class in the DocumentFormat.OpenXml.Packaging namespace, which also contains the SpreadsheetDocument class, for example. Thus, if you have the using DocumentFormat.OpenXml.Packaging; required to use the SpreadsheetDocument, you are good to go. However, you do need to reference the DocumentFormat.OpenXml.Linq NuGet package.

As I mentioned earlier a call to AddCoreFilePropertiesPart is creating an inconsistent XLSX that causes Excel to ask to repair the file because it creates an emtpy core.xml. Isn't this a bug ?

No, it is not a bug. It's a feature. Calling methods like AddWorkbookPart, AddWorksheetPart, or AddCoreFilePropertiesPart always just adds an empty part (think XML file for now) to a package, i.e., a ZIP file having a well-defined structure (think folders, files, and metadata). Adding or changing the file contents is a separate step. Just creating a part is similar to creating a FileStream and then not writing any data to it.

So Create behavior stays unchanged and AddCoreFilePropertiesPart stops producing inconsistent documents.

See above. Changing the AddCoreFilePropertiesPart method would have far-reaching consequences. Firstly, we would have to make sure it does not constitute a breaking behavioral change. Secondly, we would have to change that behavior for all other methods like it (and there are a lot of those across all package types).

Using the Open XML PowerTools with a coming update to the Linq-to-XML feature as an example, you might see the following approach:

using DocumentFormat.OpenXml.Linq;
using DocumentFormat.OpenXml.Packaging;

using var wordDocument = WordprocessingDocument.Create(...);

// Add an empty part with no content.
MainDocumentPart part = wordDocument.AddMainDocumentPart();

// Get the reference to the currently empty XDocument representation of the part.
XDocument partXDocument = part.GetXDocument();

// Add (!) the root XElement, i.e., a w:document element, to the XDocument.
partXDocument.Add(
    new XElement(W.document, 
        new XAttribute(XNamespace.Xmlns + "w", W.w),
        new XElement(W.body,
            new XElement(W.p))));

The above is one approach that is used in the Open XML PowerTools and likely mimicked by many users. This would break if the AddMainDopcumentPart suddenly created a part with content rather than an empty part.

There are certainly other approaches to initializing the contents of a part that would not break but simply overwrite the existing content.

However, we are coming from the incorrect use of the admittedly low-level Open XML SDK. That incorrect use might have been informed by an incorrect example. So let's use the Open XML SDK as designed and consider correcting the incorrect examples.

Let's also understand that Open XML is extremely large and complex. The foundation document of the standard alone has north of 5,000 pages. To properly use the low-level Open XML SDK, you need to understand Open XML as well. Otherwise, you will not understand which parts and elements to use. Once you understand Open XML, the SDK will make sense to you. So welcome to a very interesting journey.

tomjebo commented 2 years ago

There were some additional comments on PR #1107 about how to resolve this. We will close PR #1107 so I'm copying the substantive comments here. Thanks @ThomasBarnekow for these comments:


ThomasBarnekow:

@tomjebo, having looked at your project description, I wanted to comment on your points here (as I did not see a way to do that in the project).

Won't add some of the basic parts that some apps like Google Sheets expect. core.xml, app.xml, etc...

This is covered by my comment above regarding the "minimal" documents. It would be a very good idea to add this to the SDK. While other libraries have filled at least some of the gaps, I believe this is so basic that it should be done by the SDK.

We don't define the root element for CoreFilePropertiesPart and that means the programmer has to add the contents of these parts (at least core.xml) with raw XML like this: [example not shown]

This is true for all parts. In all cases, you have to set that root element explicitly after having added a new part. Using a WordprocessingDocument and a MainDocumentPart as an example, you'll have to do the following to create a minimal document when using the strongly typed classes:

using var wordDocument = WordprocessingDocument.Create(stream, type); MainDocumentPart part = wordDocument.AddMainDocumentPart(); part.Document = new Document( new Body( new Paragraph())); I am not 100% sure whether the w:p element is required by the standard, but Word would add it. I've laid out the code to mimic the layout of the corresponding XML document.

Using the Linq-to-XML way, you'd do it like this.

using var wordDocument = WordprocessingDocument.Create(stream, type); MainDocumentPart part = wordDocument.AddMainDocumentPart(); part.SetXElement( new XElement(W.document, new XAttribute(XNamespace.Xmlns + "w", W.w), new XElement(W.body, new XElement(W.p))) It would be really cool if the SDK could create standards-compliant, minimal parts (e.g., MainDocumentPart, CoreFilePropertiesPart) when calling the AddMainDocumentPart, for example. That way, the Document and RootElement properties would not have to be marked as nullable.

We should investigate building these documents more completely like Office, even though Office doesn't throw an error on opening them.

Yes, absolutely!

Reference: PR 1107 issuecomment-996561981


ThomasBarnekow: It would certainly be nice to be able to tell the SDK to create "more complete" or "initially compliant" documents such as minimal Word, Excel, and PowerPoint documents that fulfill certain criteria. But those should have to be created both explicitly (e.g., through settings or otherwise) and consistently. "Consistently" should also mean "for all Office document types", i.e., Excel, Word, and PowerPoint. The latter would particularly benefit from a feature that creates a minimal PresentationDocument as the minimal markup required is much more involved than that of WordprocessingDocument and SpreadsheetDocument, for example.

Whatever magic we add should not create breaking changes. When creating documents with the SDK, no parts are added at the moment. Thus, callers would happily (try to) create another part of the same type, which will likely fail (I did not test that).

This method (or its siblings) would not be the place to inject such functionality. You'd have to duplicate this in all other Create methods. We'd need some other place, such as the constructor(s).

Reference: PR 1107 discussion_r771194825