dolanmiu / docx

Easily generate and modify .docx files with JS/TS with a nice declarative API. Works for Node and on the Browser.
https://docx.js.org/
MIT License

Add validator (Nonconformance to Office Open XML schema) #947

Closed devoidfury closed 3 years ago

devoidfury commented 3 years ago

EDIT: The actionable thing to do here is to add a JavaScript validator that checks generated documents against one of the wml.xsd schemas.

==========

Hey there! Great library, I've been using it a while and trying to help out a little where I can.

An issue I've come across is that it's really easy to generate a corrupted document, and tricky to pinpoint exactly where and why it happens. It's not the fault of this library, and to be honest there aren't any good options for validating these XML documents in JavaScript. (I'm working on this; soon I hope to have a validator with specific error messages in pure JS!)

I've set up a hacky tool to validate these documents locally on Linux with libxml/xmllint (I'll share that setup once I write a little wrapper around it), and I've noticed that it spits out a ton of errors. One of the most common is that, in several places, the spec expects child nodes in a specific sequence in order to conform. Out-of-order nodes mostly work, but I suspect they've caused some bugs!

See, for example, the schema for the base abstract type used under w:pPr. Notice the <xsd:sequence>: it means the child elements must appear in this specific order to conform.

  <xsd:complexType name="CT_PPrBase">
    <xsd:sequence>
      <xsd:element name="pStyle" type="CT_String" minOccurs="0"/>
      <xsd:element name="keepNext" type="CT_OnOff" minOccurs="0"/>
      <xsd:element name="keepLines" type="CT_OnOff" minOccurs="0"/>
      <xsd:element name="pageBreakBefore" type="CT_OnOff" minOccurs="0"/>
      <xsd:element name="framePr" type="CT_FramePr" minOccurs="0"/>
      <xsd:element name="widowControl" type="CT_OnOff" minOccurs="0"/>
      <xsd:element name="numPr" type="CT_NumPr" minOccurs="0"/>
      <xsd:element name="suppressLineNumbers" type="CT_OnOff" minOccurs="0"/>
      <xsd:element name="pBdr" type="CT_PBdr" minOccurs="0"/>
      <xsd:element name="shd" type="CT_Shd" minOccurs="0"/>
      <xsd:element name="tabs" type="CT_Tabs" minOccurs="0"/>
      <xsd:element name="suppressAutoHyphens" type="CT_OnOff" minOccurs="0"/>
      <xsd:element name="kinsoku" type="CT_OnOff" minOccurs="0"/>
      <xsd:element name="wordWrap" type="CT_OnOff" minOccurs="0"/>
      <xsd:element name="overflowPunct" type="CT_OnOff" minOccurs="0"/>
      <xsd:element name="topLinePunct" type="CT_OnOff" minOccurs="0"/>
      <xsd:element name="autoSpaceDE" type="CT_OnOff" minOccurs="0"/>
      <xsd:element name="autoSpaceDN" type="CT_OnOff" minOccurs="0"/>
      <xsd:element name="bidi" type="CT_OnOff" minOccurs="0"/>
      <xsd:element name="adjustRightInd" type="CT_OnOff" minOccurs="0"/>
      <xsd:element name="snapToGrid" type="CT_OnOff" minOccurs="0"/>
      <xsd:element name="spacing" type="CT_Spacing" minOccurs="0"/>
      <xsd:element name="ind" type="CT_Ind" minOccurs="0"/>
      <xsd:element name="contextualSpacing" type="CT_OnOff" minOccurs="0"/>
      <xsd:element name="mirrorIndents" type="CT_OnOff" minOccurs="0"/>
      <xsd:element name="suppressOverlap" type="CT_OnOff" minOccurs="0"/>
      <xsd:element name="jc" type="CT_Jc" minOccurs="0"/>
      <xsd:element name="textDirection" type="CT_TextDirection" minOccurs="0"/>
      <xsd:element name="textAlignment" type="CT_TextAlignment" minOccurs="0"/>
      <xsd:element name="textboxTightWrap" type="CT_TextboxTightWrap" minOccurs="0"/>
      <xsd:element name="outlineLvl" type="CT_DecimalNumber" minOccurs="0"/>
      <xsd:element name="divId" type="CT_DecimalNumber" minOccurs="0"/>
      <xsd:element name="cnfStyle" type="CT_Cnf" minOccurs="0" maxOccurs="1"/>
    </xsd:sequence>
  </xsd:complexType>

Sadly, this is not documented anywhere on the officeopenxml.com site; it is only found in the ECMA-376 reference schemas (see, for example, ECMA-376 fifth edition, part one, page 3839, which contains a version of the above element type).

https://www.ecma-international.org/publications-and-standards/standards/ecma-376/
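
To make the ordering rule concrete, here is a rough sketch of an order check in TypeScript. It is not part of docx: `PPR_SEQUENCE` and `findOrderViolation` are names made up for illustration, and children are represented as plain local names rather than real XML nodes.

```typescript
// Element order copied from the CT_PPrBase sequence quoted above.
const PPR_SEQUENCE: readonly string[] = [
    "pStyle", "keepNext", "keepLines", "pageBreakBefore", "framePr",
    "widowControl", "numPr", "suppressLineNumbers", "pBdr", "shd", "tabs",
    "suppressAutoHyphens", "kinsoku", "wordWrap", "overflowPunct",
    "topLinePunct", "autoSpaceDE", "autoSpaceDN", "bidi", "adjustRightInd",
    "snapToGrid", "spacing", "ind", "contextualSpacing", "mirrorIndents",
    "suppressOverlap", "jc", "textDirection", "textAlignment",
    "textboxTightWrap", "outlineLvl", "divId", "cnfStyle",
];

interface OrderViolation {
    readonly element: string; // child that appears too late
    readonly after: string;   // earlier sibling that the schema places after it
}

// Hypothetical helper: returns the first place where the children break the
// xsd:sequence order, or undefined if the order is fine. Each entry may occur
// at most once, so a repeated name is reported as a violation too.
function findOrderViolation(
    children: readonly string[],
    sequence: readonly string[],
): OrderViolation | undefined {
    let lastIndex = -1;
    let lastName = "";
    for (const name of children) {
        const index = sequence.indexOf(name);
        if (index === -1) {
            continue; // names outside the sequence are not checked here
        }
        if (index <= lastIndex) {
            return { element: name, after: lastName };
        }
        lastIndex = index;
        lastName = name;
    }
    return undefined;
}
```

For example, `findOrderViolation(["jc", "spacing"], PPR_SEQUENCE)` reports `spacing` as appearing after `jc`, because the schema lists `spacing` first.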

devoidfury commented 3 years ago

Related issue: #876

Sequence for the w:lvl element:

  <xsd:complexType name="CT_Lvl">
    <xsd:sequence>
      <xsd:element name="start" type="CT_DecimalNumber" minOccurs="0"/>
      <xsd:element name="numFmt" type="CT_NumFmt" minOccurs="0"/>
      <xsd:element name="lvlRestart" type="CT_DecimalNumber" minOccurs="0"/>
      <xsd:element name="pStyle" type="CT_String" minOccurs="0"/>
      <xsd:element name="isLgl" type="CT_OnOff" minOccurs="0"/>
      <xsd:element name="suff" type="CT_LevelSuffix" minOccurs="0"/>
      <xsd:element name="lvlText" type="CT_LevelText" minOccurs="0"/>
      <xsd:element name="lvlPicBulletId" type="CT_DecimalNumber" minOccurs="0"/>
      <xsd:element name="lvlJc" type="CT_Jc" minOccurs="0"/>
      <xsd:element name="pPr" type="CT_PPrGeneral" minOccurs="0"/>
      <xsd:element name="rPr" type="CT_RPr" minOccurs="0"/>
    </xsd:sequence>
    <xsd:attribute name="ilvl" type="ST_DecimalNumber" use="required"/>
    <xsd:attribute name="tplc" type="ST_LongHexNumber" use="optional"/>
    <xsd:attribute name="tentative" type="s:ST_OnOff" use="optional"/>
  </xsd:complexType>
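
One possible way to fix the ordering, rather than only detect it, is to sort the children into schema order before serializing. This is a sketch under assumptions, not how docx does it: `LVL_SEQUENCE` and `sortIntoSequence` are hypothetical names, and each child is assumed to expose its local element name as `name`.

```typescript
// Element order copied from the CT_Lvl sequence quoted above.
const LVL_SEQUENCE: readonly string[] = [
    "start", "numFmt", "lvlRestart", "pStyle", "isLgl", "suff",
    "lvlText", "lvlPicBulletId", "lvlJc", "pPr", "rPr",
];

// Hypothetical helper: returns a new array with the children arranged in
// schema order. Names missing from the sequence are moved to the end and
// keep their relative order, because Array.prototype.sort is stable.
function sortIntoSequence<T extends { readonly name: string }>(
    children: readonly T[],
    sequence: readonly string[],
): T[] {
    const rank = (name: string): number => {
        const index = sequence.indexOf(name);
        return index === -1 ? sequence.length : index;
    };
    // slice() copies the array so the caller's input is left untouched.
    return children.slice().sort((a, b) => rank(a.name) - rank(b.name));
}
```

Given children named `pPr`, `lvlText`, `start`, `numFmt` in that order, the helper returns them as `start`, `numFmt`, `lvlText`, `pPr`.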

devoidfury commented 3 years ago

I made quite a bit of progress on this, here: https://github.com/dolanmiu/docx/compare/master...devoidfury:bug/ooxml-conformance-fixes

The main errors I'm getting that I don't know how to handle:

Invalid mirrorMargins attribute on w:pgMar EDIT: this has been removed on my branch.

Invalid element w:shdCs (couldn't find a reference for these anywhere -- should it just be deleted? Looks like w:shd does everything here) EDIT: this has been removed in my branch.

w:document has an invalid attribute mc:Ignorable="w14 w15 wp14". I couldn't find a formal reference or documentation for this attribute anywhere, but it is discussed here, and it's commonly used across various XML document types: http://www.wordarticles.com/Articles/Formats/OOXML/OOXML.php
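
For context, mc:Ignorable comes from the Markup Compatibility and Extensibility part of ECMA-376 (Part 3). It lists namespace prefixes that a consumer may ignore if it does not understand them. The `mc` prefix is conventionally bound to `http://schemas.openxmlformats.org/markup-compatibility/2006`. Assuming the error comes from validating against wml.xsd alone, one possible workaround is to drop attributes in that namespace before validating. A rough sketch, where `stripMarkupCompatibility` is a hypothetical helper and attributes are modelled as a plain name-to-value record:

```typescript
const MC_NAMESPACE = "http://schemas.openxmlformats.org/markup-compatibility/2006";

// Hypothetical helper: returns a copy of an element's attributes without any
// attribute whose prefix is bound to the Markup Compatibility namespace.
// The xmlns:* declarations themselves are kept.
function stripMarkupCompatibility(
    attrs: Readonly<Record<string, string>>,
): Record<string, string> {
    // Collect every prefix that an xmlns:* declaration binds to MC_NAMESPACE.
    const mcPrefixes: string[] = [];
    for (const key of Object.keys(attrs)) {
        if (key.indexOf("xmlns:") === 0 && attrs[key] === MC_NAMESPACE) {
            mcPrefixes.push(key.slice("xmlns:".length));
        }
    }

    const result: Record<string, string> = {};
    for (const key of Object.keys(attrs)) {
        const value = attrs[key];
        if (value === undefined) {
            continue;
        }
        const colon = key.indexOf(":");
        const prefix = colon === -1 ? "" : key.slice(0, colon);
        if (mcPrefixes.indexOf(prefix) !== -1) {
            continue; // e.g. mc:Ignorable
        }
        result[key] = value;
    }
    return result;
}
```

This only removes the attribute; elements from the ignorable namespaces (w14, w15, wp14) would still need separate handling before a strict schema check passes.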

dolanmiu commented 3 years ago

Amazing

Happy to make this part of the CI process, to validate against the schema, once this is done.

dolanmiu commented 3 years ago

Yes, if w:shdCs is not in the spec, or anywhere else, it can be removed.

devoidfury commented 3 years ago

Here's the validator setup I'm using in the meantime -- I want to make a JS version, but this is the quick and dirty solution that's enabled me to validate the documents against the schema.

https://github.com/devoidfury/docx-validator

devoidfury commented 3 years ago

Would it be helpful to inline these XSD types as comments, so that we have a reference to the valid attributes/children/sequences right next to the code?
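
Something like the following, as a rough sketch only. The XSD lines are the attribute declarations of CT_Lvl quoted earlier in this thread; `LevelAttributes` and `levelAttributes` are hypothetical names, not existing docx code, and writing the on/off value as "1"/"0" is an assumption about the serialized form.

```typescript
// <xsd:complexType name="CT_Lvl">
//   (child element sequence omitted here)
//   <xsd:attribute name="ilvl" type="ST_DecimalNumber" use="required"/>
//   <xsd:attribute name="tplc" type="ST_LongHexNumber" use="optional"/>
//   <xsd:attribute name="tentative" type="s:ST_OnOff" use="optional"/>
// </xsd:complexType>
interface LevelAttributes {
    readonly ilvl: number;        // use="required"
    readonly tplc?: string;       // use="optional", hex string
    readonly tentative?: boolean; // use="optional"
}

// Hypothetical helper: maps the options onto w:-prefixed attribute names,
// emitting optional attributes only when they were provided.
function levelAttributes(options: LevelAttributes): Record<string, string> {
    const attrs: Record<string, string> = { "w:ilvl": String(options.ilvl) };
    if (options.tplc !== undefined) {
        attrs["w:tplc"] = options.tplc;
    }
    if (options.tentative !== undefined) {
        attrs["w:tentative"] = options.tentative ? "1" : "0";
    }
    return attrs;
}
```

With the schema fragment sitting directly above the type, a reviewer can see at a glance which attributes are required and which are optional.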

dolanmiu commented 3 years ago

I think so, yes. Adding these XSD types would be invaluable.

dolanmiu commented 3 years ago

@devoidfury I am adding it into GitHub Actions

Thank you for your research into this area

The checks are based on the same OOXML schemas as in your docx-validator project:

https://github.com/dolanmiu/docx/pull/1202