gkholman closed 5 months ago
Some new information today: Wordinator creates a Word file that Word opens in Compatibility Mode. The problem is manifest.
Creating a new file creates a file not in Compatibility Mode, and the problem is not manifest. Saving that new file as a ".doc" instead of ".docx" changes the status to be in Compatibility Mode, and the problem is manifest without changing any data in the file.
Saving that new file with the problem manifest as yet another new ".docx" file, Word presents the dialogue about upgrading the file, the file is saved, and the problem is not manifest anymore.
Can Wordinator be configured to produce Word files that are opened as ".docx" not in Compatibility Mode, rather than ".docx" in Compatibility Mode?
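A minimal sketch for checking which mode a generated file declares. As far as I know, Word records this in word/settings.xml as a w:compatSetting named compatibilityMode (15 for Word 2013 and later), and a file without that setting is opened in Compatibility Mode. The function name and the sample strings are hypothetical, and this uses only the Python standard library, not Wordinator or POI.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"

# Hypothetical sample settings parts, reduced to the relevant elements.
SETTINGS_WITH_MODE = (
    '<w:settings xmlns:w="' + W_NS + '"><w:compat>'
    '<w:compatSetting w:name="compatibilityMode" '
    'w:uri="http://schemas.microsoft.com/office/word" w:val="15"/>'
    '</w:compat></w:settings>'
)
SETTINGS_WITHOUT_MODE = '<w:settings xmlns:w="' + W_NS + '"/>'


def compatibility_mode(settings_xml):
    """Return the declared compatibilityMode as an int, or None if absent."""
    root = ET.fromstring(settings_xml)
    for setting in root.iter(W + "compatSetting"):
        if setting.get(W + "name") == "compatibilityMode":
            return int(setting.get(W + "val"))
    return None
```

For a real file, the settings part can be read with zipfile.ZipFile(path).read("word/settings.xml") and passed in as bytes.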
I'll look into this tomorrow.
What Word actually outputs is the below. As you can see, the only difference between paragraphs that are formatted correctly and those that are not is that the ones that are not formatted correctly have <w:type w:val="nextPage"/>.
It's not immediately obvious whether this is a Word bug or what it is.
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
<w:body>
<w:p>
<w:pPr>
<w:pStyle w:val="P"/>
<w:sectPr>
<w:type w:val="nextPage"/>
</w:sectPr>
</w:pPr>
<w:r>
<w:t>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Nullam justo. Mauris eleifend, pede at congue porttitor, magna tellus consectetuer odio, et dapibus quam velit quis lectus. Curabitur vestibulum mattis leo.</w:t>
</w:r>
</w:p>
<w:p>
<w:pPr><w:pStyle w:val="P"/></w:pPr>
<w:r>
<w:t>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Nullam justo. Mauris eleifend, pede at congue porttitor, magna tellus consectetuer odio, et dapibus quam velit quis lectus. Curabitur vestibulum mattis leo.</w:t>
</w:r>
</w:p>
<w:p>
<w:pPr><w:pStyle w:val="P"/><w:sectPr><w:type w:val="nextPage"/></w:sectPr></w:pPr>
<w:r>
<w:t>Vestibulum eget massa. Etiam a ligula sed massa placerat interdum. In hac habitasse platea dictumst. Curabitur augue sapien, tincidunt non, varius et, tincidunt non, augue. Proin urna.</w:t>
</w:r>
</w:p>
This <w:type w:val="nextPage"/> simply says the section begins on the next page. That's all OOXML has to say about this.
However, it seems this is something to do with the predefined style P. If I remove that style from the paragraphs in the input document then the resulting formatting is correct.
I don't think the paragraph marks and pressing enter have anything to do with this. Pressing enter just creates a new paragraph and moves the problem to the next paragraph (which doesn't show, as it's empty).
Unfortunately, I don't have a working version of Word here (would need to pay for license), so I can't investigate further in Word. But it seems that the problem is neither Word nor Wordinator, but the P style.
P is defined as:
<w:style w:type="paragraph" w:customStyle="1" w:styleId="P">
<w:name w:val="P"/>
<w:basedOn w:val="Normal"/>
<w:qFormat/>
<w:rsid w:val="00943E5D"/>
<w:pPr>
<w:jc w:val="both"/>
</w:pPr>
<w:rPr>
<w:sz w:val="28"/>
</w:rPr>
</w:style>
The w:sz defines the font size in half points, so it says 14pt, basically.
w:jc specifies paragraph alignment and both means "justify text between both margins equally, but inter-character spacing is not affected." That's not really what Word is doing here, but I think it's noteworthy that the formatting we expect is the one specified by start.
If I edit the docx file so that w:jc is set to start we get the formatting we expect.
So the problem is the P style, and not Word or Wordinator. This apparently is not a bug.
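To make the reading of the style concrete, here is a small sketch that pulls the alignment and font size out of a style definition like the one above. The function name and the returned dictionary keys are hypothetical; it only looks at values set directly on the style and does not follow w:basedOn.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# The P style from the thread, trimmed to the elements read below.
STYLE_P = """<w:style xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    w:type="paragraph" w:customStyle="1" w:styleId="P">
  <w:name w:val="P"/>
  <w:basedOn w:val="Normal"/>
  <w:pPr><w:jc w:val="both"/></w:pPr>
  <w:rPr><w:sz w:val="28"/></w:rPr>
</w:style>"""


def describe_style(style_xml):
    """Summarize alignment and font size set directly on one w:style element."""
    style = ET.fromstring(style_xml)
    jc = style.find(W + "pPr/" + W + "jc")
    sz = style.find(W + "rPr/" + W + "sz")
    return {
        "styleId": style.get(W + "styleId"),
        "alignment": None if jc is None else jc.get(W + "val"),
        # w:sz is expressed in half points, so divide by two.
        "font_size_pt": None if sz is None else int(sz.get(W + "val")) / 2,
    }
```

Calling describe_style(STYLE_P) reports alignment "both" and a font size of 14 points, matching the description above.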
And, yet, the data I've provided uses the same P style on paragraphs that are not the last in a section, as well as last in the section.
And when the generated "Compatibility Mode" output is upgraded to DOCX without changing the P style, the result works without changing.
I'm unclear why the unchanged setting for "jc" wouldn't work everywhere in "Compatibility Mode", nor why revealing the paragraph marks reveals there are paragraph marks at the end of every paragraph EXCEPT the paragraph at the end of the section.
Surely if Wordinator consistently added the paragraph marker at the end of every paragraph we create, it would work both in Compatibility Mode and out of it. That was what I identified regarding the paragraph marks: the lack of consistency. The SWPX markup, as I illustrated, is identical for every paragraph I marked up.
If you look at the OOXML that Wordinator produces there is nothing like a paragraph mark in there. So that mark is some sort of visualization that Word produces, and I guess the absence of it is another symptom of the problem, rather than a cause.
I agree it looks like a bug, but it looks like a bug in Word. As far as I can tell, Wordinator is doing exactly what it's supposed to do.
Do we need jc=both or can we use jc=start? If we can use the latter there is no problem. If we must use the former then that raises all sorts of thorny questions, like what does both actually mean, and why is it that we must use it.
I can find no user interface manipulation of jc= in Word ... but, then, I'm no expert in Word.
I think jc= corresponds to "paragraph alignment" in the UI. More explanation.
Of course, yes, it corresponds to paragraph alignment, but the controls in Word regarding paragraph alignment are not so finely tunable to modify any concept of "both"/"start" for an attribute of the name "jc". The coarse adjustment only selects between "Left", "Centered", "Right", and "Justified".
XSL-FO specifically addresses what jc= addresses, and that is the alignment specifically of the last line of a paragraph of text.
So I'm not surprised there exists a property in OOXML supporting a traditional layout concept ... I stated that I could find no user interface manipulation of that concept in the Word tool.
My Mac version of Word has these buttons:
Left to right they seem to correspond to jc= being set to start, center, end, and finally either both or distribute. I don't have a license, so I can't actually edit documents and check. It's possible that distribute would work, too.
Ultimately, whatever the solution to this is, it has to be something that we can express in OOXML.
Can you confirm when you set jc="start" that the OTHER lines of the paragraph remain justified? The user requirement is that all lines except the last line of each paragraph be justified. If not, then this has been a red herring and the drop-down list I found, and the buttons you found, all correspond only to jc= and are irrelevant to the problem because there are no other fine tunings available, as there is in XSL-FO, for the alignment of the last line of a paragraph.
When Word in "Compatibility Mode" presents the last justified paragraph of a section, Word justifies the last line of a justified paragraph. Word does not justify the last line of other justified paragraphs in the section.
When upgrading the file out of "Compatibility Mode" by saving the document appropriately, the appearance of the unmodified last justified paragraph of a section is correct. Word modifies the paragraph and the last line is presented properly.
I suppose I could hack every file we create by adding an empty paragraph of zero or one pt font at the end of every section ... but that doesn't reflect the user data and, likely, the conversion of the Word file to STS XML downstream will then add that paragraph element that never was created by the user.
So far, this is what "fixes" the presentation of the last line of the last paragraph in a section when the paragraph is justified:
Is there a serialization option in the use of POI that uses the "upgraded" DOCX instead of DOCX in "Compatibility Mode"?
Can you confirm when you set jc="start" that the OTHER lines of the paragraph remain justified?
They don't. If I use distribute then all lines are justified, including the last line of every paragraph. This suggests to me that there is something strange either about justification in Word in general, or about justification in Word together with sections, or that jc alone isn't enough.
The user requirement is that all lines except the last line of each paragraph be justified.
That's definitely a reasonable requirement, and there doesn't seem to be a jc value that gives them that. But it's got to be possible somehow.
When upgrading the file out of "Compatibility Mode" by saving the document appropriately, the appearance of the unmodified last justified paragraph of a section is correct.
Hmmm. Word for Mac doesn't create any warnings about compatibility mode. I'm not sure how important this is. It sounds like it might be just a question of Word versions.
I suppose I could hack every file we create by adding an empty paragraph of zero or one pt font at the end of every section ... but that doesn't reflect the user data and, likely, the conversion of the Word file to STS XML downstream will then add that paragraph element that never was created by the user.
Yes. We also don't know what other problems this might create in other use cases.
Part of the difficulty here is that the settings which appear to cause the problem do not actually appear in the file that the Wordinator code is creating (word/document.xml inside the docx file), but in word/styles.xml, which is partly created from the template file. This means there are limits to what options Wordinator has, because literally everything Wordinator says about the last paragraph is:
<w:p>
<w:pPr>
<w:pStyle w:val="P"/>
<w:sectPr>
<w:type w:val="nextPage"/>
</w:sectPr>
</w:pPr>
<w:r>
<w:t>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Nullam justo. Mauris eleifend, pede at congue porttitor, magna tellus consectetuer odio, et dapibus quam velit quis lectus. Curabitur vestibulum mattis leo.</w:t>
</w:r>
</w:p>
As you can see we don't even have the jc setting, so Wordinator has little scope for playing around with workarounds, and even detecting the problem is very difficult. We'd need to look up the style, do tests on that, and also check the section context etc.
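The detection described here (look up the style, then check the section context) could be sketched roughly as follows. This is an illustration over the two XML parts, not Wordinator code: the function names and sample data are hypothetical, and it ignores w:basedOn inheritance and any w:jc set directly on a paragraph.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"

# Hypothetical, shortened stand-ins for word/styles.xml and word/document.xml.
STYLES = (
    '<w:styles xmlns:w="' + W_NS + '">'
    '<w:style w:type="paragraph" w:styleId="P">'
    '<w:pPr><w:jc w:val="both"/></w:pPr></w:style></w:styles>'
)
SECT = '<w:sectPr><w:type w:val="nextPage"/></w:sectPr>'
DOCUMENT = (
    '<w:document xmlns:w="' + W_NS + '"><w:body>'
    '<w:p><w:pPr><w:pStyle w:val="P"/>' + SECT + '</w:pPr><w:r><w:t>One</w:t></w:r></w:p>'
    '<w:p><w:pPr><w:pStyle w:val="P"/></w:pPr><w:r><w:t>Two</w:t></w:r></w:p>'
    '<w:p><w:pPr><w:pStyle w:val="P"/>' + SECT + '</w:pPr><w:r><w:t>Three</w:t></w:r></w:p>'
    '</w:body></w:document>'
)


def justified_style_ids(styles_xml):
    """Collect ids of styles that set jc="both" directly on the style."""
    ids = set()
    for style in ET.fromstring(styles_xml).iter(W + "style"):
        jc = style.find(W + "pPr/" + W + "jc")
        if jc is not None and jc.get(W + "val") == "both":
            ids.add(style.get(W + "styleId"))
    return ids


def section_ending_justified(document_xml, styles_xml):
    """Return 0-based indexes of body paragraphs that carry a sectPr
    and reference a justified style."""
    justified = justified_style_ids(styles_xml)
    body = ET.fromstring(document_xml).find(W + "body")
    hits = []
    for index, para in enumerate(body.findall(W + "p")):
        ppr = para.find(W + "pPr")
        if ppr is None or ppr.find(W + "sectPr") is None:
            continue  # not the last paragraph of a non-final section
        style = ppr.find(W + "pStyle")
        if style is not None and style.get(W + "val") in justified:
            hits.append(index)
    return hits
```

With the sample data, the first and third paragraphs are flagged, which are the ones that carry the section properties.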
There is one thing we can perhaps try to explore a little more, and that's this hint:
For all sections except the last section, the sectPr element is stored as a child element of the last paragraph in the section. For the last section, the sectPr is stored as a child element of the body element.
Is there something about having the section properties stored on it that makes Word treat the last paragraph differently? I could try removing the sections just to see, but of course that's not really a solution, either. But at least we might learn more about what's going on.
Overall, this looks like a very tricky problem, because it's not at all clear what's really causing the problem. I think there are only two viable routes toward a solution here:
Unfortunately, I suspect we could end up spending a lot of time before we get to the bottom of this. So we need to consider how much time this is worth.
Where you say...
That's definitely a reasonable requirement, and there doesn't seem to be a jc value that gives them that. But it's got to be possible somehow.
... in fact we are getting exactly what we need for all paragraphs except the last paragraph of a section, even when all paragraphs are marked up the same. My very first post has the DOCX that shows this, created using Wordinator, from an SWPX file that also is in the ZIP.
This is the PDF that is created from Word: paragraph-marker-20230213.pdf
... and one sees that the second and third pages show the first paragraph of the page correctly formatted as we need: justification of all lines except the last line. And the third page shows the second paragraph correctly formatted as we need, even though it presented incorrectly on the second page, using the identical markup.
Could you try a couple of experiments?
First, solve the problem the way you described in your initial comment, then save the Word file and send it to me (or attach here).
Second, start with a newly converted document, then try to apply the ctrl+shift+j trick to see if that lets you fix the last para. Also try it on one of the paras that is OK, to see what happens. Then save and share the document.
I don't have a Word where I can edit files, but this way we might be able to learn more about what OOXML achieves the desired result.
The original generated Word file and PDF: paragraph-marker-20230213.docx - paragraph-marker-20230213.pdf
Pressing \
Saving the original file new as an upgraded file: paragraph-marker-20230213-upgraded.docx - paragraph-marker-20230213-upgraded.pdf
Reading the trick page you cited, I see that the c+s+j expands the given line of a paragraph, it doesn't do the contraction that we need. Sure enough, that is the result that I see: paragraph-marker-20230213-tricked.docx - paragraph-marker-20230213-tricked.pdf
OOXML is achieving the desired result in the paragraph when the paragraph isn't the last paragraph of the section. There are no markup changes to compare between any non-last paragraph and the last paragraph.
And when I add a new paragraph at the end of an existing section, the paragraph that was incorrectly formatted magically becomes correctly formatted. But that is simply an extension of my "add \
In the docx where you pressed enter the first paragraph turns into this monstrosity:
<?xml version="1.0"?>
<w:p w14:paraId="775AB87C" w14:textId="77777777" w:rsidR="00605A42" w:rsidRDefault="008B3D8F">
<w:pPr>
<w:pStyle w:val="P"/>
</w:pPr>
<w:r>
<w:t xml:space="preserve">Lorem ipsum dolor sit </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>amet</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve">, </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>consectetuer</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve"> </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>adipiscing</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve"> </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>elit</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve">. </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>Nullam</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve"> </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>justo</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve">. </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>Mauris</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve"> </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>eleifend</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve">, </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>pede</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve"> at </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>congue</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve"> </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>porttitor</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve">, magna </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>tellus</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve"> </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>consectetuer</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve"> </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>odio</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve">, et </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>dapibus</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve"> </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>quam</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve"> </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>velit</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve"> </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>quis</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve"> </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>lectus</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve">. </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>Curabitur</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve"> vestibulum </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>mattis</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve"> </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>leo</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t>.</w:t>
</w:r>
</w:p>
The reason it works, however, is that there is an empty p trailing it. What's interesting, however, is that that empty p does not indicate the section the same way. Instead, it looks like this:
<w:p w14:paraId="45551865" w14:textId="74298185" w:rsidR="008B3D8F" w:rsidRDefault="008B3D8F">
<w:pPr>
<w:pStyle w:val="P"/>
<w:sectPr w:rsidR="008B3D8F">
<w:pgSz w:w="12240" w:h="15840"/>
<w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/>
<w:cols w:space="720"/>
</w:sectPr>
</w:pPr>
</w:p>
I'm not sure that can be used for anything, but the next p begins like so:
<w:p w14:paraId="6C95BF65" w14:textId="77777777" w:rsidR="00605A42" w:rsidRDefault="008B3D8F">
<w:pPr>
<w:pStyle w:val="P"/>
</w:pPr>
<w:r>
<w:lastRenderedPageBreak/>
<w:t xml:space="preserve">Lorem ipsum dolor sit </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
So perhaps this <w:lastRenderedPageBreak/> is the page-breaking trick that makes this work.
The upgraded document looks the same.
The document with the trick has <w:jc w:val="distribute"/>, so that's clearly what that trick does.
Okay, but then it looks like the way to solve this is to not convert sections the way we have done so far, but to use this other markup instead. I'll see if I can do that, but I'm not sure when I'll be able to work on it.
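Purely to illustrate the shape of that other markup, here is a sketch that moves each sectPr off a paragraph with text and onto a new empty paragraph placed right after it, which is what the file looks like after pressing enter in Word. It is not how Wordinator builds sections, the function name and sample data are hypothetical, and it has the drawback already noted in this thread: it adds a paragraph the user never wrote.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"
ET.register_namespace("w", W_NS)  # keep the w: prefix when serializing

# Hypothetical, shortened document part: one section-ending paragraph, one plain.
DOCUMENT = (
    '<w:document xmlns:w="' + W_NS + '"><w:body>'
    '<w:p><w:pPr><w:pStyle w:val="P"/>'
    '<w:sectPr><w:type w:val="nextPage"/></w:sectPr></w:pPr>'
    '<w:r><w:t>One</w:t></w:r></w:p>'
    '<w:p><w:pPr><w:pStyle w:val="P"/></w:pPr><w:r><w:t>Two</w:t></w:r></w:p>'
    '</w:body></w:document>'
)


def move_sectpr_to_empty_paragraph(document_xml):
    """Move each paragraph-level sectPr onto a new empty paragraph after it."""
    root = ET.fromstring(document_xml)
    body = root.find(W + "body")
    for para in list(body.findall(W + "p")):
        ppr = para.find(W + "pPr")
        sect = None if ppr is None else ppr.find(W + "sectPr")
        if sect is None or para.find(W + "r") is None:
            continue  # no section break here, or the paragraph is already empty
        ppr.remove(sect)
        new_para = ET.Element(W + "p")
        new_ppr = ET.SubElement(new_para, W + "pPr")
        style = ppr.find(W + "pStyle")
        if style is not None:
            # Reuse the same paragraph style, as Word did in the edited file.
            ET.SubElement(new_ppr, W + "pStyle", {W + "val": style.get(W + "val")})
        new_ppr.append(sect)
        body.insert(list(body).index(para) + 1, new_para)
    return ET.tostring(root, encoding="unicode")
```

Whether Word then formats the preceding paragraph correctly in Compatibility Mode would still need to be checked in Word itself.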
The problem goes away when one opens the file, selects File/Info, and turns off "Compatibility Mode" ... no changes to the file and all paragraphs look correct.
If there is no way to configure the environment to produce a file that is not in "Compatibility Mode", I suggest shelving this ticket until a client complains about having to do the conversion manually after the fact.
If you review older issues, you'll see that managing the details of final paragraphs in sections is a significant challenge because of the way Word uses the last paragraph of a section to define the rules for the entire section.
If memory serves, this also affects how things like justification of the last line of the last paragraph of a section get handled.
Okay. The <w:lastRenderedPageBreak/> method might solve this, but it requires substantial changes to how Wordinator represents sections. Given what Eliot writes I think there's a good chance that making this change will introduce new errors. It doesn't really seem worth exploring this route at the moment.
I tried the method ChatGPT suggested. I can't see anything in the output files that says what .docx version is used, so I'm not convinced this can be done at all. The code from ChatGPT is basically nonsensical when you compare it to the actual Apache POI API. The classes and methods quite simply don't fit together this way, and some of the methods appear not to exist at all. So this seems to be a dead end.
Trying to google for a way to set the version of the .docx format yields nothing. (As expected, since that doesn't seem to be a meaningful thing at all.) In fact, there only appear to be two versions: the ECMA and the ISO version, and there doesn't seem to be any way to declare which one you're producing. So this doesn't seem to be a meaningful thing to try to do.
Yes, my initial attempts to "correct" or at least control the behavior of sections under specific conditions ran into some significant rework issues.
Part of the issue, if I'm remembering correctly, was that if there is exactly one section then one set of markup in the DOCX is required, but if there are two or more, then different markup is required, and the XML cursor technique used with POI doesn't make lookahead easy, so the problem is best solved in the SWPX generation, where you have all knowledge.
Closing as fixed: Turning off compatibility mode corrects the issue.
Note that this became possible with POI 5.2.5, which enabled modifying the document-level settings using the XWPF API.
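For readers who want to see what "turning off compatibility mode" amounts to at the file level, here is a rough sketch that edits the settings part directly. It is not the POI code used for the fix. It assumes the mode lives in a w:compatSetting named compatibilityMode under w:compat in word/settings.xml, with 15 meaning Word 2013 and later; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"
ET.register_namespace("w", W_NS)  # keep the w: prefix when serializing


def set_compatibility_mode(settings_xml, mode=15):
    """Return settings XML with compatibilityMode set to the given value."""
    root = ET.fromstring(settings_xml)
    compat = root.find(W + "compat")
    if compat is None:
        # Simplification: appended at the end. The schema fixes the order of
        # w:settings children, so real code should insert at the right position.
        compat = ET.SubElement(root, W + "compat")
    for setting in compat.findall(W + "compatSetting"):
        if setting.get(W + "name") == "compatibilityMode":
            setting.set(W + "val", str(mode))
            break
    else:
        ET.SubElement(compat, W + "compatSetting", {
            W + "name": "compatibilityMode",
            W + "uri": "http://schemas.microsoft.com/office/word",
            W + "val": str(mode),
        })
    return ET.tostring(root, encoding="unicode")
```

Calling it a second time on its own output updates the existing setting rather than adding a duplicate.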
... resulting in the last line of a justified paragraph being justified instead of aligned left.
I diagnosed this by turning on paragraph marks and noting the absence of the paragraph mark at the end of the paragraph. Adding it by hand made the justified paragraph format correctly with the last line not being justified.
To illustrate this, the attached is a three-section SWPX file, with one, two, and three paragraphs in the respective sections. The order and content of the paragraphs is the same in the three sections. The last paragraph of each section is not formatted correctly when brought into Word as is. Pressing \ at the end of the last paragraph of the section formats that paragraph correctly.
paragraph-marker-20230213.zip