CenterForOpenScience / pydocx

An extendable docx file format parser and converter

Odd margin being added to bullets #224

Open jhubert opened 7 years ago

jhubert commented 7 years ago

When a docx file with adjusted list indentation is imported, the resulting HTML puts the margin in the wrong place, which produces oddly formatted output.

For example, here are two lists in Word:

image

The first list has been indented, the second one has the standard doc indentation.

Here is the result in HTML:

image

The resulting HTML has a span inside the li with a margin-left set on it:

<li><span style="margin-left:3.00em">This is a list item</span></li>

It seems that the whole ul should have the margin, if anything at all.

Here is the sample file: list-item-margin.docx

And here is the cleaned-up source from the docx's document.xml file:

  <w:p w14:paraId="25B98899" w14:textId="77777777" w:rsidR="00442583" w:rsidRDefault="00442583" w:rsidP="00442583">
    <w:r>
      <w:t>Headline:</w:t>
    </w:r>
  </w:p>
  <w:p w14:paraId="59BA247B" w14:textId="77777777" w:rsidR="00442583" w:rsidRDefault="00442583" w:rsidP="00442583">
    <w:pPr>
      <w:pStyle w:val="ListParagraph"/>
      <w:numPr>
        <w:ilvl w:val="0"/>
        <w:numId w:val="2"/>
      </w:numPr>
      <w:ind w:left="720"/>
    </w:pPr>
    <w:r>
      <w:t>This is a list item</w:t>
    </w:r>
  </w:p>
  <w:p w14:paraId="551666C9" w14:textId="77777777" w:rsidR="00442583" w:rsidRDefault="00442583" w:rsidP="00442583">
    <w:pPr>
      <w:pStyle w:val="ListParagraph"/>
      <w:numPr>
        <w:ilvl w:val="0"/>
        <w:numId w:val="2"/>
      </w:numPr>
      <w:ind w:left="720"/>
    </w:pPr>
    <w:r>
      <w:t>This is a list item</w:t>
    </w:r>
  </w:p>
  <w:p w14:paraId="1A7915D0" w14:textId="08BE8BEB" w:rsidR="005D0069" w:rsidRDefault="005D0069" w:rsidP="00892FBD"/>
  <w:p w14:paraId="29C5692C" w14:textId="323CB438" w:rsidR="00FB2CED" w:rsidRDefault="00FB2CED" w:rsidP="00892FBD">
    <w:r>
      <w:t>Headline:</w:t>
    </w:r>
  </w:p>
  <w:p w14:paraId="156080DF" w14:textId="24763A50" w:rsidR="00FB2CED" w:rsidRDefault="00FB2CED" w:rsidP="00FB2CED">
    <w:pPr>
      <w:pStyle w:val="ListParagraph"/>
      <w:numPr>
        <w:ilvl w:val="0"/>
        <w:numId w:val="3"/>
      </w:numPr>
    </w:pPr>
    <w:r>
      <w:t>This is a list item</w:t>
    </w:r>
  </w:p>
  <w:p w14:paraId="5DDA8D93" w14:textId="4FE9F6C2" w:rsidR="00FB2CED" w:rsidRDefault="00FB2CED" w:rsidP="00FB2CED">
    <w:pPr>
      <w:pStyle w:val="ListParagraph"/>
      <w:numPr>
        <w:ilvl w:val="0"/>
        <w:numId w:val="3"/>
      </w:numPr>
    </w:pPr>
    <w:r>
      <w:t>This is a list item</w:t>
    </w:r>
  </w:p>

The difference seems to be the presence of the <w:ind w:left="720"/> element, which I assume is what tells pydocx to add the indentation.
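To make that difference concrete, here is a minimal sketch that lists each numbered paragraph with its explicit left indent. It uses only the standard library and is not pydocx code. The `list_indents` name and the trimmed `SAMPLE` fragment are made up for illustration. `w:ind` values are stored in twentieths of a point; the 12pt-per-em factor is an assumption, chosen because it reproduces the 3.00em seen in the HTML above.

```python
import xml.etree.ElementTree as ET

NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}
W_LEFT = "{%s}left" % NS["w"]
W_VAL = "{%s}val" % NS["w"]

TWIPS_PER_POINT = 20  # w:ind values are in twentieths of a point
POINTS_PER_EM = 12    # assumption: 12pt per em, which matches the 3.00em output


def list_indents(document_xml):
    """Return (numId, left indent in em or None) for every numbered paragraph."""
    root = ET.fromstring(document_xml)
    result = []
    for p in root.iter("{%s}p" % NS["w"]):
        num_id = p.find("w:pPr/w:numPr/w:numId", NS)
        if num_id is None:
            continue  # not a list paragraph
        ind = p.find("w:pPr/w:ind", NS)
        left = ind.get(W_LEFT) if ind is not None else None
        em = int(left) / TWIPS_PER_POINT / POINTS_PER_EM if left else None
        result.append((num_id.get(W_VAL), em))
    return result


# Hypothetical fragment, trimmed from the document.xml shown above.
SAMPLE = """
<w:body xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:p>
    <w:pPr>
      <w:numPr><w:ilvl w:val="0"/><w:numId w:val="2"/></w:numPr>
      <w:ind w:left="720"/>
    </w:pPr>
    <w:r><w:t>This is a list item</w:t></w:r>
  </w:p>
  <w:p>
    <w:pPr>
      <w:numPr><w:ilvl w:val="0"/><w:numId w:val="3"/></w:numPr>
    </w:pPr>
    <w:r><w:t>This is a list item</w:t></w:r>
  </w:p>
</w:body>
"""

print(list_indents(SAMPLE))
```

Only the paragraph in list 2 carries an explicit indent, which lines up with the one list that gets the extra margin in the HTML.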

botzill commented 7 years ago

OK, I will investigate this issue and try to come up with a PR. If there are any suggestions, let me know.

botzill commented 7 years ago

Btw @jhubert, what is the desired output for this? Should there be no margin at all? I ask because I don't see any margin when opening the file in LibreOffice, unlike what your first screenshot shows. But when converting from LibreOffice to HTML I get:

screen shot 2016-12-23 at 12 48 40 pm

which is a little different from what we have with pydocx.

jhubert commented 7 years ago

I think the desired output is that the inset matches the Word document. For this simple case, that should just mean removing the margin on the inner span.

botzill commented 7 years ago

Hm, but couldn't there be cases where we actually need this margin?

jhubert commented 7 years ago

There are definitely more complex cases, and I don't think any of them are being handled properly. Here are some examples.

When the word document has this: image

The HTML output is this: image

In the first case, the nested list items are getting margin added to the content of each item but the bullet should be in line with the headline. Basically everything is wrong.

In the second case, the list should have a negative margin so the list items match the indent of the headline.

In the third case, the list should have additional margin so that it's inset more into the page than the headline.

I would call these more or less edge cases... the only one that really feels broken when looking at it is: image

So, that's probably worth spending the most time on. If the rest of them get solved in the process, hurrah! 💯

botzill commented 7 years ago

@jhubert can you also attach the .docx files for the examples you mention, so we have some to use in tests? Thanks.

botzill commented 7 years ago

I just don't understand when we need to ignore this margin and when we should not. Maybe @winhamwr @kylegibson can give some advice on this.

jhubert commented 7 years ago

@botzill I can't think of a time where we would want the margin next to the list item. If anything, I think there would be a case where we want the margin on the whole ul.
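As a rough illustration of moving the margin to the whole ul, here is a post-processing sketch that works on the converted HTML. It is not a change to pydocx itself. It assumes the list markup is well-formed XML and that every li in a list wraps its content in a single span with the same margin-left style, as in the li shown earlier. The `hoist_list_margins` name and the surrounding ul are made up for the example.

```python
import xml.etree.ElementTree as ET


def hoist_list_margins(html_fragment):
    """Move a margin-left shared by every <li><span> up to the enclosing list."""
    # Wrap in a dummy <div> so a fragment with several top-level nodes parses.
    root = ET.fromstring("<div>" + html_fragment + "</div>")
    for lst in root.iter():
        if lst.tag not in ("ul", "ol"):
            continue
        spans = []
        for li in lst:
            kids = list(li)
            simple = (
                li.tag == "li"
                and len(kids) == 1
                and kids[0].tag == "span"
                and not (li.text or "").strip()
                and not (kids[0].tail or "").strip()
            )
            if not simple:
                spans = []  # leave lists with any other structure untouched
                break
            spans.append(kids[0])
        styles = {span.get("style") for span in spans}
        if len(styles) != 1:
            continue  # no items, or the items disagree on their style
        style = styles.pop()
        if not style or not style.startswith("margin-left:"):
            continue
        lst.set("style", style)
        for span in spans:
            del span.attrib["style"]
    return "".join(ET.tostring(child, encoding="unicode") for child in root)


html = (
    '<ul>'
    '<li><span style="margin-left:3.00em">This is a list item</span></li>'
    '<li><span style="margin-left:3.00em">This is a list item</span></li>'
    '</ul>'
)
print(hoist_list_margins(html))
```

This only covers the simple case where the whole list shares one indent. Browsers also apply their own default padding to ul, so the hoisted value may still need adjusting to match the Word layout, and the nested and negative-indent cases above would need more than this.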