CenterForOpenScience / pydocx

An extendable docx file format parser and converter

Space between item listing/tables #228

Open botzill opened 7 years ago

botzill commented 7 years ago

Currently, a list like this:

[screenshot: list in the source document]

is exported as:

[screenshot: exported list]

So there is no space between the items.

The same applies to tables. Input:

[screenshot: table in the source document]

Output:

[screenshot: exported table]

botzill commented 7 years ago

Looking at the code, I see that this is done deliberately, in:

def export_paragraph(self, paragraph):
    results = super(PyDocXHTMLExporter, self).export_paragraph(paragraph)

    results = is_not_empty_and_not_only_whitespace(results)
    if results is None:
        return

Any reason why we do that?

Basically, I think we need to detect empty paragraphs and convert them into <br/> to get proper output.
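
To make the proposal concrete, here is a minimal standalone sketch of "detect empty paragraphs and emit <br/>". It does not use pydocx's exporter classes; the helper names `paragraph_text` and `paragraphs_to_html` are hypothetical, and it works directly on a WordprocessingML body fragment with the standard library. Whitespace-only paragraphs are treated as empty, which is an assumption about the desired behaviour.

```python
import xml.etree.ElementTree as ET
from html import escape

# Main WordprocessingML namespace, bound to the "w" prefix.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def paragraph_text(p):
    """Concatenate the text of every <w:t> element inside a <w:p>."""
    return "".join(t.text or "" for t in p.iterfind(".//w:t", NS))


def paragraphs_to_html(body_xml):
    """Hypothetical helper: render top-level paragraphs, turning empty ones into <br/>."""
    root = ET.fromstring(body_xml)
    parts = []
    for p in root.iterfind("w:p", NS):
        text = paragraph_text(p)
        if text.strip():
            parts.append("<p>%s</p>" % escape(text))
        else:
            # Empty (or whitespace-only) paragraph: keep the vertical gap.
            parts.append("<br/>")
    return "".join(parts)


sample = (
    '<w:body xmlns:w="%s">'
    "<w:p><w:r><w:t>one</w:t></w:r></w:p>"
    "<w:p/>"
    "<w:p><w:r><w:t>two</w:t></w:r></w:p>"
    "</w:body>" % W_NS
)
print(paragraphs_to_html(sample))  # <p>one</p><br/><p>two</p>
```

In this sketch the empty <w:p/> between the two text paragraphs survives as a <br/> instead of being dropped.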

kylegibson commented 7 years ago

If I recall correctly, it's because Word documents can contain these blank p's that don't actually render as anything in the document. Empty p's in OOXML do not necessarily translate to a line break in HTML. If in doubt: 1) Check the spec: how does it say empty p's should be handled? 2) Construct a Word document with some empty p's and open it in Word. What happens?
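
For the second check, a test document can be assembled by hand. The following is a rough sketch that uses only the standard library to build a minimal .docx package whose body has an empty <w:p/> between two text paragraphs. The function name `build_docx_bytes` is hypothetical, and the three-part package layout is assumed to be the minimum that Word will open; that has not been verified against every Word version.

```python
import io
import zipfile

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    "</Types>"
)

RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
    "</Relationships>"
)

# Body under test: text, an empty paragraph, then more text.
DOCUMENT = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    "<w:p><w:r><w:t>before</w:t></w:r></w:p>"
    "<w:p/>"
    "<w:p><w:r><w:t>after</w:t></w:r></w:p>"
    "</w:body>"
    "</w:document>"
)


def build_docx_bytes():
    """Hypothetical helper: return the bytes of a minimal .docx package."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", CONTENT_TYPES)
        zf.writestr("_rels/.rels", RELS)
        zf.writestr("word/document.xml", DOCUMENT)
    return buf.getvalue()
```

Writing the returned bytes to a file with a .docx extension and opening it in Word should show whether the empty paragraph takes up a line.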

botzill commented 7 years ago

Yes, I did some tests, and basically an empty <w:p/> is rendered as a new line. Of course, there can be different scenarios depending on where the <w:p/> is located. To be honest, I could not find proper information about empty p's; I only ran tests with a document.

I did some work related to this here: https://github.com/botzill/pydocx/commit/34ee04591e324511880eed52f8fc0757e4360917.

To render <w:p/> properly, we need to reset the default margins of the HTML p tag and allow empty p's to be processed. An empty paragraph is replaced with <p>&nbsp;</p> so that it works in lists as well.

This way we don't actually need this method: https://github.com/CenterForOpenScience/pydocx/blob/9cd76eeb1f99cb3e580a8138a00295087f86eae0/pydocx/export/base.py#L255.

I'm not sure yet whether this covers all cases, but from the tests I did it seems fine so far.
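
The <p>&nbsp;</p> approach described in this comment can be sketched roughly as follows. This is not the code from the linked commit: `render_paragraph`, `render_document`, and the exact CSS rule are hypothetical and only illustrate the idea of keeping empty paragraphs while zeroing the default paragraph margins.

```python
# Assumed reset rule: with no default margin, only real (or placeholder)
# paragraphs contribute vertical space.
PARAGRAPH_RESET_CSS = "p { margin: 0; }"


def render_paragraph(inner_html):
    """Hypothetical helper: wrap paragraph content, keeping empty paragraphs visible."""
    if not inner_html or not inner_html.strip():
        # A non-breaking space gives the empty paragraph one line of height.
        return "<p>&nbsp;</p>"
    return "<p>%s</p>" % inner_html


def render_document(paragraphs):
    """Hypothetical helper: join rendered paragraphs into a page with the margin reset."""
    body = "".join(render_paragraph(p) for p in paragraphs)
    return (
        "<html><head><style>%s</style></head><body>%s</body></html>"
        % (PARAGRAPH_RESET_CSS, body)
    )


print(render_document(["first", "", "second"]))
```

Because the placeholder is still a p element, the same output works inside list items and table cells, whereas a bare <br/> would need separate handling there.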

botzill commented 7 years ago

The info I found about p: https://msdn.microsoft.com/en-us/library/gg278323.aspx

The most basic unit of block-level content within a WordprocessingML document, paragraphs are stored using the <p> element. A paragraph defines a distinct division of content that begins on a new line. A paragraph can contain three pieces of information: optional paragraph properties, inline content (typically runs), and a set of optional revision IDs used to compare the content of two documents.

Also here: https://msdn.microsoft.com/en-us/library/documentformat.openxml.wordprocessing.paragraph.aspx. But there is no info about empty paragraphs.