booktype / python-ooxml

Python library for parsing .docx (Office Open XML) files
GNU Affero General Public License v3.0

Underline Tags #32

Open alanbacon opened 8 years ago

alanbacon commented 8 years ago

Underline tags in docx files are missed. The offending lines are parse.py:88 and parse.py:31 on commit 833e658.

The master branch treats the underline tag as having only two possible states, on or off, and does not account for the fact that the tag actually carries a string value such as 'single', 'double', or 'dash'. I have a branch that updates the rpr dictionary accordingly by altering the code around the two lines I mentioned: a 'u' field is added to the dictionary, with a string value representing the type of underlining.

However, I do not know what further implications this will have. Does another part of this project assume that the 'u' field of rpr will either not exist or take a true or false value?
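
For illustration, here is a minimal sketch of the kind of change described above. It is not the code in parse.py: the helper names `parse_underline` and `read_run_properties` are hypothetical, and it uses the standard library's `xml.etree.ElementTree` purely to keep the example self-contained. Treating a `w:u` element with no `w:val` attribute as 'single' is also an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix in .docx files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def parse_underline(rpr_element):
    """Return the underline style named by <w:u>, or None if there is none."""
    u = rpr_element.find(f"{{{W_NS}}}u")
    if u is None:
        return None
    # Assumption: a <w:u> without w:val is treated as a single underline.
    val = u.get(f"{{{W_NS}}}val", "single")
    # w:val="none" explicitly switches underlining off.
    return None if val == "none" else val


def read_run_properties(run_xml):
    """Build an rpr-style dict for one <w:r> element given as an XML string."""
    run = ET.fromstring(run_xml)
    rpr = {}
    rpr_element = run.find(f"{{{W_NS}}}rPr")
    if rpr_element is not None:
        style = parse_underline(rpr_element)
        if style is not None:
            # Store the style name as a string instead of a 1/0 flag.
            rpr["u"] = style
    return rpr
```

With this shape, any existing check of the form `if 'u' in rpr` or `if rpr.get('u')` still sees a truthy value for underlined runs, which is why a string value is unlikely to break code that only tests for presence.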

danielhjames commented 8 years ago

I see in https://msdn.microsoft.com/en-us/library/office/ff822388%28v=office.15%29.aspx that many types of underlining are possible. So is the intention that, for example:

<w:r>
  <w:rPr>
    <w:u w:val="double"/>
  </w:rPr>
  <w:t>double underlined</w:t>
</w:r>

would become this in the HTML output?

<p>
  <span class="double">double underlined</span>
</p>

Sadly it seems text-decoration-style: double (https://developer.mozilla.org/en-US/docs/Web/CSS/text-decoration-style) only works in Firefox for now.
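
One way to soften the browser-support problem is to emit `text-decoration: underline` alongside `text-decoration-style`, since a browser that does not understand the second property ignores it and still draws a plain underline. The sketch below shows the idea. The function name `underline_span`, the mapping table, and the choice to emit an inline style next to the class are all assumptions for illustration, not the library's serializer.

```python
from html import escape

# Hypothetical mapping from OOXML underline values to CSS
# text-decoration-style keywords; anything unlisted falls back to "solid".
CSS_STYLE = {
    "single": "solid",
    "double": "double",
    "dotted": "dotted",
    "dash": "dashed",
    "wave": "wavy",
}


def underline_span(text, underline):
    """Wrap text in a <span> that carries the underline type."""
    css = CSS_STYLE.get(underline, "solid")
    # Browsers without text-decoration-style support ignore that declaration
    # and still apply the plain underline from the first one.
    style = f"text-decoration: underline; text-decoration-style: {css}"
    return (
        f'<span class="{escape(underline)}" style="{style}">'
        f"{escape(text)}</span>"
    )
```

For the run shown above, `underline_span("double underlined", "double")` yields a span with `class="double"` that degrades to a single underline where the style property is unsupported.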

aerkalov commented 8 years ago

@alanbacon I would say that nothing will be broken by the changes you made. The only problem is that we have 1/0 for the underscore value. As the next step, I assume we would need to preserve the information about what kind of underscore line it is and implement serializers for underscore (and bold, italic, ...). That would cover both the case where we only want to serialize text with a single line and the case where we want different implementations for different kinds of underscore, which would help us with the differing requirements across projects. As it is with your change, it would always end up as a single line.
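
As a rough sketch of what per-property serializers could look like (every name here is hypothetical and not part of the current code base), a project could pass in its own table of handlers and so choose between collapsing every underline to a single line and keeping the type:

```python
from html import escape


def underline_as_single(html, value):
    """Collapse every underline type to a plain <u> element."""
    return f"<u>{html}</u>"


def underline_with_type(html, value):
    """Keep the underline type as a CSS class on a <span>."""
    return f'<span class="underline-{escape(value)}">{html}</span>'


# Default behaviour: always a single line, as with the current change.
DEFAULT_SERIALIZERS = {"u": underline_as_single}


def serialize_run(text, rpr, serializers=None):
    """Serialize one run, applying a handler for each property in rpr."""
    serializers = DEFAULT_SERIALIZERS if serializers is None else serializers
    html = escape(text)
    for key, value in rpr.items():
        handler = serializers.get(key)
        # Skip properties without a handler and falsy (off) values.
        if handler is not None and value:
            html = handler(html, value)
    return html
```

A project that wants typed underlines would call `serialize_run(text, rpr, {"u": underline_with_type})`, and handlers for bold or italic could be added to the same table.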

BTW, are you using this lib somewhere? What are your needs? Are you more concerned with the parsing part or the serialization?

alanbacon commented 8 years ago

> The only problem is that we have 1/0 for the underscore value

Yes, this is why I was concerned that something somewhere might break. The underline status should really be stored as a string (or possibly an enum), not a boolean. My needs are few: I use this library to read .docx files into the Python workspace, and once I have the Python document object I write my own custom code to extract the data I'm interested in into my own structured format. This library was very helpful for me, as I did not need any great understanding of the OOXML file structure.
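
To make the enum idea concrete, here is a small sketch. The `Underline` class and the `coerce_underline` helper are hypothetical, and only a handful of the underline values are listed. Mixing in `str` keeps comparisons such as `value == 'double'` working, and the custom `__bool__` lets existing true/false checks keep their meaning.

```python
from enum import Enum


class Underline(str, Enum):
    """A few underline types; a real version would list more values."""

    NONE = "none"
    SINGLE = "single"
    DOUBLE = "double"
    DASH = "dash"
    DOTTED = "dotted"
    WAVE = "wave"

    def __bool__(self):
        # Only NONE is falsy, so `if rpr['u']:` behaves like the old 1/0 flag.
        return self is not Underline.NONE


def coerce_underline(value):
    """Accept legacy 1/0 or True/False flags as well as style names."""
    if isinstance(value, Underline):
        return value
    if isinstance(value, str):
        # Raises ValueError for a style name that is not listed above.
        return Underline(value)
    return Underline.SINGLE if value else Underline.NONE
```

Code that still writes 1 or 0 can be passed through `coerce_underline`, so old and new callers end up with the same type.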

alanbacon commented 7 years ago

Pull request submitted: https://github.com/booktype/python-ooxml/pull/34