eea / odfpy

API for OpenDocument in Python
GNU General Public License v2.0
308 stars 63 forks source link

lost spaces in spreadsheet text cells #95

Open joernhees opened 4 years ago

joernhees commented 4 years ago

debugging https://github.com/pandas-dev/pandas/issues/32207 i now think that this probably belongs here somewhere...

Code Sample

Create a new spreadsheet with 1 column "testcol" in LibreOffice / OpenOffice & Excel, save as test.ods/test.xlsx:

testcol
this is great
4    spaces
1 trailing 
 1 leading
2  spaces  multiple  times

For simplicity here as zip: (1 ods, 1 xlsx): spreadsheets.zip

import pandas as pd
for i in pd.read_excel('test.ods', engine='odf')['testcol']:
    print(repr(i))

# output:
'this is great'
'4 spaces'
'1 trailing '
'1 leading'
'2 spaces multiple times'

for i in pd.read_excel('test.xlsx')['testcol']:
    print(repr(i))

# output:
'this is great'
'4    spaces'
'1 trailing '
' 1 leading'
'2  spaces  multiple  times'

Problem description

When reading .ods files (OpenOffice or LibreOffice) multiple spaces are collapsed into one, leading ones are lost, trailing ones preserved.

Expected Output

See excel output above.

debugging so far:

Digging into this, it seems that pandas when getting the cell's value here actually already gets a cell from which the original string isn't re-constructable. In the debugger it seems that the cell 4 spaces is actually already parsed into 3 childnodes, where the ' ' end up as an Element that doesn't print its whitespace only values when str(cell) is called:

image

Sadly, at this point i end up running into your parsing code... and i have to say that i'm lost...

detrout commented 4 years ago

I looked at the ods files provided by @joernhees and discovered the tag which I hadn't known about and so didn't parse.

It looks like https://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html#__RefHeading__1415200_253892949 is the relevant documentation of what to do.