PydPiper / pylightxl

A light weight, zero dependency, minimal functionality excel read/writer python library
https://pylightxl.readthedocs.io
MIT License
303 stars 47 forks source link

Incorrect values for spreadsheets with large amounts of text #20

Closed hgcummings closed 4 years ago

hgcummings commented 4 years ago

In spreadsheets with large amounts of text, I'm finding that reading text from one field sometimes gives me the text (or even partial text) of another field.

As an example, using this publicly available Excel document: https://digital.nhs.uk/binaries/content/assets/legacy/excel/8/c/hes_data_dictionary_20170914.xlsx (note that there's a Content-Disposition header that means it will save under a different filename)

For example, print(db.ws('APC').address(address='H276')) should give "Age at start of the episode (STARTAGE), with decimalised values for babies." but it actually gives "Rule # 631, 641, 651 , 660 and 671", which appears to be the contents of cell J43 in a different sheet (OP). Requesting the contents of that cell (print(db.ws('OP').address(address='J43'))) gives the text from K32 in sheet APC.

Meanwhile, print(db.ws('APC').address(address='H261')) gives "34 Decision not to treat - decision not to treat made or no further contact required", which is one line from the middle of a large text cell (which is repeated in all three sheets, so I'm not exactly sure which one its coming from).

Other possibly useful observations: I certainly see correct results for large portions of the file. It seems to start going wrong about halfway through. The errors seem to be consistently reproducible (i.e. I get the exact same wrong text for each cell each time). Deleting all the contents of one of the text-heavy columns (column I) seemed to resolve the issue (but undoing this change and saving again caused it to come back). I guess there's some subtlety in how the Excel file format references text internally that's throwing off pylightxl.

Let me know if I can provide any more information.

PydPiper commented 4 years ago

Hi there, thank you for taking a look at pylightxl and submitting this to our attention! We are taking a look at it now.

PydPiper commented 4 years ago

This is a confirmed bug. We will take a look at the fix this week. I'll keep you posted on updates! :) thanks again for submitting this!

PydPiper commented 4 years ago

Issue identified: a single cell with multiple text formats within the cell is split up differently in sharedString.xml than normal string.

2020-08-22 17_16_32-C__Users_vkisf_Desktop_test_sharedStrings xml - Internet Explorer

working on accounting for these cases now

PydPiper commented 4 years ago

the latest commit on master (https://github.com/PydPiper/pylightxl/commit/644313050a41835890dc0195604746947d96c2a6) has the fix. Please give it a try and let us know if your issue still persists.

We are currently in the process or writing unit test against this and will push it to the next version.

hgcummings commented 4 years ago

Sorry for not responding sooner. Thanks for picking this up. I can confirm that my original issue is resolved in v1.46