XLSForm / pyxform

A Python package to create XForms for ODK Collect.
BSD 2-Clause "Simplified" License
77 stars 134 forks source link

Explore alternatives for reading Excel files because xlrd has been deprecated for a long time and is reported to be unstable #501

Closed lognaturel closed 2 years ago

lognaturel commented 3 years ago

xlrd has been passively maintained since 2016: https://groups.google.com/g/python-excel/c/P6TjJgFVjMI/m/g8d0eWxTBQAJ

xlrd 2.0 was recently released and removes xlsx support. The recommended package for reading xlsx is openpyxl: http://www.python-excel.org/

The former maintainer of xlrd says "the xlsx reading in xlrd has become unreliable in Python 3.9".

Some questions to investigate:

jkpr commented 3 years ago

A lot of my tools are based on xlrd, so I will have to change over to openpyxl in the near future. Pandas prefers openpyxl for xlsx files, so that is a vote of confidence in some sense.

I checked my emails with our partners in Africa and south Asia, and distressingly, many people still use the .xls format. I do not know why. Dropping xls entirely is probably not good.

In the meantime, pyxform can depend on a pinned version of xlrd that does support .xlsx, right, e.g. v1.x.y? Then, couldn't we move forward with using xlrd for .xls and openpyxl for .xlsx?

The only problem we know about is Python 3.9 using xlrd for xlsx, so wouldn't the extent of the problem be related to the level of Python 3.9 usage? I am on Python 3.8 currently, just recently upgraded. I do not know when I will switch to 3.9.

KeynesYouDigIt commented 2 years ago

@jkpr just trying to follow along - when you say "A lot of my tools are based on xlrd," - are you talking about tools or functions inside of pyxform?

KeynesYouDigIt commented 2 years ago

If Pandas likes openpyxl then I think that endorsement is extremely strong. I'd be happy to take a shot at encapsulating all of our uses of xlrd into an internal subpackage, which would make the transition work much easier to reason about, thoughts?

Another thought out of left fieldIf it's not to heavy, then even using Pandas might make sense, given how widespread that tool is - if we just used Pandas, then the Pandas team could worry about what library to use when reading excel spreadsheets

jkpr commented 2 years ago

I mean that a lot of the personal development tools that I use are based on xlrd. They are separate from pyxform. They read excel files and do things.

jnm commented 2 years ago

I haven't used this enough to really suggest it, and clearly it doesn't have the traction of openpyxl, but this package got me out of a bind when I needed quick support for Python 3.9: https://github.com/thombashi/excelrd. It's a drop-in replacement for xlrd.

Pandas is a real behemoth that I'd prefer not to see as a dependency. To be fair, installing it isn't as bad as it used to be thanks to binary wheels, if those are available on your platform.

lognaturel commented 2 years ago

Thanks for chiming in, all.

@yanokwa was there something in particular prompting #555 or just wanting to keep up with updates? I see Python 3.8 is going to get security updates until 14 Oct 2024 so I don't feel huge amounts of pressure to support Python 3.9 immediately. I think if there's some kind of pressure, @jnm's suggestion of using a drop-in replacement initially is a good one. Maybe that's a good first move regardless.

I've been looking at file extensions of forms I see and as @jkpr said, there are a surprising number of xls docs. I think it's largely because ODK is so old -- some folks have been working off the same base form for a decade+. Saving an existing xls file as xlsx doesn't seem like a big deal but it's the kind of thing our users might struggle with. Still, I'd rather not have both xlrd and openpyxl in play if we can help it. Maybe we can announce the dropping of xls files a year in advance and then switch entirely to openpyxl?

@jkpr what's your primary Python version looking like these days? Have you made the switch to openpyxl in some of your other projects and if so, how did it go?

jkpr commented 2 years ago

For my current work, I have been reading Excel files with Pandas. There always seem to be a few edge cases, but overall everything has its solution.

lognaturel commented 2 years ago

Great! I agree with @jnm about not bringing in all of Pandas. But your experience is a positive for the underlying openpyxl.

yanokwa commented 2 years ago

@lognaturel I'm not arguing for removal of Python 3.8. I'm more arguing for addition of Python 3.9 since it's been out for a year.

I'd bias to using openpyxl and putting a warning in pyxform for the next year to get people to save as xlsx. That's mostly because I do prefer using a library that is in pandas and has a long history.

That said, I have no idea why xls is so bad of a format. It seems to work fine, but I guess xlsx is a more open format?

lindsay-stevens commented 2 years ago

TLDR proposal:

The problem for xlrd seems to have been that Python 3.9 changed the ElementTree API in a way that broke it's usage with defusedxml. Arguably Pyxform should have been following the now-deleted advice from xlrd to use defusedxml when processing XLSX files. Issues were deactivated in the xlrd repo but some context is here: https://github.com/tiran/defusedxml/issues/48

As mentioned in the groups thread the point of defusedxml here is that the XLSX file is a zip file containing XML, and someone could just open up a XLSX file and replace a normal XML file with a XML bomb or similar. This in turn would be a DOS problem for anyone using pyxform to process untrusted XLSX files. Since there is an intent to move Pyxform to Python 3.9, instead of just adding a dependency on defusedxml we might as well migrate to Openpyxl + defusedxml for XLSX files.

There seems to be a use case in this ticket for keeping XLS support for the time being. On the one hand it has not been the default Excel format since 2007 so it would be reasonable to end support at some stage (maybe Python 3.10 / in about a year?). On the other hand pyxform is implementing a XLSForm spec so it would be reasonable to expect it to support XLS.

If xlrd breaks in future, unfortunately many python modules dealing with XLS are actually xlrd wrappers (pyexcel-xls, excelrd, pandas). Presumably if xlrd breaks then these larger projects will want to come up with a Python solution, or adopt xlrd to maintain it. If nobody fixes xlrd, some options for pyxform to continue XLS support could be:

Or if none of these are feasible then as Yaw said, add a deprecation warning for XLS files and encourage users to convert their files to XLSX first with Excel or LibreOffice Calc.

In terms of the time cost to support openpyxl, it's not trivial but not heinous either. Overall the API usage of xlrd is minimal: open a workbook, iterate through sheets / rows / columns, check cell types / date modes, catch processing errors. The below are extracted lines either directly or transitively using xlrd in pyxform. The method names / patterns for this are different in openpyxl but the concepts would translate.

utils.py

wb = xlrd.open_workbook(workbook_path)
sheet = wb.sheet_by_name(sheet_name)
except xlrd.biffh.XLRDError:

xls2json_backends.py

import xlrd
from xlrd import XLRDError
from xlrd.xldate import XLDateAmbiguous

workbook = xlrd.open_workbook(filename=path_or_file)
workbook = xlrd.open_workbook(file_contents=path_or_file.read())
except XLRDError as error:
except XLDateAmbiguous:
if value_type == xlrd.XL_CELL_BOOLEAN:
elif value_type == xlrd.XL_CELL_NUMBER:
elif value_type is xlrd.XL_CELL_DATE:
datetime_or_time_only = xlrd.xldate_as_tuple(value, datemode)

row_dict[key] = xls_value_to_unicode(value, value_type, workbook.datemode)
for sheet in workbook.sheets():
if len(workbook.sheets()) == 1:

for row in range(1, sheet.nrows):
for column in range(0, sheet.ncols):
key = "%s" % sheet.cell_value(0, column)
value = sheet.cell_value(row, column)
value_type = sheet.cell_type(row, column)
result[sheet.name] = []
RafaelKluender commented 2 years ago

Dear ODK team, I just ran into that and solved the problem by installing a second python environment using 3.8. Do you have any idea when pyxform will be made compatible with python 3.9?

Many thanks for your work.

sheppard commented 2 years ago

Hi all, I just submitted #575, which enables openpyxl for XLSX files, making it possible to upgrade xlrd to v2. (I recently did this for another project so the necessary changes were fresh in my memory.) One nice thing is that openpyxl does a better job of generating the appropriate Python types for formatted date/time/datetime and integer values.

As for the question of whether/when to drop XLS support, a phased approach may be appropriate:

Phase 1: Release next version with both xlrd and openpyxl dependencies Phase 2: (Optionally) Isolate each backend into a separate file (e.g. pyxform/backends/xlsx.py, pyxform/backends/xls.py). Phase 3: Move xlrd from install_requires to an entry in extras_require and make all xlrd imports conditional (See this example). Users who pip install pyxform will not have XLS support by default, but those who need it would pip install pyxform[oldexcel] instead. Phase 4: Remove XLS support entirely.

lindsay-stevens commented 2 years ago

Openpyxl support is merged - thanks @sheppard!

A thread to discuss the proposed deprecation plan has been opened on the ODK forum.