Closed lognaturel closed 2 years ago
A lot of my tools are based on xlrd
, so I will have to change over to openpyxl
in the near future. Pandas prefers openpyxl
for xlsx
files, so that is a vote of confidence in some sense.
I checked my emails with our partners in Africa and south Asia, and distressingly, many people still use the .xls
format. I do not know why. Dropping xls
entirely is probably not good.
In the meantime, pyxform
can depend on a pinned version of xlrd
that does support .xlsx
, right, e.g. v1.x.y
? Then, couldn't we move forward with using xlrd
for .xls
and openpyxl
for .xlsx
?
The only problem we know about is Python 3.9 using xlrd
for xlsx
, so wouldn't the extent of the problem be related to the level of Python 3.9 usage? I am on Python 3.8 currently, just recently upgraded. I do not know when I will switch to 3.9.
@jkpr just trying to follow along - when you say "A lot of my tools are based on xlrd," - are you talking about tools or functions inside of pyxform?
If Pandas likes openpyxl then I think that endorsement is extremely strong. I'd be happy to take a shot at encapsulating all of our uses of xlrd into an internal subpackage, which would make the transition work much easier to reason about, thoughts?
Another thought out of left fieldIf it's not to heavy, then even using Pandas might make sense, given how widespread that tool is - if we just used Pandas, then the Pandas team could worry about what library to use when reading excel spreadsheets
I mean that a lot of the personal development tools that I use are based on xlrd. They are separate from pyxform. They read excel files and do things.
I haven't used this enough to really suggest it, and clearly it doesn't have the traction of openpyxl, but this package got me out of a bind when I needed quick support for Python 3.9: https://github.com/thombashi/excelrd. It's a drop-in replacement for xlrd.
Pandas is a real behemoth that I'd prefer not to see as a dependency. To be fair, installing it isn't as bad as it used to be thanks to binary wheels, if those are available on your platform.
Thanks for chiming in, all.
@yanokwa was there something in particular prompting #555 or just wanting to keep up with updates? I see Python 3.8 is going to get security updates until 14 Oct 2024 so I don't feel huge amounts of pressure to support Python 3.9 immediately. I think if there's some kind of pressure, @jnm's suggestion of using a drop-in replacement initially is a good one. Maybe that's a good first move regardless.
I've been looking at file extensions of forms I see and as @jkpr said, there are a surprising number of xls docs. I think it's largely because ODK is so old -- some folks have been working off the same base form for a decade+. Saving an existing xls file as xlsx doesn't seem like a big deal but it's the kind of thing our users might struggle with. Still, I'd rather not have both xlrd and openpyxl in play if we can help it. Maybe we can announce the dropping of xls files a year in advance and then switch entirely to openpyxl?
@jkpr what's your primary Python version looking like these days? Have you made the switch to openpyxl
in some of your other projects and if so, how did it go?
For my current work, I have been reading Excel files with Pandas. There always seem to be a few edge cases, but overall everything has its solution.
Great! I agree with @jnm about not bringing in all of Pandas. But your experience is a positive for the underlying openpyxl
.
@lognaturel I'm not arguing for removal of Python 3.8. I'm more arguing for addition of Python 3.9 since it's been out for a year.
I'd bias to using openpyxl and putting a warning in pyxform for the next year to get people to save as xlsx. That's mostly because I do prefer using a library that is in pandas and has a long history.
That said, I have no idea why xls is so bad of a format. It seems to work fine, but I guess xlsx is a more open format?
TLDR proposal:
The problem for xlrd seems to have been that Python 3.9 changed the ElementTree API in a way that broke it's usage with defusedxml. Arguably Pyxform should have been following the now-deleted advice from xlrd to use defusedxml when processing XLSX files. Issues were deactivated in the xlrd repo but some context is here: https://github.com/tiran/defusedxml/issues/48
As mentioned in the groups thread the point of defusedxml here is that the XLSX file is a zip file containing XML, and someone could just open up a XLSX file and replace a normal XML file with a XML bomb or similar. This in turn would be a DOS problem for anyone using pyxform to process untrusted XLSX files. Since there is an intent to move Pyxform to Python 3.9, instead of just adding a dependency on defusedxml we might as well migrate to Openpyxl + defusedxml for XLSX files.
There seems to be a use case in this ticket for keeping XLS support for the time being. On the one hand it has not been the default Excel format since 2007 so it would be reasonable to end support at some stage (maybe Python 3.10 / in about a year?). On the other hand pyxform is implementing a XLSForm spec so it would be reasonable to expect it to support XLS.
If xlrd breaks in future, unfortunately many python modules dealing with XLS are actually xlrd wrappers (pyexcel-xls, excelrd, pandas). Presumably if xlrd breaks then these larger projects will want to come up with a Python solution, or adopt xlrd to maintain it. If nobody fixes xlrd, some options for pyxform to continue XLS support could be:
Or if none of these are feasible then as Yaw said, add a deprecation warning for XLS files and encourage users to convert their files to XLSX first with Excel or LibreOffice Calc.
In terms of the time cost to support openpyxl, it's not trivial but not heinous either. Overall the API usage of xlrd is minimal: open a workbook, iterate through sheets / rows / columns, check cell types / date modes, catch processing errors. The below are extracted lines either directly or transitively using xlrd in pyxform. The method names / patterns for this are different in openpyxl but the concepts would translate.
utils.py
wb = xlrd.open_workbook(workbook_path)
sheet = wb.sheet_by_name(sheet_name)
except xlrd.biffh.XLRDError:
xls2json_backends.py
import xlrd
from xlrd import XLRDError
from xlrd.xldate import XLDateAmbiguous
workbook = xlrd.open_workbook(filename=path_or_file)
workbook = xlrd.open_workbook(file_contents=path_or_file.read())
except XLRDError as error:
except XLDateAmbiguous:
if value_type == xlrd.XL_CELL_BOOLEAN:
elif value_type == xlrd.XL_CELL_NUMBER:
elif value_type is xlrd.XL_CELL_DATE:
datetime_or_time_only = xlrd.xldate_as_tuple(value, datemode)
row_dict[key] = xls_value_to_unicode(value, value_type, workbook.datemode)
for sheet in workbook.sheets():
if len(workbook.sheets()) == 1:
for row in range(1, sheet.nrows):
for column in range(0, sheet.ncols):
key = "%s" % sheet.cell_value(0, column)
value = sheet.cell_value(row, column)
value_type = sheet.cell_type(row, column)
result[sheet.name] = []
Dear ODK team, I just ran into that and solved the problem by installing a second python environment using 3.8. Do you have any idea when pyxform will be made compatible with python 3.9?
Many thanks for your work.
Hi all, I just submitted #575, which enables openpyxl
for XLSX files, making it possible to upgrade xlrd
to v2. (I recently did this for another project so the necessary changes were fresh in my memory.) One nice thing is that openpyxl
does a better job of generating the appropriate Python types for formatted date/time/datetime and integer values.
As for the question of whether/when to drop XLS support, a phased approach may be appropriate:
Phase 1: Release next version with both xlrd
and openpyxl
dependencies
Phase 2: (Optionally) Isolate each backend into a separate file (e.g. pyxform/backends/xlsx.py
, pyxform/backends/xls.py
).
Phase 3: Move xlrd
from install_requires
to an entry in extras_require
and make all xlrd
imports conditional (See this example). Users who pip install pyxform
will not have XLS support by default, but those who need it would pip install pyxform[oldexcel]
instead.
Phase 4: Remove XLS support entirely.
Openpyxl support is merged - thanks @sheppard!
A thread to discuss the proposed deprecation plan has been opened on the ODK forum.
xlrd has been passively maintained since 2016: https://groups.google.com/g/python-excel/c/P6TjJgFVjMI/m/g8d0eWxTBQAJ
xlrd 2.0 was recently released and removes xlsx support. The recommended package for reading xlsx is openpyxl: http://www.python-excel.org/
The former maintainer of xlrd says "the xlsx reading in xlrd has become unreliable in Python 3.9".
Some questions to investigate: