Closed korsbakken closed 2 months ago
Somebody got hacked here...
Seems that is a new spam strategy, I deleted the comments.
Thanks @korsbakken, great suggestion!
Pulling out the relevant kwargs to initialise the ExcelWriter would be a very nice improvement (keeping in mind that other kwargs are used to modify the parsed dataframe).
Related to that, we also had the idea to create an own classmethod to read xlsx files, see #690.
Just to clarify, this is about reading Excel files. I.e., being able to use the engine
keyword in IamDataFrame.__init__
, when constructing an IamDataFrame
instance from an Excel file Path or filename. Calamine doesn't write Excel files, or at least pandas doesn't currently support using it to do that. Calamine is a wrapper around a Rust package that can read Excel data (ignoring formatting) very fast by using a Rust binary. The package is python-calamine
.
You could probably enable using the engine
keyword when writing Excel using a similar approach to what I propose, but you might not get the same kind of speed up with any of the engines that pandas currently support for writing Excel files as opposed to reading them. And the edits would have to be made in different places, obviously.
Sorry for the confusion, I meant ExcelFile
instead of ExcelWriter
...
If you can start a PR, that would be very much appreciated!
I've submitted an initial PR now, #877.
When reading an Excel file into a new
IamDataFrame
instance,IamDataFrame.__init__
propagates keyword arguments toread_file
and further toread_pandas
. But in both of these methods, the Excel file is read by creating apandas.ExcelFile
instance with no kwargs. This results in an exception if you use kwargs that need to be passed to the `pandas.ExcelFile' init method.In particular, this affects the
engine
kwarg. Pandas uses openpyxl as the Excel engine by default. But openpyxl is quite slow. For a moderately large IAMC-format Excel file in my current project (22 MB, 153k rows, 24 year columns) I see a 10x speed-up if I read it withpandas.read_excel
and useengine="calamine"
instead of the default, about 3 second load time vs. around 30 with openpyxl.The total times are similar if I read the file by passing it to
pyam.IamDataFrame.__init__
(which then uses openpyxl) vs first reading it withpandas.read_excel
andengine="calamine"
and then pass the resulting DataFrame topyam.IamDataFrame
. Which is a workable workaround, but not very elegant, and results in quite a bit of extra boilerplate if you also need to read metadata.The fix should be relatively simple: In the two places where
pd.ExcelFile
is used (core.py line 200 inIamDataFrame._init
, and utils.py line 103, inread_pandas
), searchkwargs
for any keywords that are accepted bypandas.ExcelFile
(includingengine
), pop those off the kwargs and send them topd.ExcelFile
instead.I'm working on an implementation of this locally that I can submit as a PR. But please let me know if you see a reason why kwargs should be handled in the original way instead. Which basically means any reason that the user should not be able to specify
engine
orstorage_options
when reading from an Excel file.