IAMconsortium / pyam

Analysis & visualization of energy & climate scenarios
https://pyam-iamc.readthedocs.io/
Apache License 2.0
236 stars 120 forks source link

kwargs are not passed to `pandas.ExcelFile` when reading Excel files with `IamDataFrame` #876

Closed korsbakken closed 2 months ago

korsbakken commented 2 months ago

When reading an Excel file into a new IamDataFrame instance, IamDataFrame.__init__ propagates keyword arguments to read_file and further to read_pandas. But in both of these methods, the Excel file is read by creating a pandas.ExcelFile instance with no kwargs. This results in an exception if you use kwargs that need to be passed to the `pandas.ExcelFile' init method.

In particular, this affects the engine kwarg. Pandas uses openpyxl as the Excel engine by default. But openpyxl is quite slow. For a moderately large IAMC-format Excel file in my current project (22 MB, 153k rows, 24 year columns) I see a 10x speed-up if I read it with pandas.read_excel and use engine="calamine" instead of the default, about 3 second load time vs. around 30 with openpyxl.

The total times are similar if I read the file by passing it to pyam.IamDataFrame.__init__ (which then uses openpyxl) vs first reading it with pandas.read_excel and engine="calamine" and then pass the resulting DataFrame to pyam.IamDataFrame. Which is a workable workaround, but not very elegant, and results in quite a bit of extra boilerplate if you also need to read metadata.

The fix should be relatively simple: In the two places where pd.ExcelFile is used (core.py line 200 in IamDataFrame._init, and utils.py line 103, in read_pandas), search kwargs for any keywords that are accepted by pandas.ExcelFile (including engine), pop those off the kwargs and send them to pd.ExcelFile instead.

I'm working on an implementation of this locally that I can submit as a PR. But please let me know if you see a reason why kwargs should be handled in the original way instead. Which basically means any reason that the user should not be able to specify engine or storage_options when reading from an Excel file.

korsbakken commented 2 months ago

Somebody got hacked here...

danielhuppmann commented 2 months ago

Seems that is a new spam strategy, I deleted the comments.

danielhuppmann commented 2 months ago

Thanks @korsbakken, great suggestion!

Pulling out the relevant kwargs to initialise the ExcelWriter would be a very nice improvement (keeping in mind that other kwargs are used to modify the parsed dataframe).

Related to that, we also had the idea to create an own classmethod to read xlsx files, see #690.

korsbakken commented 2 months ago

Just to clarify, this is about reading Excel files. I.e., being able to use the engine keyword in IamDataFrame.__init__, when constructing an IamDataFrame instance from an Excel file Path or filename. Calamine doesn't write Excel files, or at least pandas doesn't currently support using it to do that. Calamine is a wrapper around a Rust package that can read Excel data (ignoring formatting) very fast by using a Rust binary. The package is python-calamine.

You could probably enable using the engine keyword when writing Excel using a similar approach to what I propose, but you might not get the same kind of speed up with any of the engines that pandas currently support for writing Excel files as opposed to reading them. And the edits would have to be made in different places, obviously.

danielhuppmann commented 2 months ago

Sorry for the confusion, I meant ExcelFile instead of ExcelWriter...

If you can start a PR, that would be very much appreciated!

korsbakken commented 2 months ago

I've submitted an initial PR now, #877.