IAMconsortium / pyam

Analysis & visualization of energy & climate scenarios
https://pyam-iamc.readthedocs.io/
Apache License 2.0
236 stars 120 forks source link

Handling kwargs to be passed to pandas.ExcelFile #877

Closed korsbakken closed 2 months ago

korsbakken commented 2 months ago

Detects kwargs passed to IamDataFrame.__init__ and to read_pandas that need to be passed to pandas.ExcelFile.__init__ to be handled properly.

This PR closes #876

Please confirm that this PR has done the following:

Tests: I have run existing tests, and tested locally that the new code reads Excel files correctly with engine="calamine" as a kwarg to pyam.IamDataFrame. But a new test under /tests/ that tests the added/fixed functionality (rather than just regression testing) would require that probably aren't intended to be part of the core requirements. If the lead developers would like me to do this, I would appreciate guidance on how to add a test that requires extra optional packages, without breaking the existing test suite.

Documentation: As far as I can tell, this is a bug fix, and shouldn't require new documentation. The one new helper function, utils.get_excel_file_with_kwargs, has a docstring that explains how it's used.

Name in AUTHORS.rst: This is a fairly trivial addition, and I don't need my name in the AUTHORS file for it. But I can add it in a update commit if you do want it there.

Description of PR

The PR modifies lines in pyam.core.IamDataFrame.__init__ and pyam.utils.read_pandas that create a pandas.ExcelFile instance to read an Excel file. The modifications ensure that keyword arguments that are recognized by pandas.ExcelFile.__init__ are passed on to the pandas.ExcelFile instance and removed from calls to later methods (which would cause, e.g., pandas.read_excel to raise an exception).

The modifications add a new helper function get_excel_file_with_kwargs to pyam.utils. The function takes an excel file path and arbitrary kwargs, extracts the keyword arguments that are accepted by pandas.ExcelFile.__init__, creates a new ExcelFile instance with those keyword arguments, and returns it along with a dict of the other keyword arguments that were not used.

The new functionality makes it possible to use the keyword arguments engine, storage_options and engine_kwargs with IamDataFrame.__init__ and utils.read_pandas when reading Excel files, which could cause an exception to be raised in the previous version.

danielhuppmann commented 2 months ago

Thanks!

One suggestion for adding a test: add a few more keyword arguments, like header, usecols, skiprows, nrows, then make a copy of https://github.com/IAMconsortium/pyam/blob/bc56c53f5a835b857aadab21e4a3f8cb74d9dab6/tests/test_io.py#L116 with

import_df = IamDataFrame(TEST_DATA_DIR / "test_df.xls", nrows=2)
assert_iamframe_equal(test_df_year.filter(scenario="scen_a"), import_df)
codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 96.29630% with 1 line in your changes missing coverage. Please review.

Project coverage is 95.0%. Comparing base (ddbb88e) to head (6a300c2). Report is 32 commits behind head on main.

Files Patch % Lines
pyam/utils.py 91.6% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #877 +/- ## ===================================== Coverage 95.0% 95.0% ===================================== Files 64 64 Lines 6134 6216 +82 ===================================== + Hits 5828 5910 +82 Misses 306 306 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

danielhuppmann commented 2 months ago

I guess that the pytest-legacy test (with pandas 2.1.2) is failing because engine_kwargs={"data_only": False} is not supported there? Can you add a pytest-skip for versions lower than 2.2 for that test? Then this PR is good to be merged!

korsbakken commented 2 months ago

I guess that the pytest-legacy test (with pandas 2.1.2) is failing because engine_kwargs={"data_only": False} is not supported there? Can you add a pytest-skip for versions lower than 2.2 for that test? Then this PR is good to be merged!

Yes. In fact, both the added tests fail with pandas < 2.2 for different reasons. test_read_xlsx_calamine fails because calamine wasn't added as an engine before 2.2.0. And test_read_xlsx_kwargs fails because pandas included hard-coded engine_kwargs that couldn't be overridden until the same version. I will add a warning about the latter if anyone tries to use engine_kwargs with pandas < 2.2 in the main code, and skip the tests in that case.

korsbakken commented 2 months ago

I pushed a new version now, with a warning if anyone tries to use engine_kwargs with pandas version lower than 2.2.0, and skipping the tests if pandas is lower than that version.

In order to actually test the new functionality in CI, I also added a new optional dependency group calamine to pyproject.toml and updated the pytest workflows to include it. In order to make that work, I also had to update poetry.lock. If you would rather not include that, you can either reject the last two commits if that's possible, or just let me know and I will push a revert commit.