bdilday / pychadwick

Python package to interface with chadwick library
GNU General Public License v2.0

game[s]_to_dataframe failing on pandas >=2 #30

Closed NickBall closed 1 year ago

NickBall commented 1 year ago

Description

On pandas 2.0+, chadwick.games_to_dataframe and chadwick.game_to_dataframe fail because pandas now strictly enforces the dtype passed to the DataFrame constructor. Both methods pass dtype="f8", and columns that hold strings (such as the game ID) cannot be cast to float64.

This can be reproduced on Python 3.8+ and pandas 2.0.2 with the existing test_pychadwick.py::test_load_games_to_df unit test:

(venv3.9) nick@astra:~/dev/nickball/forks/pychadwick$ python3 --version
Python 3.9.16
(venv3.9) nick@astra:~/dev/nickball/forks/pychadwick$ pip uninstall -y pandas && make install
(venv3.9) nick@astra:~/dev/nickball/forks/pychadwick$ pip freeze | grep pandas
pandas==2.0.2
(venv3.9) nick@astra:~/dev/nickball/forks/pychadwick$ pytest tests/
============================================================ test session starts =============================================================
platform linux -- Python 3.9.16, pytest-5.4.3, py-1.11.0, pluggy-0.13.1
rootdir: /home/nick/dev/nickball/forks/pychadwick
collected 6 items

tests/pychadwick/chadwick/test_pychadwick.py ..F...                                                                                    [100%]

================================================================== FAILURES ==================================================================
___________________________________________________________ test_load_games_to_df ____________________________________________________________

chadwick = <pychadwick.chadwick.Chadwick object at 0x7f6f682d06d0>, team_events = ['1982OAK.EVA', '1991BAL.EVA', '1954PHI.EVN']

>   ???

/home/nick/temp/pychadwick_fork/tests/pychadwick/chadwick/test_pychadwick.py:52:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
venv3.9/lib/python3.9/site-packages/pychadwick-0.5.0-py3.9-linux-x86_64.egg/pychadwick/chadwick.py:247: in games_to_dataframe
    dfs = [
venv3.9/lib/python3.9/site-packages/pychadwick-0.5.0-py3.9-linux-x86_64.egg/pychadwick/chadwick.py:248: in <listcomp>
    pd.DataFrame(list(self.process_game(game_ptr)), dtype="f8")
venv3.9/lib/python3.9/site-packages/pandas/core/frame.py:790: in __init__
    mgr = arrays_to_mgr(
venv3.9/lib/python3.9/site-packages/pandas/core/internals/construction.py:120: in arrays_to_mgr
    arrays, refs = _homogenize(arrays, index, dtype)
venv3.9/lib/python3.9/site-packages/pandas/core/internals/construction.py:607: in _homogenize
    val = sanitize_array(val, index, dtype=dtype, copy=False)
venv3.9/lib/python3.9/site-packages/pandas/core/construction.py:576: in sanitize_array
    subarr = _try_cast(data, dtype, copy)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

arr = array(['OAK198204060', 'OAK198204060', 'OAK198204060', 'OAK198204060',
       'OAK198204060', 'OAK198204060', 'OAK1982..., 'OAK198204060', 'OAK198204060', 'OAK198204060',
       'OAK198204060', 'OAK198204060', 'OAK198204060'], dtype=object)
dtype = dtype('float64'), copy = False

    def _try_cast(
        arr: list | np.ndarray,
        dtype: np.dtype,
        copy: bool,
    ) -> ArrayLike:
        """
        Convert input to numpy ndarray and optionally cast to a given dtype.

        Parameters
        ----------
        arr : ndarray or list
            Excludes: ExtensionArray, Series, Index.
        dtype : np.dtype
        copy : bool
            If False, don't copy the data if not needed.

        Returns
        -------
        np.ndarray or ExtensionArray
        """
        is_ndarray = isinstance(arr, np.ndarray)

        if is_object_dtype(dtype):
            if not is_ndarray:
                subarr = construct_1d_object_array_from_listlike(arr)
                return subarr
            return ensure_wrapped_if_datetimelike(arr).astype(dtype, copy=copy)

        elif dtype.kind == "U":
            # TODO: test cases with arr.dtype.kind in ["m", "M"]
            if is_ndarray:
                arr = cast(np.ndarray, arr)
                shape = arr.shape
                if arr.ndim > 1:
                    arr = arr.ravel()
            else:
                shape = (len(arr),)
            return lib.ensure_string_array(arr, convert_na_value=False, copy=copy).reshape(
                shape
            )

        elif dtype.kind in ["m", "M"]:
            return maybe_cast_to_datetime(arr, dtype)

        # GH#15832: Check if we are requesting a numeric dtype and
        # that we can convert the data to the requested dtype.
        elif is_integer_dtype(dtype):
            # this will raise if we have e.g. floats

            subarr = maybe_cast_to_integer_array(arr, dtype)
        else:
>           subarr = np.array(arr, dtype=dtype, copy=copy)
E           ValueError: could not convert string to float: 'OAK198204060'

venv3.9/lib/python3.9/site-packages/pandas/core/construction.py:765: ValueError
========================================================== short test summary info ===========================================================
FAILED tests/pychadwick/chadwick/test_pychadwick.py::test_load_games_to_df - ValueError: could not convert string to float: 'OAK198204060'
======================================================== 1 failed, 5 passed in 1.75s =========================================================
"F",0,"","F","F","",0,0,"N",0,"N",0,"N",1,0,0,0,"","","","","F","F","F","F","F","F","F","F","F","","","","F","F","F","F","F","","","","",0,0,0,0,0,0,0,0,0,"PHI","PHI","NY1",1,"T","F",2,3,0,47,0,"T","F",0,1,"F","F","hamng102","ennid101","F","F",0,0,0,0,0,0,0,0,0,"","","",0,0,0,0,0,0,0,0,0,0,0,0,0,"","F","F","F","F",0,0,0,0,0,0,0,0,0,0,F,F
"PHI195409260","NY1",11,1,0,0,0,"",3,2,"hamng102","?","hamng102","?","speng102","?","speng102","?","garaj101","lockw101","willd102","amalj101","gardb101","rhodd101","maysw101","mueld101","burgs101","","","64(1)3/GDP","F","F",4,4,2,"T","T",0,"F","F",2,"T","F",0,"F","F",6,"G","F","F","",0,0,"N",0,"N",0,"N",0,0,0,0,"43","64","","","F","F","F","F","F","F","F","F","F","speng102","","","F","F","F","F","F","","","","",0,4,3,0,6,4,0,0,0,"PHI","PHI","NY1",1,"F","F",2,3,0,48,1,"T","F",1,0,"T","T","ennid101","morgb102","F","F",2,3,99,0,0,0,0,0,0,"garaj101","","",0,0,0,0,0,0,0,0,0,0,0,0,0,"gardb101","T","F","F","F",0,0,0,0,0,0,0,0,0,0,F,F
"PHI195409260","NY1",11,1,2,0,0,"",3,2,"ennid101","?","ennid101","?","speng102","?","speng102","?","garaj101","lockw101","willd102","amalj101","gardb101","rhodd101","maysw101","mueld101","","","","5/FL","F","F",9,5,2,"T","T",0,"F","F",1,"F","F",0,"F","F",5,"F","F","T","",0,0,"N",0,"N",0,"N",0,0,0,0,"5","","","","F","F","F","F","F","F","F","F","F","","","","F","T","F","F","F","","","","",0,5,0,0,0,0,0,0,0,"PHI","PHI","NY1",1,"F","T",2,3,0,49,2,"T","F",0,0,"T","T","morgb102","jonew101","F","F",0,0,0,0,0,0,0,0,0,"","","",0,0,0,0,0,0,0,0,0,0,0,0,0,"amalj101","F","F","F","F",0,0,0,0,0,0,0,0,0,0,F,F
opening file /tmp/tmp.EVA

On pandas 1.3.5 the same test passes, but pandas emits a FutureWarning because the columns cannot all be cast to the dtype passed to the DataFrame constructor:

(venv3.7) nick@astra:~/dev/nickball/forks/pychadwick$ python --version
Python 3.7.16
(venv3.7) nick@astra:~/dev/nickball/forks/pychadwick$ pip freeze | grep pandas
pandas==1.3.5
(venv3.7) nick@astra:~/dev/nickball/forks/pychadwick$ pytest tests/
============================================================ test session starts =============================================================
platform linux -- Python 3.7.16, pytest-5.4.3, py-1.11.0, pluggy-0.13.1
rootdir: /home/nick/dev/nickball/forks/pychadwick
collected 6 items

tests/pychadwick/chadwick/test_pychadwick.py ......                                                                                    [100%]

============================================================== warnings summary ==============================================================
tests/pychadwick/chadwick/test_pychadwick.py::test_load_games_to_df
  /home/nick/dev/nickball/forks/pychadwick/venv3.7/lib/python3.7/site-packages/pychadwick-0.5.0-py3.7-linux-x86_64.egg/pychadwick/chadwick.py:249: FutureWarning: Could not cast to float64, falling back to object. This behavior is deprecated. In a future version, when a dtype is passed to 'DataFrame', either all columns will be cast to that dtype, or a TypeError will be raised
    for game_ptr in games

tests/pychadwick/chadwick/test_pychadwick.py::test_load_games_to_df
  /home/nick/dev/nickball/forks/pychadwick/tests/pychadwick/chadwick/test_pychadwick.py:55: FutureWarning: Could not cast to float64, falling back to object. This behavior is deprecated. In a future version, when a dtype is passed to 'DataFrame', either all columns will be cast to that dtype, or a TypeError will be raised
    df = chadwick.game_to_dataframe(next(games))

-- Docs: https://docs.pytest.org/en/latest/warnings.html
======================================================= 6 passed, 2 warnings in 5.47s ========================================================
"F",0,"","F","F","",0,0,"N",0,"N",0,"N",1,0,0,0,"","","","","F","F","F","F","F","F","F","F","F","","","","F","F","F","F","F","","","","",0,0,0,0,0,0,0,0,0,"PHI","PHI","NY1",1,"T","F",2,3,0,47,0,"T","F",0,1,"F","F","hamng102","ennid101","F","F",0,0,0,0,0,0,0,0,0,"","","",0,0,0,0,0,0,0,0,0,0,0,0,0,"","F","F","F","F",0,0,0,0,0,0,0,0,0,0,F,F
"PHI195409260","NY1",11,1,0,0,0,"",3,2,"hamng102","?","hamng102","?","speng102","?","speng102","?","garaj101","lockw101","willd102","amalj101","gardb101","rhodd101","maysw101","mueld101","burgs101","","","64(1)3/GDP","F","F",4,4,2,"T","T",0,"F","F",2,"T","F",0,"F","F",6,"G","F","F","",0,0,"N",0,"N",0,"N",0,0,0,0,"43","64","","","F","F","F","F","F","F","F","F","F","speng102","","","F","F","F","F","F","","","","",0,4,3,0,6,4,0,0,0,"PHI","PHI","NY1",1,"F","F",2,3,0,48,1,"T","F",1,0,"T","T","ennid101","morgb102","F","F",2,3,99,0,0,0,0,0,0,"garaj101","","",0,0,0,0,0,0,0,0,0,0,0,0,0,"gardb101","T","F","F","F",0,0,0,0,0,0,0,0,0,0,F,F
"PHI195409260","NY1",11,1,2,0,0,"",3,2,"ennid101","?","ennid101","?","speng102","?","speng102","?","garaj101","lockw101","willd102","amalj101","gardb101","rhodd101","maysw101","mueld101","","","","5/FL","F","F",9,5,2,"T","T",0,"F","F",1,"F","F",0,"F","F",5,"F","F","T","",0,0,"N",0,"N",0,"N",0,0,0,0,"5","","","","F","F","F","F","F","F","F","F","F","","","","F","T","F","F","F","","","","",0,5,0,0,0,0,0,0,0,"PHI","PHI","NY1",1,"F","T",2,3,0,49,2,"T","F",0,0,"T","T","morgb102","jonew101","F","F",0,0,0,0,0,0,0,0,0,"","","",0,0,0,0,0,0,0,0,0,0,0,0,0,"amalj101","F","F","F","F",0,0,0,0,0,0,0,0,0,0,F,F
opening file /tmp/tmp.EVA

Restricting pandas to 1.x with a version range in requirements.txt also avoids the failure (note that a PEP 440 range separates its clauses with a comma, e.g. pandas>=1.1.0,<2.0):

(venv3.7) nick@astra:~/dev/nickball/forks/pychadwick$ grep pandas requirements.txt
pandas>=1.1.0<2.0
(venv3.7) nick@astra:~/dev/nickball/forks/pychadwick$ pip freeze | grep pandas
pandas==1.3.5
(venv3.7) nick@astra:~/dev/nickball/forks/pychadwick$ pytest tests/ -q
......

Reproduce

Full test case:

(venv) nick@astra:~/temp/repro$ python3 --version
Python 3.9.16
(venv) nick@astra:~/temp/repro$ pip install pychadwick
Collecting pychadwick
  Using cached pychadwick-0.5.0.tar.gz (119 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pandas>=1.0.4
  Using cached pandas-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.4 MB)
[...]
(venv) nick@astra:~/temp/repro$ pip freeze | grep pandas
pandas==2.0.2
(venv) nick@astra:~/temp/repro$ python3
Python 3.9.16 (main, Dec  7 2022, 01:12:08)
[GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pychadwick.chadwick import Chadwick
>>> chadwick = Chadwick()
>>> file_path = "https://raw.githubusercontent.com/chadwickbureau/retrosheet/master/event/regular/1982OAK.EVA"
>>> games = chadwick.games(file_path)
>>> game = next(games)
>>> game
<pychadwick.game.LP_CWGame object at 0x7f485c7896c0>
>>> chadwick.game_to_dataframe(game)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nick/temp/repro/venv/lib/python3.9/site-packages/pychadwick/chadwick.py", line 259, in game_to_dataframe
    pd.DataFrame(list(self.process_game(game_ptr)), dtype="f8"),
  File "/home/nick/temp/repro/venv/lib/python3.9/site-packages/pandas/core/frame.py", line 790, in __init__
    mgr = arrays_to_mgr(
  File "/home/nick/temp/repro/venv/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 120, in arrays_to_mgr
    arrays, refs = _homogenize(arrays, index, dtype)
  File "/home/nick/temp/repro/venv/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 607, in _homogenize
    val = sanitize_array(val, index, dtype=dtype, copy=False)
  File "/home/nick/temp/repro/venv/lib/python3.9/site-packages/pandas/core/construction.py", line 576, in sanitize_array
    subarr = _try_cast(data, dtype, copy)
  File "/home/nick/temp/repro/venv/lib/python3.9/site-packages/pandas/core/construction.py", line 765, in _try_cast
    subarr = np.array(arr, dtype=dtype, copy=copy)
ValueError: could not convert string to float: 'OAK198204060'
>>>

To reproduce this from a source build, builds on Python 3.8+ can be enabled with pull request https://github.com/bdilday/pychadwick/pull/29.
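
For anyone hitting this in the meantime, here is a minimal sketch of the failing pattern and one possible workaround, using pandas only. The rows, the column names, and the records_to_dataframe helper are assumptions made up for illustration; they stand in for what process_game yields and are not part of pychadwick, and this is not necessarily how the library itself should be fixed.

```python
import pandas as pd

# Illustrative rows standing in for what Chadwick.process_game yields;
# the column names and values are assumptions made for this example.
records = [
    {"GAME_ID": "OAK198204060", "INN_CT": 1, "OUTS_CT": 0},
    {"GAME_ID": "OAK198204060", "INN_CT": 1, "OUTS_CT": 1},
]

# The failing pattern: a single dtype requested for every column.
# On pandas 2.x this raises because GAME_ID holds strings; on older
# versions it may only warn and fall back to object.
try:
    pd.DataFrame(records, dtype="f8")
except (ValueError, TypeError) as exc:
    print(f"pandas raised: {exc}")


def records_to_dataframe(rows):
    """Hypothetical helper: build the frame first, then cast only numeric columns."""
    df = pd.DataFrame(list(rows))
    numeric_cols = df.select_dtypes(include="number").columns
    # astype with a mapping casts just the named columns and returns a new frame.
    return df.astype({col: "f8" for col in numeric_cols})


df = records_to_dataframe(records)
print(df.dtypes)
```

Casting after construction keeps the float64 dtype for the numeric columns while leaving string columns untouched, so the same code runs on both pandas 1.x and 2.x.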

bdilday commented 1 year ago

closed by https://github.com/bdilday/pychadwick/pull/31