databricks / databricks-sql-python

Databricks SQL Connector for Python
Apache License 2.0

Unpin `pandas` #342

Open dhirschfeld opened 4 months ago

dhirschfeld commented 4 months ago

I would like to be able to use this library with the latest pandas version. Currently pandas is pinned to <2.2.0: https://github.com/databricks/databricks-sql-python/blob/05529900858d40add7bc9b7e4a8864921680cfa2/pyproject.toml#L14-L16

It would be good to remove this restriction.

dhirschfeld commented 4 months ago

The pin was added in:

To fix the issue described in:

...but that just avoids one problem whilst causing another: this library can't be used with the latest pandas :/

dhirschfeld commented 4 months ago

I'm opening this issue to track any progress towards compatibility with the latest pandas version.

dhirschfeld commented 4 months ago

Bump! I would like to upgrade to the latest version but am stuck on 3.0.1 because of this pin 😔

benc-db commented 3 months ago

Does 3.0.1 work with latest pandas? That would be an interesting data point.

dhirschfeld commented 1 month ago

> Does 3.0.1 work with latest pandas? That would be an interesting data point.

I've been using 3.0.1 in combination with pandas 2.2.2 with no issues:

❯ pip list | rg 'pandas|databricks'
databricks-connect              14.3.1
databricks-sdk                  0.20.0
databricks-sql-connector        3.0.1
pandas                          2.2.2

...but that's apparently only because I hadn't been querying any all-integer data sources. Running:

import sqlalchemy as sa  # `engine` is a SQLAlchemy engine using the databricks dialect

with engine.connect() as conn:
    res = conn.execute(sa.text("select 1")).scalar_one()

gives:

TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'
dhirschfeld commented 1 month ago

It seems pandas doesn't like assigning None into an integer array:

> /opt/python/envs/dev310/lib/python3.10/site-packages/pandas/core/internals/managers.py(1703)as_array()
   1701             pass
   1702         else:
-> 1703             arr[isna(arr)] = na_value
   1704 
   1705         return arr.transpose()

ipdb>  arr
array([[1]], dtype=int32)

ipdb>  isna(arr)
array([[False]])

ipdb>  na_value

ipdb>  na_value is None
True

If we go up the stack, we can see that we get errors if we try to assign anything other than an integer:

> /opt/python/envs/dev310/lib/python3.10/site-packages/databricks/sql/client.py(1149)_convert_arrow_table()
   1147         )
   1148 
-> 1149         res = df.to_numpy(na_value=None)
   1150         return [ResultRow(*v) for v in res]
   1151 

ipdb>  df.to_numpy(na_value=None)
*** TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

ipdb>  df.to_numpy(na_value=float('NaN'))
*** ValueError: cannot convert float NaN to integer

ipdb>  df.to_numpy(na_value=-99)
array([[1]], dtype=int32)

Casting to object before assigning does seem to work:

ipdb>  df.astype(object).to_numpy(na_value=None)
array([[1]], dtype=object)
dhirschfeld commented 1 month ago

The problematic function: https://github.com/databricks/databricks-sql-python/blob/a6e9b11131871de8b673e3072c5b64498df68217/src/databricks/sql/client.py#L1130-L1166

dhirschfeld commented 1 month ago

I can work around the issue by disabling pandas:

with engine.connect() as conn:
    cursor = conn.connection.cursor()
    cursor.connection.disable_pandas = True
    res = cursor.execute("select 1").fetchall()
>>> res
[Row(1=1)]

...but obviously the conversion to numpy needs to be fixed.

dhirschfeld commented 1 month ago

Probably casting to object before assigning a None value is the right fix.
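As a standalone sketch of what that fix might look like (this is not the actual `_convert_arrow_table` code, just a stand-in DataFrame): casting to object dtype first means the `na_value=None` assignment is legal, since an object array can hold None:

import pandas as pd

# hypothetical stand-in for the DataFrame built inside _convert_arrow_table
df = pd.DataFrame([[1]], dtype="int32")

# cast to object before converting, so None needs no int cast
res = df.astype(object).to_numpy(na_value=None)
print(res.dtype, res[0][0])  # object 1

The cost is that the resulting array is object-typed rather than int32, but since the rows are immediately unpacked into `ResultRow` tuples anyway, the dtype of the intermediate array shouldn't matter.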

diego-jd commented 1 month ago

I second this. I cannot use pd.read_sql_query() because of this requirement.

Also, it would be good if you removed the distutils dependency, since distutils was removed from the standard library in Python 3.12.