dhirschfeld opened 4 months ago
The pin was added in:
To fix the issue described in:
...but that just avoids one problem whilst causing another: this library can't be used with the latest pandas. :/
I'm opening this issue to track progress towards compatibility with the latest pandas version.
Bump! I would like to upgrade to the latest version but am stuck on 3.0.1 because of this pin 😔
Does 3.0.1 work with latest pandas? That would be an interesting data point.
I've been using 3.0.1 in combination with pandas 2.2.2 with no issues:
❯ pip list | rg 'pandas|databricks'
databricks-connect 14.3.1
databricks-sdk 0.20.0
databricks-sql-connector 3.0.1
pandas 2.2.2
...but that's apparently only because I don't query any all-int data sources.
Running:
with engine.connect() as conn:
    res = conn.execute(sa.text("select 1")).scalar_one()
gives:
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'
It seems like it doesn't like assigning a None into an integer array:
> /opt/python/envs/dev310/lib/python3.10/site-packages/pandas/core/internals/managers.py(1703)as_array()
1701 pass
1702 else:
-> 1703 arr[isna(arr)] = na_value
1704
1705 return arr.transpose()
ipdb> arr
array([[1]], dtype=int32)
ipdb> isna(arr)
array([[False]])
ipdb> na_value
ipdb> na_value is None
True
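The assignment in the traceback can be reproduced with plain numpy, independent of pandas or this library: numpy coerces the assigned value to the array's dtype even when the boolean mask selects no elements, so None can never be assigned into an integer array. A minimal sketch of the failing internals:

```python
import numpy as np

# Mirror of `arr[isna(arr)] = na_value` with arr int32 and na_value None.
arr = np.array([[1]], dtype=np.int32)
mask = np.zeros_like(arr, dtype=bool)  # no missing values anywhere

try:
    arr[mask] = None  # numpy converts None to int32 before checking the mask
except TypeError as exc:
    print(f"TypeError: {exc}")

# An object array accepts None without complaint:
obj = arr.astype(object)
obj[mask] = None
print(obj)  # unchanged: [[1]]
```

This is why the error fires even though `isna(arr)` is all-False in the session above.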
If we go up the stack we can see we get type errors if we try to assign anything other than an integer:
> /opt/python/envs/dev310/lib/python3.10/site-packages/databricks/sql/client.py(1149)_convert_arrow_table()
1147 )
1148
-> 1149 res = df.to_numpy(na_value=None)
1150 return [ResultRow(*v) for v in res]
1151
ipdb> df.to_numpy(na_value=None)
*** TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'
ipdb> df.to_numpy(na_value=float('NaN'))
*** ValueError: cannot convert float NaN to integer
ipdb> df.to_numpy(na_value=-99)
array([[1]], dtype=int32)
Casting to object before assigning does seem to work:
ipdb> df.astype(object).to_numpy(na_value=None)
array([[1]], dtype=object)
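For completeness, the object route also handles columns that actually contain missing values: after the cast, to_numpy(na_value=None) substitutes None for the NA markers instead of raising. A sketch (the Int64 column here is just an illustration, not necessarily the connector's dtype):

```python
import pandas as pd

# A nullable-integer column with one missing value.
df = pd.DataFrame({"a": pd.array([1, None], dtype="Int64")})

# Via object, both present and missing values convert cleanly:
out = df.astype(object).to_numpy(na_value=None)
print(out.tolist())  # [[1], [None]]
```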
I can work around the issue by disabling pandas:
with engine.connect() as conn:
    cursor = conn.connection.cursor()
    cursor.connection.disable_pandas = True
    res = cursor.execute("select 1").fetchall()
>>> res
[Row(1=1)]
...but obviously the conversion to numpy needs to be fixed. Casting to object before assigning a None value is probably the right fix.
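Applied to the conversion in the traceback, the fix would look something like the following. This is only a sketch, not the actual databricks-sql-connector source; the function name is hypothetical and based on the `_convert_arrow_table` frame shown above:

```python
import pandas as pd

def rows_from_dataframe(df: pd.DataFrame):
    """Sketch of the proposed fix: cast to object before converting,
    so integer columns can accept None as the NA value."""
    res = df.astype(object).to_numpy(na_value=None)
    return [tuple(v) for v in res]

# The all-int frame that currently raises via to_numpy(na_value=None):
rows = rows_from_dataframe(pd.DataFrame({"a": [1]}).astype("int32"))
assert rows == [(1,)]
```

The only behavioral difference versus the current code is the extra object cast, which trades a little conversion speed for correctness on integer result sets.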
I second this. I cannot use pd.read_sql_query() because of this requirement.
Also, it would be good to remove the distutils dependency.
I would like to be able to use this library with the latest pandas version. Currently pandas is pinned to <2.2.0: https://github.com/databricks/databricks-sql-python/blob/05529900858d40add7bc9b7e4a8864921680cfa2/pyproject.toml#L14-L16
It would be good to remove this restriction.