CybercentreCanada / jupyterlab-sql-editor

A JupyterLab extension providing a SQL formatter, auto-completion, and syntax highlighting for Spark SQL and Trino
BSD 3-Clause "New" or "Revised" License

Incompatibility with pandas ^2.0.0 when querying datetime64 columns #168

Open AchazRyus opened 1 month ago

AchazRyus commented 1 month ago

I am using pandas 2.0.0 (the problem also occurs with later pandas versions) and running SQL queries with %%sparksql that return datetime64 data. I encounter this error even with a basic query such as

%%sparksql select * from db.table

The error:

Casting to unit-less dtype 'datetime64' is not supported. Pass e.g. 'datetime64[ns]' instead.

The workaround is to downgrade pandas. The issue could be solved by adding a converter when parsing a datetime series:

```python
from pyspark.sql.types import TimestampType

# inside _create_converter_to_pandas, where `pandas_type` is the
# target dtype computed by the enclosing converter
def correct_dtype(pser: pd.Series) -> pd.Series:
    if pd.api.types.is_datetime64_any_dtype(pser):
        return pser.astype("datetime64[ns]", copy=False)
    return pser.astype(pandas_type, copy=False)
```

If this is not the behavior you want, consider adding an upper bound on the pandas version in pyproject.toml:

"pandas<2.0.0"

Complete stack trace:

File [/opt/conda/lib/python3.11/site-packages/IPython/core/interactiveshell.py:2541](lib/python3.11/site-packages/IPython/core/interactiveshell.py#line=2540), in InteractiveShell.run_cell_magic(self, magic_name, line, cell)
   2539 with self.builtin_trap:
   2540     args = (magic_arg_s, cell)
-> 2541     result = fn(*args, **kwargs)
   2543 # The code below prevents the output from being displayed
   2544 # when using magics with decorator @output_can_be_silenced
   2545 # when the last Python token in the expression is a ';'.
   2546 if getattr(fn, magic.MAGIC_OUTPUT_CAN_BE_SILENCED, False):

File [/opt/conda/lib/python3.11/site-packages/jupyterlab_sql_editor/ipython_magic/sparksql/sparksql.py:263](opt/conda/lib/python3.11/site-packages/jupyterlab_sql_editor/ipython_magic/sparksql/sparksql.py#line=262), in SparkSql.sparksql(self, line, cell, local_ns)
    261 end = time()
    262 print(f"Execution time: {end - start:.2f} seconds")
--> 263 pdf = to_pandas(results, self.spark._jconf) if not use_cache else self.cached_pdf
    264 if len(pdf) > limit:
    265     print(f"Only showing top {limit} {'row' if limit == 1 else 'rows'}")

File [/opt/conda/lib/python3.11/site-packages/jupyterlab_sql_editor/outputters/util.py:632](opt/conda/lib/python3.11/site-packages/jupyterlab_sql_editor/outputters/util.py#line=631), in to_pandas(df, jconf)
    628     # struct_in_pandas = jconf.pandasStructHandlingMode()
    629     struct_in_pandas = "dict"
    631     return pd.concat(
--> 632         [
    633             _create_converter_to_pandas(
    634                 field.dataType,
    635                 field.nullable,
    636                 timezone=timezone,
    637                 struct_in_pandas=("row" if struct_in_pandas == "legacy" else struct_in_pandas),
    638                 error_on_duplicated_field_names=False,
    639                 timestamp_utc_localized=False,
    640             )(pser)
    641             for (_, pser), field in zip(pdf.items(), df.schema.fields)
    642         ],
    643         axis="columns",
    644     )
    645 else:
    646     return pdf

File [/opt/conda/lib/python3.11/site-packages/jupyterlab_sql_editor/outputters/util.py:633](opt/conda/lib/python3.11/site-packages/jupyterlab_sql_editor/outputters/util.py#line=632), in <listcomp>(.0)
    628     # struct_in_pandas = jconf.pandasStructHandlingMode()
    629     struct_in_pandas = "dict"
    631     return pd.concat(
    632         [
--> 633             _create_converter_to_pandas(
    634                 field.dataType,
    635                 field.nullable,
    636                 timezone=timezone,
    637                 struct_in_pandas=("row" if struct_in_pandas == "legacy" else struct_in_pandas),
    638                 error_on_duplicated_field_names=False,
    639                 timestamp_utc_localized=False,
    640             )(pser)
    641             for (_, pser), field in zip(pdf.items(), df.schema.fields)
    642         ],
    643         axis="columns",
    644     )
    645 else:
    646     return pdf

File [/opt/conda/lib/python3.11/site-packages/jupyterlab_sql_editor/outputters/util.py:358](opt/conda/lib/python3.11/site-packages/jupyterlab_sql_editor/outputters/util.py#line=357), in _create_converter_to_pandas.<locals>.correct_dtype(pser)
    357 def correct_dtype(pser: pd.Series) -> pd.Series:
--> 358     return pser.astype(pandas_type, copy=False)

File [/opt/conda/lib/python3.11/site-packages/pandas/core/generic.py:6324](opt/conda/lib/python3.11/site-packages/pandas/core/generic.py#line=6323), in NDFrame.astype(self, dtype, copy, errors)
   6317     results = [
   6318         self.iloc[:, i].astype(dtype, copy=copy)
   6319         for i in range(len(self.columns))
   6320     ]
   6322 else:
   6323     # else, only a single dtype is given
-> 6324     new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   6325     return self._constructor(new_data).__finalize__(self, method="astype")
   6327 # GH 33113: handle empty frame or series

File [/opt/conda/lib/python3.11/site-packages/pandas/core/internals/managers.py:451](opt/conda/lib/python3.11/site-packages/pandas/core/internals/managers.py#line=450), in BaseBlockManager.astype(self, dtype, copy, errors)
    448 elif using_copy_on_write():
    449     copy = False
--> 451 return self.apply(
    452     "astype",
    453     dtype=dtype,
    454     copy=copy,
    455     errors=errors,
    456     using_cow=using_copy_on_write(),
    457 )

File [/opt/conda/lib/python3.11/site-packages/pandas/core/internals/managers.py:352](opt/conda/lib/python3.11/site-packages/pandas/core/internals/managers.py#line=351), in BaseBlockManager.apply(self, f, align_keys, **kwargs)
    350         applied = b.apply(f, **kwargs)
    351     else:
--> 352         applied = getattr(b, f)(**kwargs)
    353     result_blocks = extend_blocks(applied, result_blocks)
    355 out = type(self).from_blocks(result_blocks, self.axes)

File [/opt/conda/lib/python3.11/site-packages/pandas/core/internals/blocks.py:511](opt/conda/lib/python3.11/site-packages/pandas/core/internals/blocks.py#line=510), in Block.astype(self, dtype, copy, errors, using_cow)
    491 """
    492 Coerce to the new dtype.
    493 
   (...)
    507 Block
    508 """
    509 values = self.values
--> 511 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
    513 new_values = maybe_coerce_values(new_values)
    515 refs = None

File [/opt/conda/lib/python3.11/site-packages/pandas/core/dtypes/astype.py:242](opt/conda/lib/python3.11/site-packages/pandas/core/dtypes/astype.py#line=241), in astype_array_safe(values, dtype, copy, errors)
    239     dtype = dtype.numpy_dtype
    241 try:
--> 242     new_values = astype_array(values, dtype, copy=copy)
    243 except (ValueError, TypeError):
    244     # e.g. _astype_nansafe can fail on object-dtype of strings
    245     #  trying to convert to float
    246     if errors == "ignore":

File [/opt/conda/lib/python3.11/site-packages/pandas/core/dtypes/astype.py:184](opt/conda/lib/python3.11/site-packages/pandas/core/dtypes/astype.py#line=183), in astype_array(values, dtype, copy)
    180     return values
    182 if not isinstance(values, np.ndarray):
    183     # i.e. ExtensionArray
--> 184     values = values.astype(dtype, copy=copy)
    186 else:
    187     values = _astype_nansafe(values, dtype, copy=copy)

File [/opt/conda/lib/python3.11/site-packages/pandas/core/arrays/datetimes.py:694](opt/conda/lib/python3.11/site-packages/pandas/core/arrays/datetimes.py#line=693), in DatetimeArray.astype(self, dtype, copy)
    682     raise TypeError(
    683         "Cannot use .astype to convert from timezone-aware dtype to "
    684         "timezone-naive dtype. Use obj.tz_localize(None) or "
    685         "obj.tz_convert('UTC').tz_localize(None) instead."
    686     )
    688 elif (
    689     self.tz is None
    690     and is_datetime64_dtype(dtype)
    691     and dtype != self.dtype
    692     and is_unitless(dtype)
    693 ):
--> 694     raise TypeError(
    695         "Casting to unit-less dtype 'datetime64' is not supported. "
    696         "Pass e.g. 'datetime64[ns]' instead."
    697     )
    699 elif is_period_dtype(dtype):
    700     return self.to_period(freq=dtype.freq)

TypeError: Casting to unit-less dtype 'datetime64' is not supported. Pass e.g. 'datetime64[ns]' instead.