CODAIT / text-extensions-for-pandas

Natural language processing support for Pandas dataframes.
Apache License 2.0

2+ dimensional tensors of timestamps crash pd.Series.__repr__() #151

Closed: frreiss closed this issue 3 years ago

frreiss commented 3 years ago

Code to reproduce:

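import text_extensions_for_pandas as tp
import pandas as pd
import numpy as np

# Five hourly timestamps as a 1-D datetime64 array
times = pd.date_range('2018-01-01', periods=5, freq='H').to_numpy()
# Stack three copies to get a 3x5 (2-D) array
times_repeated = np.tile(times, (3, 1))
# Wrap the 2-D array so that each Series element is one row of 5 timestamps
times_array = tp.TensorArray(times_repeated)
times_series = pd.Series(times_array)
# This call raises the ValueError shown in the stack trace
print(repr(times_series))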

Expected result: Display a 3x5 matrix of timestamps

Actual result: Crash from inside a Pandas routine that should only be called for 1-D arrays of timestamps. Stack trace follows.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-b6b4c1b73ef2> in <module>
      7 times_array = tp.TensorArray(times_repeated)
      8 times_series = pd.Series(times_array)
----> 9 repr(times_series)

~/pd/cn-update/env/lib/python3.7/site-packages/pandas/core/series.py in __repr__(self)
   1331             min_rows=min_rows,
   1332             max_rows=max_rows,
-> 1333             length=show_dimensions,
   1334         )
   1335         result = buf.getvalue()

~/pd/cn-update/env/lib/python3.7/site-packages/pandas/core/series.py in to_string(self, buf, na_rep, float_format, header, index, length, dtype, name, max_rows, min_rows)
   1396             max_rows=max_rows,
   1397         )
-> 1398         result = formatter.to_string()
   1399 
   1400         # catch contract violations

~/pd/cn-update/env/lib/python3.7/site-packages/pandas/io/formats/format.py in to_string(self)
    356 
    357         fmt_index, have_header = self._get_formatted_index()
--> 358         fmt_values = self._get_formatted_values()
    359 
    360         if self.truncate_v:

~/pd/cn-update/env/lib/python3.7/site-packages/pandas/io/formats/format.py in _get_formatted_values(self)
    345             None,
    346             float_format=self.float_format,
--> 347             na_rep=self.na_rep,
    348         )
    349 

~/pd/cn-update/env/lib/python3.7/site-packages/pandas/io/formats/format.py in format_array(values, formatter, float_format, na_rep, digits, space, justify, decimal, leading_space, quoting)
   1177     )
   1178 
-> 1179     return fmt_obj.get_result()
   1180 
   1181 

~/pd/cn-update/env/lib/python3.7/site-packages/pandas/io/formats/format.py in get_result(self)
   1208 
   1209     def get_result(self) -> List[str]:
-> 1210         fmt_values = self._format_strings()
   1211         return _make_fixed_width(fmt_values, self.justify)
   1212 

~/pd/cn-update/env/lib/python3.7/site-packages/pandas/io/formats/format.py in _format_strings(self)
   1498             space=self.space,
   1499             justify=self.justify,
-> 1500             leading_space=self.leading_space,
   1501         )
   1502         return fmt_values

~/pd/cn-update/env/lib/python3.7/site-packages/pandas/io/formats/format.py in format_array(values, formatter, float_format, na_rep, digits, space, justify, decimal, leading_space, quoting)
   1177     )
   1178 
-> 1179     return fmt_obj.get_result()
   1180 
   1181 

~/pd/cn-update/env/lib/python3.7/site-packages/pandas/io/formats/format.py in get_result(self)
   1208 
   1209     def get_result(self) -> List[str]:
-> 1210         fmt_values = self._format_strings()
   1211         return _make_fixed_width(fmt_values, self.justify)
   1212 

~/pd/cn-update/env/lib/python3.7/site-packages/pandas/io/formats/format.py in _format_strings(self)
   1468 
   1469         if self.formatter is not None and callable(self.formatter):
-> 1470             return [self.formatter(x) for x in values]
   1471 
   1472         fmt_values = format_array_from_datetime(

~/pd/cn-update/env/lib/python3.7/site-packages/pandas/io/formats/format.py in <listcomp>(.0)
   1468 
   1469         if self.formatter is not None and callable(self.formatter):
-> 1470             return [self.formatter(x) for x in values]
   1471 
   1472         fmt_values = format_array_from_datetime(

~/pd/cn-update/env/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py in __iter__(self)
    565             end_i = min((i + 1) * chunksize, length)
    566             converted = ints_to_pydatetime(
--> 567                 data[start_i:end_i], tz=self.tz, freq=self.freq, box="timestamp"
    568             )
    569             for v in converted:

pandas/_libs/tslibs/vectorized.pyx in pandas._libs.tslibs.vectorized.ints_to_pydatetime()

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

BryanCutler commented 3 years ago

The issue here is that the Series uses the Pandas Datetime64Formatter, which converts the values to a DatetimeIndex, and a DatetimeIndex expects its data to be 1-dimensional. I will look some more for a possible workaround or upstream fix.
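To illustrate the 1-dimensional constraint described above, here is a minimal sketch that formats a 2-D array of timestamps by flattening it first, so that DatetimeIndex only ever sees 1-D data, and then restoring the original shape. The helper name format_timestamp_tensor is hypothetical and is not part of Pandas or text-extensions-for-pandas; this is also not necessarily how the workaround or upstream fix is implemented.

```python
import numpy as np
import pandas as pd


def format_timestamp_tensor(values: np.ndarray) -> np.ndarray:
    """Hypothetical helper: format an n-D datetime64 array as strings.

    The array is flattened before it is handed to DatetimeIndex, which
    only accepts 1-D data, and the strings are reshaped afterwards.
    """
    flat = pd.DatetimeIndex(values.ravel())
    text = flat.strftime("%Y-%m-%d %H:%M:%S").to_numpy(dtype=object)
    return text.reshape(values.shape)


# Same data as in the report: five hourly timestamps, repeated three times.
times = (
    np.datetime64("2018-01-01T00:00") + np.arange(5) * np.timedelta64(1, "h")
).astype("datetime64[ns]")
times_repeated = np.tile(times, (3, 1))

formatted = format_timestamp_tensor(times_repeated)
for row in formatted:
    print(" ".join(row))
```

Each printed line corresponds to one row of the 3x5 matrix, which is roughly what the report expects the Series repr to show for each element.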

BryanCutler commented 3 years ago

Submitted a fix upstream to Pandas at https://github.com/pandas-dev/pandas/pull/38391