CODAIT / text-extensions-for-pandas

Natural language processing support for Pandas dataframes.
Apache License 2.0
215 stars 34 forks source link

Fix test failures with Pandas 1.2.0 #157

Closed BryanCutler closed 3 years ago

BryanCutler commented 3 years ago

There are a number of test failures when using Pandas 1.2.0.

Fixes #158

BryanCutler commented 3 years ago

There seem to be a number of different issue with Pandas 1.2.0, I'll look into fixing them here too.

frreiss commented 3 years ago

@BryanCutler any idea what's causing that test failure on the Python 3.7 build (https://github.com/CODAIT/text-extensions-for-pandas/pull/157/checks?check_run_id=1620791768)?

BryanCutler commented 3 years ago

I fixed most of the issues with pandas 1.2.0 and TensorArray, except there is a regression with repr() with floats, similar to #151. There has been some back and forth with my fix upstream for that one, so maybe I can fix both. Unfortunately, I don't know of a workaround, so we either have to leave it as a known issue or cap the supported pandas version < 1.2.0. I'm looking into the other issues now.

BryanCutler commented 3 years ago

@frreiss this failure https://github.com/CODAIT/text-extensions-for-pandas/pull/157/checks?check_run_id=1646927381#step:5:1438 in bert.align_bert_tokens_to_corpus_tokens() seems a bit strange. Not sure why this would suddenly change, unless they added some optimization that wasn't complete. I can look into it further, but letting you know in case you have some idea what's going on.

BryanCutler commented 3 years ago

Ok, I think I found why the above error is happening and looks to be a bug. Pandas attempts to do a cython aggregation and then fallback to a vanilla agg if there is an error, but checks the error message improperly for this case. I should be able to file a bug report and do a PR for the fix.

frreiss commented 3 years ago

Thanks for tracking down the root cause of that problem, @BryanCutler !

BryanCutler commented 3 years ago

PR for the agg fix https://github.com/pandas-dev/pandas/pull/38982

BryanCutler commented 3 years ago

@frreiss I capped the upper version of Pandas to < 1.2.0 because not being able to display a Series with TensorArray of floats is pretty major and I don't see a workaround. I'll keep working on the upstream fix, but what are your thoughts on merging this?

frreiss commented 3 years ago

I think merging is the best way forward for now. We may want to post a new release just to get that pandas<1.2.0 constraint into our requirements.txt.