CODAIT / text-extensions-for-pandas

Natural language processing support for Pandas dataframes.
Apache License 2.0

Handle corner cases involving empty inputs to overlaps/contains join #161

Closed frreiss closed 3 years ago

frreiss commented 3 years ago

The current implementation of tp.spanner.overlap_join crashes while computing shingle lengths if both inputs are empty. This PR fixes this problem.

I have added regression tests for the case that triggered the original crash, as well as a number of other combinations of empty/non-empty inputs, plus combinations of character-based and token-based span arrays.

I also encountered and fixed a minor bug in the way that the constructors for SpanArray and TokenSpanArray handle empty Python lists of begin and end offsets.
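For illustration only: the PR text does not say what the constructor bug was. One plausible cause, assumed here, is that NumPy cannot infer an integer dtype from an empty Python list, so offsets built from `[]` default to `float64` unless a dtype is given. The variable names below are made up for this sketch.

```python
import numpy as np

# Assumption: offsets arrive as plain Python lists, possibly empty.
empty_default = np.array([])                  # no elements to infer a dtype from
empty_int = np.array([], dtype=np.int32)      # dtype stated explicitly

print(empty_default.dtype)   # float64
print(empty_int.dtype)       # int32

# A non-empty list of ints does get an integer dtype on its own.
non_empty = np.array([0, 5, 10])
print(np.issubdtype(non_empty.dtype, np.integer))   # True
```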

frreiss commented 3 years ago

Hmm, looks like np.array() is not a no-op by default if the input is already the right kind of array. Pushed a fix.

BryanCutler commented 3 years ago

No, you want np.asarray() here. Here is the doc: https://numpy.org/doc/stable/reference/generated/numpy.asarray.html
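A small sketch of the difference being discussed, using made-up offset values: `np.array()` copies its input by default, while `np.asarray()` hands back the same object when the input is already an ndarray of a compatible dtype.

```python
import numpy as np

# Hypothetical begin offsets, already stored as an int32 ndarray.
begins = np.array([0, 5, 10], dtype=np.int32)

copied = np.array(begins)     # copy=True by default: allocates a new array
passed = np.asarray(begins)   # already an ndarray: returned unchanged

print(copied is begins)                    # False
print(passed is begins)                    # True
print(np.shares_memory(copied, begins))    # False
```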

frreiss commented 3 years ago

I went a bit further than that and added some code to pass through any integer type without copying to int32. Merging this PR now.
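A rough sketch of what such a pass-through could look like. The helper name `as_offsets` and its exact logic are assumptions for illustration, not the code merged in this PR: integer ndarrays are returned untouched, and everything else (Python lists, including empty ones) is converted to int32.

```python
import numpy as np

def as_offsets(values):
    """Hypothetical helper: normalize begin/end offsets to an integer ndarray."""
    # Any integer ndarray (int8, int32, int64, uint16, ...) passes through
    # as-is, so no copy and no narrowing to int32 takes place.
    if isinstance(values, np.ndarray) and np.issubdtype(values.dtype, np.integer):
        return values
    # Lists and non-integer arrays are converted; the explicit dtype also
    # keeps an empty list from defaulting to float64.
    return np.asarray(values, dtype=np.int32)

wide = np.array([0, 5, 10], dtype=np.int64)
print(as_offsets(wide) is wide)       # True
print(as_offsets([]).dtype)           # int32
print(as_offsets([0, 5]).dtype)       # int32
```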

BryanCutler commented 3 years ago

Ah yes, that is a good idea since any integer type should be fine.