CODAIT / text-extensions-for-pandas

Natural language processing support for Pandas dataframes.
Apache License 2.0
215 stars 34 forks source link

[WIP] Fix Arrow conversion for TokenSpanArray with multi-doc #182

Closed BryanCutler closed 3 years ago

BryanCutler commented 3 years ago

This fixes Arrow conversion of a TokenSpanArray that possibly uses multiple SpanArrays with multiple documents. The conversion follows these steps:

  1. Get SpanArrays used in the TokenSpanArray and convert these to arrow, each is a struct with begins, ends, and dictionary array with the document
  2. Concat the arrow SpanArrays together to make an arrow list of them
  3. Build an arrow dictionary array that maps the index of the SpanArray used by each element of the TokenSpanArray to the SpanArray in the arrow list
  4. create a struct array with token begins, token ends, and the token dictionary array

From #179

BryanCutler commented 3 years ago

@frreiss this is still a bit of a WIP, need to add some tests like actual multi-doc array. Please have a look at the implementation when you can. It's a little complicated since we have dictionary arrays that point to other dictionary arrays, but overall I think it's a better approach than before.

Some areas are rough due to some issues in pyarrow, that probably could be made into a JIRA to be improved upstream, but I'll have to look into it more. Also, it seems this only works with pyarrow >= 2.0.0, so we might want to bump up the minimum version.

frreiss commented 3 years ago

Merging this in its current form, as we need to have at least a semi-functional version of serialization in master so we can cut another release. @BryanCutler can you make a second PR with those testing improvements when they're ready?

BryanCutler commented 3 years ago

Thanks @frreiss . Yeah, I'll do a followup with testing requirements. What do you think of bumping the minimum pyarrow to 2.0.0? Alternatively, we could just raise an error if the user tries to serialize with a lesser version.