Closed BryanCutler closed 3 years ago
@frreiss this is still a bit of a WIP; I still need to add some tests, such as one with an actual multi-doc array. Please have a look at the implementation when you can. It's a little complicated since we have dictionary arrays that point to other dictionary arrays, but overall I think it's a better approach than before.
Some areas are rough because of issues in pyarrow that could probably be filed as a JIRA and improved upstream, but I'll have to look into it more. Also, it seems this only works with pyarrow >= 2.0.0, so we might want to bump up the minimum version.
Merging this in its current form, as we need to have at least a semi-functional version of serialization in master so we can cut another release. @BryanCutler can you make a second PR with those testing improvements when they're ready?
Thanks @frreiss. Yeah, I'll do a followup with testing requirements. What do you think of bumping the minimum pyarrow to 2.0.0? Alternatively, we could just raise an error if the user tries to serialize with an older version.
This fixes Arrow conversion of a TokenSpanArray that may use multiple SpanArrays covering multiple documents. The conversion follows these steps:
From #179