CODAIT / text-extensions-for-pandas

Natural language processing support for Pandas dataframes.
Apache License 2.0

Fix Arrow serialization for SpanArray multidoc support #181

Closed: BryanCutler closed this 3 years ago

BryanCutler commented 3 years ago

This changes the Arrow serialization of SpanArray to store documents in a dictionary indexed by text ID, so that one SpanArray can reference multiple documents. It also adds support for saving to Parquet files.
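For readers unfamiliar with the idea, here is a rough plain-Python sketch of dictionary-encoding document texts by ID. It is not the code in this PR and does not use the Arrow API. The names `EncodedSpans`, `encode_spans`, `decode_spans`, and `covered_text` are hypothetical and exist only for illustration. The assumption is that each span carries begin/end offsets plus a reference to its document, and that each distinct document text should be stored only once.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class EncodedSpans:
    """Hypothetical container: span offsets plus a dictionary of document texts."""
    begins: List[int]
    ends: List[int]
    text_ids: List[int]        # one ID per span, pointing into `documents`
    documents: Dict[int, str]  # each distinct document text is stored once


def encode_spans(spans: List[Tuple[int, int, str]]) -> EncodedSpans:
    """Replace each span's document text with an integer ID into a shared dictionary."""
    id_of_text: Dict[str, int] = {}
    encoded = EncodedSpans([], [], [], {})
    for begin, end, text in spans:
        # Reuse the ID if this document was seen before, else assign the next one.
        text_id = id_of_text.setdefault(text, len(id_of_text))
        encoded.documents[text_id] = text
        encoded.begins.append(begin)
        encoded.ends.append(end)
        encoded.text_ids.append(text_id)
    return encoded


def decode_spans(encoded: EncodedSpans) -> List[Tuple[int, int, str]]:
    """Rebuild (begin, end, document text) tuples by looking up each text ID."""
    return [
        (begin, end, encoded.documents[text_id])
        for begin, end, text_id in zip(encoded.begins, encoded.ends, encoded.text_ids)
    ]


def covered_text(encoded: EncodedSpans, i: int) -> str:
    """Return the substring of the target document covered by span `i`."""
    document = encoded.documents[encoded.text_ids[i]]
    return document[encoded.begins[i]:encoded.ends[i]]
```

With this layout, spans over the same document share one dictionary entry, so the text is not repeated per span and spans over several documents can sit in the same array.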

From #179

BryanCutler commented 3 years ago

@frreiss this seems like a good improvement for SpanArray serialization. It's much better to store the documents in a dictionary batch than in field metadata. If this looks OK, I'll get started on TokenSpanArray.

BryanCutler commented 3 years ago

I think I addressed all the comments, and tests are passing. I'll go ahead and merge now, and fix up anything else in a follow-up or when I fix the TokenSpanArray Arrow conversion.