Closed frreiss closed 3 years ago
@BryanCutler can you have a look at this design for multi-doc support and see if you notice any gotchas on either Arrow integration or passing the full Pandas ExtensionArray suite?
Merged with changes from master, found a bunch of regressions, and fixed them.
@BryanCutler how would you recommend we encode SpanArrays with multiple target texts in Arrow? Should we directly serialize the string table and an array of integer offsets into the table? Should we use an Arrow DictionaryArray? Something else?
I was thinking the best thing to do would be a DictionaryArray, but I'm not clear on how that would work with Pandas or whether it's even implemented, so I'll have to try it out. If not, we could do something different, but it might not be as clean as a dictionary.
@frreiss I did a preliminary test of using a nested DictionaryArray through saving/loading a Feather file and it works, so we should definitely go that route. It also might work that we could use the same DictionaryArray across different columns and only write it to the file once, to avoid duplicating the document text. I'll have to test more for that, but in the meantime we should proceed. Do you want to finish up with this PR and then I can follow with the Arrow changes?
Makes sense, @BryanCutler. If we'll be using Arrow's native dictionary coding for serialization, do you think it would make sense to replace the guts of the StringTable class with Arrow's dictionary data structure?
It might be a good idea; that would make it so there isn't a need to copy documents when doing IO. Basically you would just need to change _id_to_str to a pa.Array of type string. Also, I think a numpy array of strings would work fine too, if you don't want to add an Arrow dependency there.
Updated token-based spans to support multiple documents. This change required refactoring of the previous changes to character-based spans.
In the process, I changed the API for creating span arrays from a constructor call to a factory method. That is, SpanArray.create() instead of SpanArray().
Updated Feather support is not yet implemented; I've disabled the Feather tests for now.
All the other regression tests are now working.
Still need to update the notebooks to reflect the constructor changes.
Sounds good @frreiss , as soon as this is merged I'll work on the arrow changes and enable the tests again.
Updated the notebooks and fixed a few additional bugs.
@BryanCutler I think these changes are about ready to merge, with the exception of revamping the Arrow/Feather support. Can you review these changes, please?
Pushed changes that simplify the way TokenSpanArray manages multiple tokenizations and restore the conventional constructor APIs. @BryanCutler can you please have a look when you get a chance?
Pushed some additional changes to HTML rendering so that spans will render correctly in JupyterLab dark mode and in VSCode notebooks.
Thanks for the careful review, @BryanCutler ! I'm going to merge these changes into master now.
Work in progress, do not merge yet.
This PR contains the first step towards implementing #73. There's quite a bit of additional work to be done, but I'm sharing this WIP PR to get some feedback on the current design.
Here's what has been implemented so far:
- A new StringTable class, for efficiently managing the connection between document texts and spans.
- Modified the SpanArray class to support mixing spans from multiple documents.
- All tests in test_span.py pass.

Major things still remaining:

- Modify the TokenSpanArray class to also support multiple documents
- Modify the spanner package to support multiple documents per span array