Closed frreiss closed 3 years ago
@BryanCutler can you have a look at this design for multi-doc support and see if you notice any gotchas on either Arrow integration or passing the full Pandas ExtensionArray suite?
Merged with changes from master, found a bunch of regressions, and fixed them.
@BryanCutler how would you recommend we encode SpanArrays with multiple target texts in Arrow? Should we directly serialize the string table and an array of integer offsets into the table? Should we use an Arrow DictionaryArray? Something else?
I was thinking the best thing to do would be a DictionaryArray, but I'm not clear on how that would work with Pandas or whether it's even implemented, so I'll have to try it out. If not, we could do something different, but it might not be as clean as a dictionary.
@frreiss I did a preliminary test of using a nested DictionaryArray through saving/loading a Feather file and it works, so we should definitely go that route. It also might work that we could use the same DictionaryArray across different columns and only write it to the file once, to avoid duplicating the document text. I'll have to test more for that, but in the meantime we should proceed. Do you want to finish up with this PR and then I can follow with the Arrow changes?
Makes sense, @BryanCutler. If we'll be using Arrow's native dictionary coding for serialization, do you think it would make sense to replace the guts of the StringTable class with Arrow's dictionary data structure?
It might be a good idea; that would make it so there isn't a need to copy documents when doing IO. Basically you would just need to change _id_to_str to a pa.Array of type string. Also, I think a numpy array of strings would work fine too, if you don't want to add an Arrow dependency there.
Updated token-based spans to support multiple documents. This change required refactoring of the previous changes to character-based spans.
In the process, I changed the API for creating span arrays from a constructor call to a factory method. That is, SpanArray.create() instead of SpanArray().
Updated Feather support is not yet implemented; I've disabled the Feather tests for now.
All the other regression tests are now working.
Still need to update the notebooks to reflect the constructor changes.
Sounds good @frreiss , as soon as this is merged I'll work on the arrow changes and enable the tests again.
Updated the notebooks and fixed a few additional bugs.
@BryanCutler I think these changes are about ready to merge, with the exception of revamping the Arrow/Feather support. Can you review these changes, please?
Pushed changes that simplify the way TokenSpanArray manages multiple tokenizations and restore the conventional constructor APIs. @BryanCutler can you please have a look when you get a chance?
Pushed some additional changes to HTML rendering so that spans will render correctly in JupyterLab dark mode and in VSCode notebooks.
Thanks for the careful review, @BryanCutler ! I'm going to merge these changes into master now.
Work in progress, do not merge yet.
This PR contains the first step towards implementing #73. There's quite a bit of additional work to be done, but I'm sharing this WIP PR to get some feedback on the current design.
Here's what has been implemented so far:
- A new StringTable class, for efficiently managing the connection between document texts and spans.
- Modified the SpanArray class to support mixing spans from multiple documents.
- All tests in test_span.py pass.

Major things still remaining:

- Modify the TokenSpanArray class to also support multiple documents
- Modify the spanner package to support multiple documents per span array