Refactor span classes to lower users' learning curve

frreiss commented 4 years ago

This PR makes some changes to our class names to simplify the learning process for new users. The motivation behind this change is the following principle: A user who only needs the functionality of character-based spans should not need to understand token-based spans.

The main changes I've implemented are:

- We now refer to character-based spans as just "spans". The class that used to be called CharSpan is now called Span; CharSpanArray is now called SpanArray; and so on.
- Our dtypes now have names that end in "Dtype", for consistency with how Pandas names its data type objects.
- Instead of returning two columns of per-token spans ("token_span" and "char_span"), all the syntax analysis input functions (spaCy, CoNLL, and Watson NLU) now return just a single column, "span", of dtype SpanDtype.
- I've updated all the relevant example notebooks to reflect the new nomenclature. We no longer represent each token with both a "char_span" and a "token_span" column.
- Analyze_Model_Outputs.ipynb and Analyze_Text.ipynb still use TokenSpanDtype, but only to store spans that cover multiple tokens and are constrained to start and end on token boundaries.
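
To illustrate the principle behind the rename, here is a minimal, self-contained sketch of the distinction between the two kinds of span. The `Span` and `TokenSpan` classes below are simplified stand-ins written for this example. They are not the library's implementations, and the `to_span` helper is a hypothetical name. The point is only that a character-based span can be used without any knowledge of tokens, while a token-based span is defined in terms of token offsets and can always be reduced to a character-based one.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Span:
    """Illustrative character-based span: [begin, end) offsets into text."""
    text: str
    begin: int
    end: int

    @property
    def covered_text(self) -> str:
        # The substring of the target text that this span covers.
        return self.text[self.begin:self.end]


@dataclass(frozen=True)
class TokenSpan:
    """Illustrative token-based span: [begin_token, end_token) offsets
    into a sequence of tokens, each of which is itself a Span."""
    tokens: Tuple[Span, ...]
    begin_token: int
    end_token: int

    def to_span(self) -> Span:
        # Hypothetical helper: collapse the token range to character offsets.
        first = self.tokens[self.begin_token]
        last = self.tokens[self.end_token - 1]
        return Span(first.text, first.begin, last.end)


text = "Spans cover text."
# Hand-written tokenization, standing in for a tokenizer's output.
tokens = (
    Span(text, 0, 5),    # "Spans"
    Span(text, 6, 11),   # "cover"
    Span(text, 12, 16),  # "text"
    Span(text, 16, 17),  # "."
)

# A user who only needs character-based spans works with Span alone.
plain = Span(text, 6, 16)

# A span constrained to token boundaries covers tokens 1 and 2.
on_tokens = TokenSpan(tokens, 1, 3)
```

In this sketch, `on_tokens.to_span()` and `plain` cover the same characters, which is why a single "span" column can describe each token without a parallel token-based column.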

review-notebook-app[bot] commented 4 years ago

Check out this pull request on ReviewNB

Review Jupyter notebook visual diffs & provide feedback on notebooks.

Powered by ReviewNB

frreiss commented 4 years ago

CI found a regression. Pushed a fix.

CODAIT / text-extensions-for-pandas

Refactor span classes to lower users' learning curve #103