CODAIT / text-extensions-for-pandas

Natural language processing support for Pandas dataframes.
Apache License 2.0

Update SpaCy support to cover new features #176

Open frreiss opened 3 years ago

frreiss commented 3 years ago

SpaCy 3.0's language models now produce some additional features that we don't currently translate to DataFrames. The parse tree information now includes each token's children and ancestors. There is an is_sent_start flag that indicates whether a token is at the beginning of a sentence. Embeddings are available in the vector field of Token. There are probably a few more; see https://spacy.io/api/token for the full list.

We should extend the existing SpaCy support in https://github.com/CODAIT/text-extensions-for-pandas/blob/master/text_extensions_for_pandas/io/spacy.py to support these additional features if present.

With these additional features, the DataFrame representation of the full output of a SpaCy language model is getting a bit large. It would be a good idea to also add a facility to produce only the DataFrame columns that your application needs: say, an additional argument to make_tokens_and_features that replaces and generalizes the existing add_left_and_right argument and controls which columns appear in the output.
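
To make the column-selection idea concrete, here is a rough sketch of one possible shape for it. Everything named here is hypothetical: `FEATURE_EXTRACTORS`, `make_feature_columns`, and the `columns` argument are illustrations, not the existing API of make_tokens_and_features. The token attributes it reads (`i`, `text`, `is_sent_start`, `head`, `children`, `ancestors`) are the ones documented at https://spacy.io/api/token. To keep the sketch runnable without SpaCy or a language model installed, it uses small stand-in objects in place of real Token objects and returns a plain dict of columns; in practice that dict could be passed to `pd.DataFrame`.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, Iterable, List, Optional

# Hypothetical registry mapping an output column name to a function that
# extracts that feature from one token. The attribute names match the
# spaCy Token API; the registry itself is an assumption of this sketch.
FEATURE_EXTRACTORS: Dict[str, Callable[[Any], Any]] = {
    "id": lambda t: t.i,
    "text": lambda t: t.text,
    "is_sent_start": lambda t: t.is_sent_start,
    "head": lambda t: t.head.i,
    "children": lambda t: [c.i for c in t.children],
    "ancestors": lambda t: [a.i for a in t.ancestors],
}


def make_feature_columns(
    tokens: Iterable[Any], columns: Optional[List[str]] = None
) -> Dict[str, List[Any]]:
    """Build only the requested columns (hypothetical helper).

    `columns=None` means "every known column", which mirrors the current
    behavior of producing the full output.
    """
    if columns is None:
        columns = list(FEATURE_EXTRACTORS)
    unknown = [c for c in columns if c not in FEATURE_EXTRACTORS]
    if unknown:
        raise ValueError(f"Unknown feature columns: {unknown}")
    token_list = list(tokens)  # allow generators; we iterate once per column
    return {c: [FEATURE_EXTRACTORS[c](t) for t in token_list] for c in columns}


# Stand-in tokens for the two-word text "It works" (not real spaCy objects).
root = SimpleNamespace(i=1, text="works", is_sent_start=False, ancestors=[])
root.head = root  # in spaCy, the root token is its own head
subj = SimpleNamespace(
    i=0, text="It", is_sent_start=True, head=root, children=[], ancestors=[root]
)
root.children = [subj]

# Ask for just two of the available columns.
selected = make_feature_columns([subj, root], columns=["text", "is_sent_start"])
```

With a registry like this, the existing add_left_and_right behavior could become one more entry (or pair of entries) in the table rather than a dedicated boolean flag, and features that a given pipeline does not produce could simply be left out of the request.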