CODAIT / text-extensions-for-pandas

Natural language processing support for Pandas dataframes.
Apache License 2.0

Update CoNLL-U support to fully cover EWT data set #191

Closed: frreiss closed this issue 3 years ago

frreiss commented 3 years ago

text_extensions_for_pandas.io.conll.conll_2003_to_dataframes currently has trouble parsing data files from the EWT data set for shallow semantic parsing (https://github.com/UniversalDependencies/UD_English-EWT).

We need to update this support to cover this data set.

Recommended approach:

The change set for this item should include a test case that downloads and parses part of the EWT data set when run. The test should cache the downloaded files so that repeated runs don't download them again. Do not check any EWT data into the Text Extensions for Pandas repository.
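A minimal sketch of the download-and-cache step, using only the standard library. The helper name `download_with_cache`, the `.part` temporary suffix, and the raw-file URL for the dev fold are assumptions for illustration and are not part of Text Extensions for Pandas; the URL in particular should be checked against the upstream repository.

```python
import urllib.request
from pathlib import Path

# Assumed location of the EWT dev fold; verify against the upstream repository.
EWT_DEV_URL = (
    "https://raw.githubusercontent.com/UniversalDependencies/"
    "UD_English-EWT/master/en_ewt-ud-dev.conllu"
)


def download_with_cache(url: str, cache_dir: str) -> Path:
    """Download `url` into `cache_dir` unless a cached copy already exists."""
    cache_path = Path(cache_dir) / url.rsplit("/", 1)[-1]
    if cache_path.exists():
        return cache_path  # Cache hit: no network access needed
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so that an interrupted download
    # never leaves a truncated file in the cache.
    tmp_path = cache_path.with_name(cache_path.name + ".part")
    with urllib.request.urlopen(url) as response:
        tmp_path.write_bytes(response.read())
    tmp_path.replace(cache_path)
    return cache_path


# Hypothetical usage inside a test case:
#   path = download_with_cache(EWT_DEV_URL, "outputs/ewt_cache")
```

Pointing `cache_dir` at a directory that is listed in `.gitignore` would also help keep EWT data out of the repository.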

frreiss commented 3 years ago

Some specific data sets that should work out of the box:

frreiss commented 3 years ago

Known "gotchas" with CoNLL-U:

File encoding should be UTF-8, but the parser can't rely on that.
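One possible way to cope with files that are not valid UTF-8 is to try UTF-8 first and fall back to a single-byte encoding. This is only a sketch: the function name `read_conll_text` and the choice of Latin-1 as the fallback are assumptions, and a real implementation might prefer to let the caller pass the encoding explicitly.

```python
def read_conll_text(path: str, fallback_encoding: str = "latin-1") -> str:
    """Read a file as UTF-8, falling back to `fallback_encoding` on failure."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        # "utf-8-sig" decodes plain UTF-8 and also strips a leading
        # byte-order mark if one is present.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never raises,
        # but the result may be wrong if the file uses another encoding.
        return raw.decode(fallback_encoding)
```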

Comments often contain data. For example, the first document in the dev fold of EWT:

# newdoc id = weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713
# sent_id = weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713-0001
# newpar id = weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713-p0001
# text = From the AP comes this story :
1   From    from    ADP IN  _   3   case    3:case  _
2   the the DET DT  Definite=Def|PronType=Art   3   det 3:det   _
3   AP  AP  PROPN   NNP Number=Sing 4   obl 4:obl:from  _
4   comes   come    VERB    VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0   root    0:root  _
5   this    this    DET DT  Number=Sing|PronType=Dem    6   det 6:det   _
6   story   story   NOUN    NN  Number=Sing 4   nsubj   4:nsubj _
7   :   :   PUNCT   :   _   4   punct   4:punct _
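The comment lines in the example above follow a `# key = value` pattern, so they could be collected into per-sentence metadata roughly as follows. This is a sketch only; `parse_comment_lines` is a hypothetical helper, not an existing function in the library.

```python
from typing import Dict, Iterable, Optional


def parse_comment_lines(lines: Iterable[str]) -> Dict[str, Optional[str]]:
    """Collect `# key = value` comment lines into a dictionary."""
    metadata: Dict[str, Optional[str]] = {}
    for line in lines:
        if not line.startswith("#"):
            continue  # Token line, not a comment
        body = line[1:].strip()
        if not body:
            continue  # Bare "#" carries no data
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = body.partition("=")
        # A comment without "=" (a bare flag) is stored with value None.
        metadata[key.strip()] = value.strip() if sep else None
    return metadata


lines = [
    "# sent_id = weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713-0001",
    "# text = From the AP comes this story :",
    "1\tFrom\tfrom\tADP\tIN\t_\t3\tcase\t3:case\t_",
]
print(parse_comment_lines(lines)["text"])  # From the AP comes this story :
```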

There can be subword tokens. Users should be able to specify whether to merge these.

Sometimes a single surface token spans multiple words: one line carries an ID range (29-30 below) and the combined form, followed by lines that break out the component words separately. Example from EWT:

29-30   didn't  _   _   _   _   _   _   _   SpaceAfter=No
29  did do  VERB    VBD Mood=Ind|Tense=Past|VerbForm=Fin    4   conj    4:conj:but  _
30  n't not PART    RB  _   29  advmod  29:advmod   _
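A rough sketch of how a user-facing merge option could treat these range lines. The function name `resolve_multiword_tokens` and the `merge` flag are hypothetical, and the rows are assumed to be the tab-separated token lines of a single sentence.

```python
from typing import List


def resolve_multiword_tokens(rows: List[List[str]], merge: bool) -> List[List[str]]:
    """
    merge=True:  keep each range line (e.g. "29-30"), drop the words it covers.
    merge=False: drop each range line, keep the individual words.
    """
    result = []
    skip_until = 0  # Highest word ID covered by the last kept range line
    for row in rows:
        token_id = row[0]
        if "-" in token_id:
            if merge:
                result.append(row)
                skip_until = int(token_id.split("-")[1])
            continue
        # IDs containing "." are not plain word IDs; pass them through.
        if "." not in token_id and int(token_id) <= skip_until:
            continue  # Already represented by the merged surface token
        result.append(row)
    return result


SAMPLE = [
    "29-30\tdidn't\t_\t_\t_\t_\t_\t_\t_\tSpaceAfter=No",
    "29\tdid\tdo\tVERB\tVBD\tMood=Ind|Tense=Past|VerbForm=Fin\t4\tconj\t4:conj:but\t_",
    "30\tn't\tnot\tPART\tRB\t_\t29\tadvmod\t29:advmod\t_",
]
rows = [line.split("\t") for line in SAMPLE]
print([r[1] for r in resolve_multiword_tokens(rows, merge=True)])   # ["didn't"]
print([r[1] for r in resolve_multiword_tokens(rows, merge=False)])  # ['did', "n't"]
```

Note that merging keeps the surface form but loses the per-word annotations (lemma, part of speech, head), since the range line leaves those columns blank.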

There can be artificial tokens (i.e. words that don't exist in the original text) that were added to fill out the parse tree.
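In CoNLL-U these artificial tokens ("empty nodes") are, as far as I understand the format, marked with a decimal ID such as 8.1. A sketch of detecting and optionally dropping them follows; both helper names are hypothetical.

```python
from typing import List


def is_empty_node(token_id: str) -> bool:
    """True for decimal IDs such as "8.1"; False for "8" or "29-30"."""
    head, sep, tail = token_id.partition(".")
    return bool(sep) and head.isdigit() and tail.isdigit()


def drop_empty_nodes(rows: List[List[str]]) -> List[List[str]]:
    """Remove rows whose ID column marks an artificial token."""
    return [row for row in rows if not is_empty_node(row[0])]
```

Dropping these rows keeps the token sequence aligned with the original text, at the cost of discarding the dependencies that refer to them.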

There can be explicit information about whitespace: