Update CoNLL-U support to fully cover EWT data set

frreiss commented 3 years ago

text_extensions_for_pandas.io.conll.conll_2003_to_dataframes currently has trouble parsing data files from the EWT data set for shallow semantic parsing (https://github.com/UniversalDependencies/UD_English-EWT).

We need to update this support to cover this data set.

Recommended approach:

Create a new entry point specific to the format used in EWT -- say, conll_u_to_dataframes()
Add a new argument to cover the document metadata that EWT includes in comments
Add support for optional fields at the end of the record, repeated any number of times in sequence
Factor out common code with the existing conll_2003_to_dataframes into a shared internal entry point
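
As a rough illustration of the "optional fields at the end of the record" item, a shared internal helper could split each line into a fixed set of leading columns plus whatever trailing fields follow. This is only a sketch: `split_record` and its `num_fixed` parameter are hypothetical names, not part of the existing Text Extensions for Pandas API.

```python
from typing import List, Tuple


def split_record(line: str, num_fixed: int) -> Tuple[List[str], List[str]]:
    """Split one tab-separated record into fixed and trailing optional fields.

    Hypothetical helper: `num_fixed` is the number of mandatory leading
    columns; anything after them is returned as a (possibly empty) list.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < num_fixed:
        raise ValueError(
            f"Expected at least {num_fixed} fields, found {len(fields)}: {line!r}"
        )
    return fields[:num_fixed], fields[num_fixed:]
```

Both the CoNLL-2003 reader and a new CoNLL-U reader could call a helper like this with different values of `num_fixed`.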

The change set for this item should include a test case that downloads and parses part of the EWT data set when run. The test should cache the downloaded files so that repeated runs do not fetch them again. Do not check any EWT data into the Text Extensions for Pandas repository.
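
One possible shape for the download-and-cache step, using only the standard library. The function name `fetch_cached` and the cache layout are assumptions made for illustration; the project's actual test utilities may look different.

```python
import urllib.request
from pathlib import Path


def fetch_cached(url: str, cache_dir: str) -> Path:
    """Download `url` into `cache_dir` unless a cached copy already exists.

    Hypothetical helper: the cache key is simply the last path component
    of the URL, which is enough for a handful of known data files.
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    target = cache / url.rsplit("/", 1)[-1]
    if not target.exists():
        # Download to a temporary name first so that an interrupted
        # transfer never leaves a truncated file in the cache.
        partial = target.with_name(target.name + ".part")
        urllib.request.urlretrieve(url, str(partial))
        partial.replace(target)
    return target
```

A test could call this with a cache directory that is listed in `.gitignore`, which keeps the EWT files out of the repository.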

frreiss commented 3 years ago

Some specific data sets that should work out of the box:

Universal Dependencies treebank
English Web Treebank (EWT)
CoNLL-09
OntoNotes (this one requires a license to access)

frreiss commented 3 years ago

Known "gotchas" with CoNLL-U:

File encoding should be UTF-8, but we can't rely on that.
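
A minimal sketch of one way to cope with the encoding problem: try UTF-8 first and fall back to a single-byte encoding. The choice of Latin-1 as the fallback is an assumption, and `decode_lenient` is a hypothetical name.

```python
def decode_lenient(data: bytes) -> str:
    """Decode file contents as UTF-8, falling back to Latin-1.

    "utf-8-sig" also strips a leading byte-order mark if one is present.
    Latin-1 maps every byte to a character, so the fallback cannot fail,
    although it may produce the wrong characters for other encodings.
    """
    try:
        return data.decode("utf-8-sig")
    except UnicodeDecodeError:
        return data.decode("latin-1")
```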

Comments often contain data. For example, the first document in the dev fold of EWT:

# newdoc id = weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713
# sent_id = weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713-0001
# newpar id = weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713-p0001
# text = From the AP comes this story :
1	From	from	ADP	IN	_	3	case	3:case	_
2	the	the	DET	DT	Definite=Def|PronType=Art	3	det	3:det	_
3	AP	AP	PROPN	NNP	Number=Sing	4	obl	4:obl:from	_
4	comes	come	VERB	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	0	root	0:root	_
5	this	this	DET	DT	Number=Sing|PronType=Dem	6	det	6:det	_
6	story	story	NOUN	NN	Number=Sing	4	nsubj	4:nsubj	_
7	:	:	PUNCT	:	_	4	punct	4:punct	_
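
The comment lines above follow a `# key = value` pattern, so the metadata can be collected into a dictionary before the token lines are parsed. The sketch below assumes that pattern holds for every comment of interest; `collect_metadata` is a hypothetical helper name.

```python
import re
from typing import Dict, Iterable

# Matches "# key = value"; the key may contain spaces (e.g. "newdoc id").
_COMMENT_RE = re.compile(r"^#\s*([^=]+?)\s*=\s*(.*)$")


def collect_metadata(lines: Iterable[str]) -> Dict[str, str]:
    """Read the leading comment lines of a sentence into a dict.

    Stops at the first line that is not a comment. Comments that do not
    contain "=" are skipped rather than treated as errors.
    """
    metadata: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("#"):
            break
        match = _COMMENT_RE.match(line)
        if match:
            metadata[match.group(1)] = match.group(2).strip()
    return metadata
```

For the example above, this would yield the keys `newdoc id`, `sent_id`, `newpar id`, and `text`.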

There can be subword tokens. Users should be able to specify whether to merge these.

Sometimes a single line covers a range of tokens, followed by lines that break the same tokens out separately. An example from EWT:

29-30	didn't	_	_	_	_	_	_	_	SpaceAfter=No
29	did	do	VERB	VBD	Mood=Ind|Tense=Past|VerbForm=Fin	4	conj	4:conj:but	_
30	n't	not	PART	RB	_	29	advmod	29:advmod	_
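
A sketch of how a user-facing switch for merging might work on the rows of a single sentence. The function name and the `merge_multiword` flag are hypothetical; rows are assumed to be lists of field strings with the ID in position 0.

```python
from typing import List


def select_tokens(rows: List[List[str]], merge_multiword: bool) -> List[List[str]]:
    """Keep either the range line (e.g. "29-30") or its component words.

    Operates on one sentence at a time, since token IDs restart at 1
    in each sentence.
    """
    selected = []
    covered_until = 0  # highest word ID already covered by a kept range line
    for row in rows:
        token_id = row[0]
        if "-" in token_id:
            _, end = (int(part) for part in token_id.split("-"))
            if merge_multiword:
                selected.append(row)
                covered_until = end
            continue  # when not merging, drop the range line itself
        if token_id.isdigit() and int(token_id) <= covered_until:
            continue  # component word already represented by the range line
        selected.append(row)
    return selected
```

With the example above, merging keeps only the `didn't` line, while not merging keeps the `did` and `n't` lines.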

There can be artificial tokens (i.e., words that don't exist in the original text) that were added to fill out the parse tree.
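
In the CoNLL-U format, these artificial tokens ("empty nodes") are given decimal IDs such as `8.1`, which makes them distinguishable from ordinary words and from range lines by the ID field alone. A minimal classifier, with the hypothetical name `classify_token_id`, could look like this:

```python
import re

_WORD_ID = re.compile(r"\d+")
_RANGE_ID = re.compile(r"\d+-\d+")
_EMPTY_ID = re.compile(r"\d+\.\d+")


def classify_token_id(token_id: str) -> str:
    """Return "word", "multiword", or "empty" for a CoNLL-U ID field."""
    if _WORD_ID.fullmatch(token_id):
        return "word"
    if _RANGE_ID.fullmatch(token_id):
        return "multiword"
    if _EMPTY_ID.fullmatch(token_id):
        return "empty"
    raise ValueError(f"Unrecognized token ID: {token_id!r}")
```

A reader could use the result to drop empty nodes, or to flag them in an extra column so that users can filter them later.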

There can be explicit information about whitespace:

Sentence text in a comment before the sentence
The SpaceAfter attribute in the last ("misc") field of a line
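
To show how the SpaceAfter attribute relates to the sentence text, here is a small sketch that rebuilds the text from (form, misc) pairs. `detokenize` is a hypothetical helper, and it assumes the misc field uses `|` to separate attributes and `_` when empty.

```python
from typing import Iterable, Tuple


def detokenize(tokens: Iterable[Tuple[str, str]]) -> str:
    """Rebuild sentence text from (form, misc) pairs.

    A space follows every token unless its misc field contains
    "SpaceAfter=No".
    """
    pieces = []
    for form, misc in tokens:
        pieces.append(form)
        if "SpaceAfter=No" not in misc.split("|"):
            pieces.append(" ")
    return "".join(pieces).rstrip()
```

The result can be compared against the `# text = ...` comment as a consistency check. In the `didn't` example, SpaceAfter=No appears on the range line rather than on the component words, so the check is simplest when range lines are kept and their components are skipped.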

CODAIT / text-extensions-for-pandas

Update CoNLL-U support to fully cover EWT data set #191