Closed frreiss closed 3 years ago
Some specific data sets that should work out of the box:
Known "gotchas" with CoNLL-U:
File encoding should be UTF-8, but can't rely on that.
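One way to cope with files that are not valid UTF-8 is to try UTF-8 first and fall back to a single-byte encoding. The sketch below is only an illustration: the helper name `decode_conll_bytes` is hypothetical, and Latin-1 as the fallback is an assumption, not something the EWT files are known to need.

```python
def decode_conll_bytes(raw: bytes) -> str:
    """Decode file contents as UTF-8, falling back to Latin-1 if that fails."""
    try:
        # "utf-8-sig" behaves like UTF-8 but also drops a leading byte-order mark.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never raises,
        # but the text may be wrong if the file uses another legacy encoding.
        return raw.decode("latin-1")
```

A reader would open the file in binary mode and pass the bytes through this helper before splitting into lines.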
Comments often contain data. For example, the first document in the dev fold of EWT:
# newdoc id = weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713
# sent_id = weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713-0001
# newpar id = weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713-p0001
# text = From the AP comes this story :
1 From from ADP IN _ 3 case 3:case _
2 the the DET DT Definite=Def|PronType=Art 3 det 3:det _
3 AP AP PROPN NNP Number=Sing 4 obl 4:obl:from _
4 comes come VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root 0:root _
5 this this DET DT Number=Sing|PronType=Dem 6 det 6:det _
6 story story NOUN NN Number=Sing 4 nsubj 4:nsubj _
7 : : PUNCT : _ 4 punct 4:punct _
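A minimal sketch of pulling that comment data out of a sentence block, assuming the comments follow the `# key = value` pattern shown above. The function name `parse_sentence_comments` is hypothetical and is not part of the existing library.

```python
import re

# Matches "# key = value" comment lines; keys such as "newdoc id" contain spaces.
_COMMENT_RE = re.compile(r"^#\s*(?P<key>[^=]+?)\s*=\s*(?P<value>.*)$")


def parse_sentence_comments(lines):
    """Collect "# key = value" comments from one sentence block into a dict."""
    metadata = {}
    for line in lines:
        if not line.startswith("#"):
            continue  # token lines are handled elsewhere
        match = _COMMENT_RE.match(line.strip())
        if match:
            metadata[match.group("key")] = match.group("value")
        # Comments without "=" are skipped in this sketch.
    return metadata
```

Applied to the block above, this would yield keys such as `newdoc id`, `sent_id`, `newpar id`, and `text`, which could then become extra columns or document-level attributes in the output DataFrames.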
There can be subword tokens. Users should be able to specify whether to merge these.
Sometimes there are multiple tokens on one line, followed by the same tokens broken out separately; example from EWT:
29-30 didn't _ _ _ _ _ _ _ SpaceAfter=No
29 did do VERB VBD Mood=Ind|Tense=Past|VerbForm=Fin 4 conj 4:conj:but _
30 n't not PART RB _ 29 advmod 29:advmod _
There can be artificial tokens (i.e. words that don't exist in the original text) that were added to fill out the parse tree.
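The token kinds described above can be told apart from the ID column alone. The sketch below assumes tab-separated CoNLL-U lines, range IDs such as `29-30` for multiword tokens, and decimal IDs such as `8.1` for the artificial tokens (what the CoNLL-U format calls empty nodes). The function names and the `keep_*` flags are hypothetical, meant only to show how a user-facing option might look.

```python
def classify_token_id(token_id: str) -> str:
    """Classify a CoNLL-U ID field as "multiword", "empty", or "word"."""
    if "-" in token_id:
        return "multiword"  # range ID such as "29-30"
    if "." in token_id:
        return "empty"  # decimal ID such as "8.1"
    return "word"


def filter_token_lines(lines, keep_multiword=False, keep_empty=False):
    """Return token lines, dropping multiword and empty-node lines unless asked to keep them."""
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank separators and comments
        token_id = line.split("\t", 1)[0]
        kind = classify_token_id(token_id)
        if kind == "multiword" and not keep_multiword:
            continue
        if kind == "empty" and not keep_empty:
            continue
        kept.append(line)
    return kept
```

With the defaults, the `didn't` example keeps only the `did` and `n't` lines; passing `keep_multiword=True` would also keep the `29-30` line so a caller could merge the pieces back into one token.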
There can be explicit information about whitespace: the SpaceAfter attribute in the last ("misc") field of a line.
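As an illustration of how the whitespace information could be used, the sketch below rebuilds a sentence from (form, misc) pairs, assuming the misc field is either `_` or a `|`-separated list of `key=value` items. Both function names are hypothetical.

```python
def has_space_after(misc: str) -> bool:
    """Return False only when the misc field contains SpaceAfter=No."""
    if misc == "_":
        return True  # "_" means the field is empty
    attributes = dict(item.split("=", 1) for item in misc.split("|") if "=" in item)
    return attributes.get("SpaceAfter") != "No"


def detokenize(tokens):
    """Join (form, misc) pairs into text, honoring SpaceAfter=No."""
    pieces = []
    for form, misc in tokens:
        pieces.append(form)
        if has_space_after(misc):
            pieces.append(" ")
    # Drop the space that would otherwise trail the final token.
    return "".join(pieces).rstrip()
```

For example, `detokenize([("Hello", "SpaceAfter=No"), (",", "_"), ("world", "SpaceAfter=No"), ("!", "_")])` joins the tokens without a space before the comma or the exclamation mark.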
text_extensions_for_pandas.io.conll.conll_2003_to_dataframes currently has trouble parsing data files from the EWT data set for shallow semantic parsing (https://github.com/UniversalDependencies/UD_English-EWT). We need to update this support to cover this data set.
Recommended approach:
Add a new function conll_u_to_dataframes() and refactor the code it shares with conll_2003_to_dataframes into a shared internal entry point.
The change set for this item should include a test case that downloads and parses part of the EWT data set when run. Be sure to cache the downloaded files. Be sure to avoid checking in EWT data to the Text Extensions for Pandas repository.
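The download-and-cache behavior for the test could look roughly like the sketch below. This is an assumption about one possible approach, not the project's actual test utility: `cached_download` and its `fetch` parameter are hypothetical names, and the cache directory is assumed to be excluded from version control (for example via `.gitignore`) so that no EWT data gets checked in.

```python
import os
import urllib.request


def cached_download(url, cache_dir, fetch=None):
    """Return a local path for `url`, downloading it only if it is not cached yet.

    `fetch` is an optional callable taking a URL and returning bytes; it exists
    so the caching logic can be exercised without network access.
    """
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(local_path):
        if fetch is None:
            with urllib.request.urlopen(url) as response:
                data = response.read()
        else:
            data = fetch(url)
        with open(local_path, "wb") as out:
            out.write(data)
    return local_path
```

A test would call this once per EWT file it needs and then hand the returned path to the parser; repeated runs reuse the cached copy instead of downloading again.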