EmilStenstrom / conllu

A CoNLL-U parser that takes a CoNLL-U formatted string and turns it into a nested python dictionary.
MIT License

Generating TokenList of a sentence containing multi-word tokens causes improper list comprehensions #69

Closed Akshayanti closed 2 years ago

Akshayanti commented 2 years ago

Consider this part of a sentence:

17  Sarajewo    Sarajewo    PROPN   NE  Case=Dat|Gender=Neut|Number=Sing    15  nmod    _   NamedEntity=Yes
18-19   zur _   _   _   _   _   _   _   _  
18  zu  zu  ADP APPR    _   21  case    _   _  
19  der der DET ART Case=Dat|Definite=Def|Gender=Fem|Number=Sing|PronType=Art   21  det _   _  
20  humanitären humanitär   ADJ ADJA    Case=Dat|Gender=Fem|Number=Sing 21  amod    _   _  

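The TokenList built from this sentence returns unexpected values when indexed. For example:

for x in range(17, 21): print(sentence[x]) prints zur, zu, der, humanitären

The multi-word token zur (18-19) occupies its own position in the list, so every later token sits one index further along than expected, which throws off list comprehensions that rely on token positions.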

EmilStenstrom commented 2 years ago

Hi! This is intended. Some people want to use the multi-word token instead of the individual tokens, and some want the opposite. This library does not choose, and instead just lets you decide which version you want.

An example of filtering out all the multi-word tokens is available in the docs: https://github.com/EmilStenstrom/conllu#new-in-conllu-43-filter-a-tokenlist-by-lambda

Does this make sense?
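
To illustrate the behaviour discussed above, here is a minimal plain-Python sketch that does not import conllu. The helpers `parse_id` and `parse_line` are hypothetical names, and representing a range ID such as 18-19 as a tuple is an assumption made for this example; the library's own filter is the one documented at the link above. The sketch shows why the multi-word token shifts list positions, and how keeping only integer IDs restores one entry per syntactic word.

```python
# Illustration only: a stand-in for a parsed sentence, not the conllu API.
# The rows are the ones quoted in the issue, with columns separated by whitespace.
LINES = [
    "17 Sarajewo Sarajewo PROPN NE Case=Dat|Gender=Neut|Number=Sing 15 nmod _ NamedEntity=Yes",
    "18-19 zur _ _ _ _ _ _ _ _",
    "18 zu zu ADP APPR _ 21 case _ _",
    "19 der der DET ART Case=Dat|Definite=Def|Gender=Fem|Number=Sing|PronType=Art 21 det _ _",
    "20 humanitären humanitär ADJ ADJA Case=Dat|Gender=Fem|Number=Sing 21 amod _ _",
]


def parse_id(raw):
    """Return an int for a plain ID, or a tuple for a range ID (assumed representation)."""
    if "-" in raw:
        start, end = raw.split("-")
        return (int(start), "-", int(end))
    return int(raw)


def parse_line(line):
    """Keep only the ID and FORM columns, which is all this sketch needs."""
    fields = line.split()
    return {"id": parse_id(fields[0]), "form": fields[1]}


tokens = [parse_line(line) for line in LINES]

# Unfiltered: the multi-word token "zur" takes up a list position of its own.
all_forms = [t["form"] for t in tokens]

# Filtered: keep only tokens whose ID is a plain integer.
word_tokens = [t for t in tokens if isinstance(t["id"], int)]
word_forms = [t["form"] for t in word_tokens]

print(all_forms)
print(word_forms)
```

After filtering, consecutive list positions correspond to consecutive token IDs again, so positional loops and list comprehensions behave as expected. Keeping only the range tokens instead (the opposite preference mentioned above) would be the same filter with the condition inverted.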

Akshayanti commented 2 years ago

That explains it, thank you! I will close the issue as not relevant.