Question about tokenizing words like 'tecum'

diyclassics commented 8 years ago

I noticed in the treebank data that compounds with "-cum"—like 'tecum'—are tokenized as a single token. E.g.

<word id="9" form="tecum" lemma="tu1" postag="p-s---mb-" head="11" relation="ADV"/>

Is there a reason that this is not tokenized as two tokens, i.e. 'cum' + 'te'? (Cf. 'neque' which is tokenized as 'que' + 'ne'.) From a treebanking point of view, it seems like this construction should be comparable to other prepositional phrases of the form 'cum' + abl. noun/pronoun.
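For illustration, here is a minimal sketch of the two-token behaviour asked about above. The closed word list and the names `CUM_COMPOUNDS`, `split_cum` and `tokenize` are hypothetical and are not taken from the Perseus, LLT or CLTK code. The output order ('cum' first, then the pronoun) simply mirrors the 'que' + 'ne' convention mentioned above.

```python
# Hypothetical closed list of "-cum" compounds mapped to their pronoun part.
# A closed list (rather than stripping any final "cum") avoids splitting
# unrelated words such as "circum".
CUM_COMPOUNDS = {
    "mecum": "me",
    "tecum": "te",
    "secum": "se",
    "nobiscum": "nobis",
    "vobiscum": "vobis",
    "quocum": "quo",
    "quacum": "qua",
    "quicum": "qui",
    "quibuscum": "quibus",
}


def split_cum(token):
    """Return ['cum', pronoun] for a listed compound, otherwise [token]."""
    pronoun = CUM_COMPOUNDS.get(token.lower())
    if pronoun is None:
        return [token]
    return ["cum", pronoun]


def tokenize(text):
    """Whitespace tokenizer that expands listed '-cum' compounds."""
    tokens = []
    for word in text.split():
        tokens.extend(split_cum(word))
    return tokens
```

Note that this sketch lowercases the pronoun of a split compound and does no punctuation handling; a real tokenizer would need both.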

More curiosity than anything else—I'm working on a Latin tokenizer myself and trying to follow Perseus NLP practice as closely as possible. Thanks!

ps. Are these tokenizing decisions documented anywhere that I can review?

balmas commented 8 years ago

@gcelano has been working on normalizing the Perseus treebank data but I'm not sure if tokenization is one of the issues he is addressing. We followed slightly different practices in the early days of the treebank than we do now. You can find some discussions on this topic in the issues list for the tokenizer we are currently using for Perseids: https://github.com/latin-language-toolkit/llt-tokenizer/issues

Switching to CLTK from LLT (or offering CLTK as an alternative) is something I have been interested in pursuing, as the LLT services are no longer being actively maintained.

We developed a RESTful API for the LLT tokenization and segmentation services that made it easy to integrate them with other Perseids tools. It's not perfect, but the functionality exposed there may be of interest to others doing this sort of work, and standardizing on RESTful APIs for this functionality would make it much easier to swap different implementations in and out.

nevenjovanovic commented 8 years ago

Studying the treebanks in Latin Tündra Perseus, I am encountering cases of "nec" being analyzed into "c ne" (as @diyclassics seems also to have been doing in his tokenizer); cf. https://weblicht.sfs.uni-tuebingen.de/TundraPerseus/index.zul?tbname=PerseusLatin&tbsent=3119 . It is perfectly clear to me why it should be analyzed like this (I have read https://github.com/latin-language-toolkit/llt-tokenizer/issues/27). It is less clear, however, why the analysis should be displayed in this order, and not as ne c, or, even better, ne -c (cf. virum -que). The inverted order confuses readers of treebanked sentences, and appears, on the whole, unnecessarily clumsy from the linguistic point of view; elsewhere you don't change the original word order in the display. Nevertheless, there are 29 occurrences of non-analyzed "nec" in the treebank on Tündra, cf. e. g. https://weblicht.sfs.uni-tuebingen.de/TundraPerseus/index.zul?tbname=PerseusLatin&tbsent=471 , or search there for [word="nec"]. I think that the Latin treebank is inconsistent here, which is not acceptable if we want to have a gold standard.

And a sincere +1 for opening documentation on tokenizing decisions!

gcelano commented 8 years ago

Hi Neven,

You are right, and this is why nec and neque are now kept univerbated. Have a look at the repository, where there is a new version of the data (2.1). There you will find this problem solved for most texts, even though some of them (specified in the documentation) still need a major revision (which includes resolving this problem).

I will ask that the new data be made available in Tundra, but you may have to wait for some time, since the upload does not depend on me.

Best, Giuseppe

diyclassics commented 8 years ago

@gcelano, this is helpful to know. I wrote the CLTK tokenizer (with the nec/neque split and reversed order) based on the example from the treebank data. My goal with these tools is to align them with large projects like the Perseus NLP research. I will likely change it back now, and perhaps add a flag to keep this behavior if the user wants it.
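A rough sketch of what such a flag could look like, assuming nothing about the actual CLTK code: the function name `tokenize_negative` and the parameter `split_nec` are hypothetical. By default nec and neque stay univerbated, and the flag restores the older split with reversed order ('que' + 'ne', 'c' + 'ne').

```python
def tokenize_negative(token, split_nec=False):
    """Hypothetical helper: optionally split nec/neque in the older treebank style.

    With split_nec=False (the default) the token is returned unchanged.
    With split_nec=True, 'neque' becomes ['que', 'ne'] and 'nec' becomes
    ['c', 'ne'], i.e. enclitic first, as in the earlier treebank data.
    """
    if split_nec:
        lowered = token.lower()
        if lowered == "neque":
            return ["que", "ne"]
        if lowered == "nec":
            return ["c", "ne"]
    return [token]
```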

Any thoughts on "-cum" compounds? Thanks!

nevenjovanovic commented 8 years ago

@gcelano -- thanks for confirming my suspicions. I have consulted the documentation in https://github.com/PerseusDL/treebank_data/tree/master/v2.1/Latin , and things are much clearer now. The 2.1 Latin treebanks in the repo (as forked yesterday) still show 57 occurrences of //*[@form='c' and @lemma='que1'] when I load them in an XML database, in files:

phi0959.phi006.perseus-lat1.tb.xml
phi0972.phi001.perseus-lat1.xml
tlg0031.tlg027.perseus-lat1.tb.xml
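For anyone without an XML database at hand, a check like the one above can be approximated with Python's standard library. This is only a sketch: the sample document is invented for illustration (including the lemma values other than `que1` and `tu1`), and the `treebank`/`sentence`/`word` nesting is an assumption modelled on the `<word>` element quoted earlier in this thread. ElementTree's limited XPath has no `and`, so the two attribute tests are written as chained predicates.

```python
import xml.etree.ElementTree as ET

# Invented sample data; only the attribute names follow the treebank format.
SAMPLE = """<treebank>
  <sentence id="1">
    <word id="1" form="c" lemma="que1"/>
    <word id="2" form="ne" lemma="ne1"/>
    <word id="3" form="tecum" lemma="tu1"/>
  </sentence>
  <sentence id="2">
    <word id="1" form="nec" lemma="nec1"/>
    <word id="2" form="c" lemma="que1"/>
  </sentence>
</treebank>"""


def count_split_c(root):
    """Count <word> elements with form='c' and lemma='que1' under root."""
    return len(root.findall(".//word[@form='c'][@lemma='que1']"))


root = ET.fromstring(SAMPLE)
count = count_split_c(root)
print(count)
```

To run this against a real file, replace `ET.fromstring(SAMPLE)` with `ET.parse(path).getroot()`, where `path` points at one of the treebank XML files.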

We should think about how to fix this -- if you recommend that nec and neque remain unanalyzed (as, I guess, you are doing with οὐδέ in the Greek tb, and as seems to be the current practice with Latin), I will talk with Filip to organize a correction and a pull request.

@diyclassics -- when e. g. the Morpheus parser in the Morphology Service analyzes nec (http://services.perseids.org/bsp/morphologyservice/analysis/word?lang=lat&word=nec&engine=morpheuslat), it does not split the word in any way. This is completely fine by me!

gcelano commented 8 years ago

Hi @diyclassics,

I would suggest keeping tecum (and similia) separated. This is much better. We need to correct the tokenizer at Perseus in this respect.

balmas commented 8 years ago

See https://github.com/perseids-project/llt-tokenizer/issues/1 for tracking of the requested tokenizer changes.

PerseusDL / treebank_data

Question about tokenizing words like 'tecum' #8