Segmentation of the morphological and grammatical information

MedKhem commented 7 years ago

This issue is about extracting all morphological and grammatical information from the previous level. This information can appear directly in the \<entry> under the extracted \<form> block, or in the \<sense> and/or \<re> blocks.

In the case of \<entry>, the following example:

becomes

In the case of \<sense>:

becomes

In the case of \<re>:

becomes

MedKhem commented 7 years ago

Since a single model does not support nested structures, I recommend using two different models for this case:

The first model, "form", structures the \<form> block into \<orth>, \<pron> and \<gramGrp> blocks. The \<orth> and \<pron> are then gathered under a \<form> element, and the \<gramGrp> is further segmented with the second model.
The second model, "grammatical_group", segments the \<gramGrp>. For the moment, this model uses just 4 labels: \<pos>, \<tns>, \<gen> and \<number>.
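To make the two-step idea concrete, here is a minimal Python sketch of how the output of the two models might be assembled. It is not GROBID code: the function names, the (token, label) input format and the toy lookup that stands in for the "grammatical_group" model are all assumptions made for illustration. Only the label names come from the description above.

```python
import xml.etree.ElementTree as ET

# Label sets taken from the description above.
FORM_LABELS = {"orth", "pron", "gramGrp"}
GRAM_LABELS = {"pos", "tns", "gen", "number"}

# Toy stand-in for a trained "grammatical_group" model (hypothetical data).
TOY_GRAM_LEXICON = {"n.": "pos", "v.": "pos", "m.": "gen", "f.": "gen",
                    "sg.": "number", "pl.": "number", "pres.": "tns"}


def toy_gram_model(tokens):
    """Label each token of a <gramGrp> text; a real model would be learned."""
    return [(token, TOY_GRAM_LEXICON[token]) for token in tokens]


def group_labelled_tokens(pairs):
    """Merge consecutive tokens that share a label into (label, text) spans."""
    spans = []
    for token, label in pairs:
        if spans and spans[-1][0] == label:
            spans[-1] = (label, spans[-1][1] + " " + token)
        else:
            spans.append((label, token))
    return spans


def build_form(form_pairs, gram_model):
    """Apply the second model to every <gramGrp> span found by the first.

    form_pairs is the assumed output of the "form" model: (token, label)
    pairs. Returns the <form> element (holding <orth> and <pron>) and the
    list of segmented <gramGrp> elements.
    """
    form = ET.Element("form")
    gram_groups = []
    for label, text in group_labelled_tokens(form_pairs):
        if label not in FORM_LABELS:
            raise ValueError(f"unexpected form label: {label}")
        if label == "gramGrp":
            gram_grp = ET.Element("gramGrp")
            for g_label, g_text in group_labelled_tokens(gram_model(text.split())):
                if g_label not in GRAM_LABELS:
                    raise ValueError(f"unexpected gramGrp label: {g_label}")
                ET.SubElement(gram_grp, g_label).text = g_text
            gram_groups.append(gram_grp)
        else:
            ET.SubElement(form, label).text = text
    return form, gram_groups


# Invented example input, as the "form" model might label it.
labelled = [("cats", "orth"), ("kats", "pron"), ("n.", "gramGrp"), ("pl.", "gramGrp")]
form, gram_groups = build_form(labelled, toy_gram_model)
print(ET.tostring(form, encoding="unicode"))
print(ET.tostring(gram_groups[0], encoding="unicode"))
```

Keeping the second model behind a plain callable means the same "grammatical_group" step can be reused wherever a \<gramGrp> span turns up, without the first model needing to know about nested labels.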

MedKhem commented 7 years ago

Note that the "lexical-entry" model could be used as a first step to segment the \<re> block, after which the "form" model could be applied. Before using the "lexical-entry" model this way, I would suggest changing its feature generation to add a feature that distinguishes a lexical entry from a related entry, since there are some specific lexical differences between them.

Extracting \<gramGrp> within a sense should be handled by a "sense_gram" model (to get the \<gramGrp> and \<sense> blocks). For segmenting that \<gramGrp>, the same "grammatical_group" model used for the \<form> block could be used for the \<sense> block.
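The cascade proposed in this thread can be summarised as a small routing table. The sketch below is only an illustration: the table layout and the `models_along` helper are hypothetical, and the table reflects one reading of the comments above (in particular, that the "lexical-entry" model is the starting point and is reused on \<re>).

```python
# Hypothetical routing table: for each model, the model that would further
# segment each block it produces. Model names are the ones used in this thread.
NEXT_MODEL = {
    "lexical-entry": {
        "form": "form",
        "sense": "sense_gram",
        "re": "lexical-entry",  # related entries reuse the same model
    },
    "form": {"gramGrp": "grammatical_group"},
    "sense_gram": {"gramGrp": "grammatical_group"},
}


def models_along(path, start="lexical-entry"):
    """Return the models applied while descending through the blocks in path."""
    chain = [start]
    current = start
    for block in path:
        try:
            current = NEXT_MODEL[current][block]
        except KeyError:
            raise ValueError(f"no model segments <{block}> after '{current}'") from None
        chain.append(current)
    return chain


print(models_along(["re", "form", "gramGrp"]))
print(models_along(["sense", "gramGrp"]))
```

Both paths end in "grammatical_group", which is the point of the proposal: one model segments \<gramGrp> regardless of whether it was found in a \<form> or a \<sense> block.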

MedKhem / grobid-dictionaries, issue #7