MedKhem / grobid-dictionaries


Segmentation of the morphological and grammatical information #7

Closed MedKhem closed 6 years ago

MedKhem commented 7 years ago

This issue is about extracting all the morphological and grammatical information from the previous level. This information can appear directly in the \<entry> under the extracted \<form> block, in \<sense>, and/or in the \<re> blocks.

[screenshot: 2017-04-28 at 16:14:44] [screenshot: 2017-04-28 at 16:27:36]

becomes

[screenshot: 2017-04-28 at 16:06:48]
MedKhem commented 7 years ago

Since a single model does not support nested structures, I recommend using two different models for this case:

  1. The first model, "form", structures the \<form> block into \<orth>, \<pron> and \<gramGrp> blocks. The \<orth> and \<pron> are then gathered under a \<form> element, and the \<gramGrp> is further segmented by the second model.

  2. The second model, "grammatical_group", segments the \<gramGrp>. For the moment, this model uses just 4 labels: \<pos>, \<tns>, \<gen> and \<number>.
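
To make the cascade concrete, here is a minimal Python sketch of the two-step idea. Everything in it is an assumption for illustration: `label_form` and `label_gram` are toy rule-based stand-ins for the trained "form" and "grammatical_group" models, the tiny lexicon only covers the example tokens, and placing \<gramGrp> as a sibling of \<form> under \<entry> is just one possible layout.

```python
import xml.etree.ElementTree as ET

# Toy lexicon for the second model (hypothetical; covers only the example).
GRAM_LEXICON = {"n.": "pos", "v.": "pos", "f.": "gen", "m.": "gen",
                "pl.": "number", "pres.": "tns"}


def label_form(tokens):
    """Stand-in for the "form" model: one of orth / pron / gramGrp per token."""
    labels = []
    for tok in tokens:
        if tok.startswith("[") and tok.endswith("]"):
            labels.append("pron")
        elif tok in GRAM_LEXICON:
            labels.append("gramGrp")
        else:
            labels.append("orth")
    return labels


def label_gram(tokens):
    """Stand-in for the "grammatical_group" model: pos / tns / gen / number."""
    return [GRAM_LEXICON[tok] for tok in tokens]


def group(tokens, labels):
    """Merge consecutive tokens sharing a label into (label, text) spans."""
    spans = []
    for tok, lab in zip(tokens, labels):
        if spans and spans[-1][0] == lab:
            spans[-1] = (lab, spans[-1][1] + " " + tok)
        else:
            spans.append((lab, tok))
    return spans


def segment_form(tokens):
    """Run the first model, then hand each gramGrp span to the second model."""
    entry = ET.Element("entry")
    form = ET.SubElement(entry, "form")
    for label, text in group(tokens, label_form(tokens)):
        if label == "gramGrp":
            gram = ET.SubElement(entry, "gramGrp")
            gram_tokens = text.split()
            for g_label, g_text in group(gram_tokens, label_gram(gram_tokens)):
                ET.SubElement(gram, g_label).text = g_text
        else:
            ET.SubElement(form, label).text = text
    return entry


result = ET.tostring(segment_form(["table", "[tabl]", "n.", "f."]),
                     encoding="unicode")
print(result)
```

The point of the sketch is only the control flow: the first pass never looks inside \<gramGrp>, so each model stays flat and the nesting comes from chaining them.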

MedKhem commented 7 years ago

Note that the "lexical-entry" model could first be used to segment the \<re> block, after which the "form" model could be applied. Before using the "lexical-entry" model this way, I would suggest changing its feature generation to add a feature that distinguishes a lexical entry from a related entry, since there are some specific lexical differences between them.
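
As a rough illustration of the suggested feature, the Python sketch below appends one extra column to a whitespace-separated feature line per token. The function name, the other columns and the `RELENTRY` / `LEXENTRY` values are all hypothetical and do not reflect the actual feature set of the "lexical-entry" model; the caller is assumed to know whether it is processing the content of an \<re> block.

```python
def token_features(token, in_related_entry):
    """Build one whitespace-separated feature line for a token.

    The last column is the proposed extra feature: it tells the model
    whether the token comes from a related entry or a main lexical entry.
    """
    features = [
        token,
        token.lower(),
        "INITCAP" if token[:1].isupper() else "NOCAPS",
        "ALLDIGIT" if token.isdigit() else "NODIGIT",
        # Hypothetical new feature distinguishing the two kinds of entry.
        "RELENTRY" if in_related_entry else "LEXENTRY",
    ]
    return " ".join(features)


main_line = token_features("Table", in_related_entry=False)
related_line = token_features("table", in_related_entry=True)
print(main_line)
print(related_line)
```

With a column like this, the same model can learn different label distributions for the two contexts without needing a separate model per block type.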

To extract the \<gramGrp> in a sense, a "sense_gram" model should be used (to get the \<gramGrp> and \<sense> blocks). To segment that \<gramGrp>, the same "grammatical_group" model used for the \<form> block could be reused for the \<sense> block.
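
To summarise the cascades proposed in this thread, here is a small Python sketch that writes them down as data and checks which model is shared. The dictionary layout and the `shared_models` helper are illustrative assumptions, not part of the project; only the model names come from the comments above.

```python
# Block type -> ordered chain of models, as proposed in this thread.
CASCADES = {
    "form": ["form", "grammatical_group"],
    "sense": ["sense_gram", "grammatical_group"],
    "re": ["lexical-entry", "form", "grammatical_group"],
}


def shared_models(cascades):
    """Return the models that appear in every cascade."""
    chains = [set(chain) for chain in cascades.values()]
    return set.intersection(*chains)


shared = shared_models(CASCADES)
print(shared)
```

Only "grammatical_group" is common to all three chains, which is why it is worth training it once and reusing it for every \<gramGrp>, wherever that block was found.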