Revise types for compound words

GoogleCodeExporter commented 9 years ago

Original issue reported on code.google.com by richard.eckart on 27 Jun 2012 at 7:51

GoogleCodeExporter commented 9 years ago

Currently the type "Compound" supports only two splits (part1 and part2) and 
cannot even annotate their begin and end positions.

I propose a new type system for compounds with three types:

- Compound extends Annotation
-- splits: List<Split>
- Split extends Annotation
-- linkingMorpheme : LinkingMorpheme
- LinkingMorpheme extends Annotation

This assumes that each Split has at most one following linking morpheme. I'm 
not sure if this can be assumed. 

Additionally, any compound splitter should support two modes:

- Annotate a token with the types defined above
- Split a token into new tokens that represent the splits either with trailing 
linking morphemes or without (controllable via an option)
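
A minimal plain-Java sketch of the proposal and of the second mode, assuming hypothetical stand-in classes rather than real UIMA types (the class names mirror the proposal; the `toTokens` helper and its `keepLinkingMorphemes` flag are made up for illustration). It only shows how one option could switch between splits with and without trailing linking morphemes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for the proposed types; not the actual DKPro/UIMA classes.
class CompoundModesSketch {
    static class LinkingMorpheme {
        final int begin, end; // character offsets into the compound text
        LinkingMorpheme(int begin, int end) { this.begin = begin; this.end = end; }
    }

    static class Split {
        final int begin, end;
        final LinkingMorpheme linkingMorpheme; // null if no linking morpheme follows
        Split(int begin, int end, LinkingMorpheme linkingMorpheme) {
            this.begin = begin; this.end = end; this.linkingMorpheme = linkingMorpheme;
        }
    }

    static class Compound {
        final String text;
        final List<Split> splits;
        Compound(String text, List<Split> splits) { this.text = text; this.splits = splits; }
    }

    // Mode 2: turn the splits into token strings, optionally keeping the
    // trailing linking morpheme attached to the split it follows.
    static List<String> toTokens(Compound c, boolean keepLinkingMorphemes) {
        List<String> tokens = new ArrayList<>();
        for (Split s : c.splits) {
            boolean extend = keepLinkingMorphemes && s.linkingMorpheme != null;
            int end = extend ? s.linkingMorpheme.end : s.end;
            tokens.add(c.text.substring(s.begin, end));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "Arbeitsamt" = "Arbeit" + linking morpheme "s" + "amt"
        Compound c = new Compound("Arbeitsamt", Arrays.asList(
                new Split(0, 6, new LinkingMorpheme(6, 7)),
                new Split(7, 10, null)));
        System.out.println(toTokens(c, true));  // [Arbeits, amt]
        System.out.println(toTokens(c, false)); // [Arbeit, amt]
    }
}
```

Because each Split carries at most one LinkingMorpheme, the sketch inherits the assumption questioned above.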

Original comment by richard.eckart on 27 Jun 2012 at 7:58

Changed title: [api.segmentation] Revise types for compound words
Added labels: Type-Enhancement
Removed labels: Type-Defect

GoogleCodeExporter commented 9 years ago

This looks good.

>>This assumes that each Split has at most one following linking morpheme. I'm 
not sure if this can be assumed. 

I think it is safe to assume this; it has also been assumed in this ACL paper 
on language-independent compound splitting: 
http://www.aclweb.org/anthology/P/P11/P11-1140.pdf

Judith

Original comment by eckle.kohler on 29 Jun 2012 at 8:12

GoogleCodeExporter commented 9 years ago

The three types suggested above do not allow bracketing to be modeled. 
Maybe the following system would be better:

type Compound extends Annotation {
  List<Split> splits;
}

type Split extends Annotation {
  List<Split> splits;
  String type; // one of: morpheme, linking-morpheme
}

Depending on the configuration of the decompounding engine, this system can 
model:

1) a flat splitting scheme: one Compound with several splits inside. Each 
split has no further children.
2) a bracketed scheme: one Compound at the root, which has one child 
covering the complete token, which in turn recursively has 2-5 children (S S - S 
L S - S S L S - S L S S - S L S L S). This could be extended if the language 
analyzed requires it. The type feature indicates whether a split is a linking 
morpheme or a regular morpheme.

With the system above, there is also no restriction to a single linking 
morpheme per split.
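
A rough plain-Java sketch of the recursive Split type, assuming a hypothetical enum in place of the string-valued type feature and a made-up `childPattern` helper. It derives the S/L pattern of a split's children so it can be compared against the patterns listed above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the recursive Split type; not the committed UIMA type.
class RecursiveSplitSketch {
    enum SplitType { MORPHEME, LINKING_MORPHEME }

    static class Split {
        final String text;
        final SplitType type;
        final List<Split> splits; // empty for a split without children
        Split(String text, SplitType type, Split... children) {
            this.text = text;
            this.type = type;
            this.splits = Arrays.asList(children);
        }
    }

    // The child patterns named in the comment above.
    static final List<String> LISTED_PATTERNS =
            Arrays.asList("S S", "S L S", "S S L S", "S L S S", "S L S L S");

    // Render the direct children as a pattern: "S" for a morpheme, "L" for a linking morpheme.
    static String childPattern(Split parent) {
        List<String> symbols = new ArrayList<>();
        for (Split child : parent.splits) {
            symbols.add(child.type == SplitType.LINKING_MORPHEME ? "L" : "S");
        }
        return String.join(" ", symbols);
    }

    public static void main(String[] args) {
        // Root split covering the complete token, as in the bracketed scheme.
        Split root = new Split("Arbeitsamt", SplitType.MORPHEME,
                new Split("Arbeit", SplitType.MORPHEME),
                new Split("s", SplitType.LINKING_MORPHEME),
                new Split("amt", SplitType.MORPHEME));
        String pattern = childPattern(root);
        System.out.println(pattern);                           // S L S
        System.out.println(LISTED_PATTERNS.contains(pattern)); // true
    }
}
```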

Original comment by richard.eckart on 4 Aug 2012 at 4:47

GoogleCodeExporter commented 9 years ago

I have committed the types as described above for now.

Original comment by richard.eckart on 4 Aug 2012 at 6:56

Added labels: Milestone-1.4.0

GoogleCodeExporter commented 9 years ago

Original comment by richard.eckart on 13 Oct 2012 at 6:31

Added labels: DKPro-ASL

GoogleCodeExporter commented 9 years ago

Original comment by richard.eckart on 13 Oct 2012 at 6:33

Added labels: Milestone-1.5.0
Removed labels: Milestone-1.4.0

GoogleCodeExporter commented 9 years ago

Original comment by richard.eckart on 21 Feb 2013 at 9:49

Changed title: Revise types for compound words
Added labels: Module-api.segmentation

GoogleCodeExporter commented 9 years ago

Issue 138 has been merged into this issue.

Original comment by richard.eckart on 8 May 2013 at 1:44

GoogleCodeExporter commented 9 years ago

Pedro: Currently DKPro-Core has a Compound type and a Split type. However, 
there is no type representing a linking morpheme in DKPro-Core.

Original comment by richard.eckart on 8 May 2013 at 1:45

GoogleCodeExporter commented 9 years ago

In fact, after looking at the current types again (as explained earlier in this 
issue), we have a Split with type "linking-morpheme" to represent that. 

One problem, though, may be that it's not easy to do select(jcas, 
LinkingMorpheme.class) or select(jcas, Morpheme.class). If we want that, we may 
want to remove the "type" feature and add subclasses instead.
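
To illustrate the subclass variant, here is a small plain-Java sketch. The `select` method below is a hypothetical list filter that only imitates selection by class; it is not the uimaFIT call, and the classes are stand-ins, not the real types:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins: subclasses replace the string-valued "type" feature.
class SelectByClassSketch {
    static class Split {
        final String text;
        Split(String text) { this.text = text; }
    }

    static class Morpheme extends Split {
        Morpheme(String text) { super(text); }
    }

    static class LinkingMorpheme extends Split {
        LinkingMorpheme(String text) { super(text); }
    }

    // Keep only the elements that are instances of the requested class.
    static <T extends Split> List<T> select(List<Split> all, Class<T> type) {
        List<T> result = new ArrayList<>();
        for (Split s : all) {
            if (type.isInstance(s)) {
                result.add(type.cast(s));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Split> splits = Arrays.<Split>asList(
                new Morpheme("Arbeit"), new LinkingMorpheme("s"), new Morpheme("amt"));
        System.out.println(select(splits, LinkingMorpheme.class).size()); // 1
        System.out.println(select(splits, Morpheme.class).size());        // 2
        System.out.println(select(splits, Split.class).size());           // 3
    }
}
```

With a "type" feature instead, the same query would need to select every Split and compare the feature value by hand.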

However, we still have some time before the release and can change that bit 
later too. Better get going with the integration of the new decompounding 
infrastructure ;)

Original comment by richard.eckart on 8 May 2013 at 2:09

GoogleCodeExporter commented 9 years ago

Besides that, it is also complicated to select the splits if the compound is 
composed of more than 2 words. For example, for the compound 
"Doppelprozessormaschine", select(jcas, Split.class) will return 4 instances: 
"Doppel", "prozessormaschine", "prozessor", "maschine". So I really don't know if 
this Split type is the best way to represent a split.

Original comment by pedrobss...@gmail.com on 10 May 2013 at 3:03

GoogleCodeExporter commented 9 years ago

That depends on the strategy of mapping the split results to the CAS. 

* flat strategy: the annotator just produces a Compound with 4 Split 
elements in it. select(jcas, Split.class) will return 4 splits for the compound.
* tree strategy: the annotator produces a fully nested split tree. In that 
case, you wouldn't want to use select(jcas, Split.class) of course. You'd do a 
select(jcas, Compound.class) and then follow the split tree manually.

There may either be two different annotators, one for the "flat" annotation and 
one for the "tree" annotation style, or it may be a configuration parameter 
controlling the style.
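
A plain-Java sketch of following the split tree manually, assuming hypothetical stand-in classes and helper names (`allSplits`, `leafSplits`) rather than the real CAS types. It uses the nesting from the "Doppelprozessormaschine" example and contrasts collecting every Split with collecting only the leaves:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for Compound and Split under the tree strategy.
class SplitTreeSketch {
    static class Split {
        final String text;
        final List<Split> splits;
        Split(String text, Split... children) {
            this.text = text;
            this.splits = Arrays.asList(children);
        }
    }

    static class Compound {
        final String text;
        final List<Split> splits;
        Compound(String text, Split... splits) {
            this.text = text;
            this.splits = Arrays.asList(splits);
        }
    }

    // Every split in the tree, parents before children (pre-order).
    static List<String> allSplits(List<Split> splits) {
        List<String> result = new ArrayList<>();
        for (Split s : splits) {
            result.add(s.text);
            result.addAll(allSplits(s.splits));
        }
        return result;
    }

    // Only the splits without children, i.e. the innermost compound parts.
    static List<String> leafSplits(List<Split> splits) {
        List<String> result = new ArrayList<>();
        for (Split s : splits) {
            if (s.splits.isEmpty()) {
                result.add(s.text);
            } else {
                result.addAll(leafSplits(s.splits));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Compound c = new Compound("Doppelprozessormaschine",
                new Split("Doppel"),
                new Split("prozessormaschine",
                        new Split("prozessor"),
                        new Split("maschine")));
        System.out.println(allSplits(c.splits));  // [Doppel, prozessormaschine, prozessor, maschine]
        System.out.println(leafSplits(c.splits)); // [Doppel, prozessor, maschine]
    }
}
```

Starting from the Compound and descending through its splits keeps the nesting visible, which a type-wide selection of all Split instances does not.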

Original comment by richard.eckart on 10 May 2013 at 3:07

GoogleCodeExporter commented 9 years ago

Yes, now it seems simpler. I hadn't thought about the tree strategy. Good 
point.

Original comment by pedrobss...@gmail.com on 13 May 2013 at 10:27

GoogleCodeExporter commented 9 years ago

So, is this issue fixed? Is the morphemes selection problem still an issue?

Original comment by pedrobss...@gmail.com on 13 May 2013 at 10:37

GoogleCodeExporter commented 9 years ago

We have a different problem now: two types called "Morpheme". We should avoid 
having two types with the same name in different packages. We should try to get 
some linguistically motivated suggestions on what the type names should be. 

The "Morpheme" type for the splits is probably badly named (linking morpheme 
should be ok). The other "Morpheme" type we have is likewise badly named. I 
think both should be renamed.

Regarding renaming the "Morpheme" type we now have in the decompounding 
system: 
[1] calls these parts "heads"
[2] calls them "lexemes"
[3] calls them just "compound parts".

[1] http://www.aclweb.org/anthology/P/P11/P11-1140.pdf
[2] http://diotavelli.net/files/tmarek-linkmorphemes.pdf
[3] http://www.aclweb.org/anthology-new/P/P08/P08-2064.pdf

Original comment by richard.eckart on 13 May 2013 at 6:24

GoogleCodeExporter commented 9 years ago

I think CompoundPart would be a good one.

Original comment by pedrobss...@gmail.com on 13 May 2013 at 9:17

GoogleCodeExporter commented 9 years ago

Original comment by pedrobss...@gmail.com on 16 May 2013 at 2:03

Changed state: Fixed

aminorex / dkpro-core-asl

Revise types for compound words #75