dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core

Revise types for compound words #75

Closed reckart closed 9 years ago

reckart commented 9 years ago

Original issue reported on code.google.com by richard.eckart on 2012-06-27 19:51:33

reckart commented 9 years ago
Currently the type "Compound" supports only two splits (part1 and part2) and cannot
even annotate their begin and end positions.

I propose a new type system for compounds with three types:

- Compound extends Annotation
-- splits: List<Split>
- Split extends Annotation
-- linkingMorpheme : LinkingMorpheme
- LinkingMorpheme extends Annotation

This assumes that each Split has at most one following linking morpheme. I'm not sure
if this can be assumed. 

Additionally, any compound splitter should support two modes:

- Annotate a token with the types defined above
- Split a token into new tokens that represent the splits either with trailing linking
morphemes or without (controllable via an option)
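
To make the proposal concrete, here is a minimal sketch in plain Java. The classes are
hypothetical stand-ins for the proposed types, not the actual UIMA/JCas classes, and the
names Span, CompoundTokenizer, toTokens and keepLinkingMorphemes are assumptions made up
for illustration. The word "Arbeitsamt" (Arbeit + s + amt) is just an example input.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-ins for the proposed types (hypothetical, not the real JCas classes).
class Span {
    final int begin;
    final int end;

    Span(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }
}

class LinkingMorpheme extends Span {
    LinkingMorpheme(int begin, int end) {
        super(begin, end);
    }
}

class Split extends Span {
    // At most one linking morpheme directly after this split; null if there is none.
    final LinkingMorpheme linkingMorpheme;

    Split(int begin, int end, LinkingMorpheme linkingMorpheme) {
        super(begin, end);
        this.linkingMorpheme = linkingMorpheme;
    }
}

class Compound extends Span {
    final List<Split> splits = new ArrayList<>();

    Compound(int begin, int end) {
        super(begin, end);
    }
}

class CompoundTokenizer {
    // Second mode: turn the splits into token texts, optionally keeping
    // the trailing linking morpheme attached to the preceding split.
    static List<String> toTokens(String text, Compound compound, boolean keepLinkingMorphemes) {
        List<String> tokens = new ArrayList<>();
        for (Split split : compound.splits) {
            int end = split.end;
            if (keepLinkingMorphemes && split.linkingMorpheme != null) {
                end = split.linkingMorpheme.end;
            }
            tokens.add(text.substring(split.begin, end));
        }
        return tokens;
    }

    // Illustrative input: "Arbeitsamt" = "Arbeit" + linking "s" + "amt".
    static Compound arbeitsamt() {
        Compound compound = new Compound(0, 10);
        compound.splits.add(new Split(0, 6, new LinkingMorpheme(6, 7)));
        compound.splits.add(new Split(7, 10, null));
        return compound;
    }
}
```

With keepLinkingMorphemes set, the example yields the tokens "Arbeits" and "amt";
without it, "Arbeit" and "amt".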

Original issue reported on code.google.com by richard.eckart on 2012-06-27 19:58:08

reckart commented 9 years ago
This looks good.

>>This assumes that each Split has at most one following linking morpheme. I'm not
sure if this can be assumed. 

I think it is safe to assume this; the same assumption is made in this ACL paper on
language-independent compound splitting: http://www.aclweb.org/anthology/P/P11/P11-1140.pdf

Judith

Original issue reported on code.google.com by eckle.kohler on 2012-06-29 20:12:58

reckart commented 9 years ago
The above suggestion of three types does not allow bracketing to be modeled. Maybe the
following system would be better:

type Compound extends Annotation {
  List<Split> splits;
}

type Split extends Annotation {
  List<Split> splits;
  String type [morpheme, linking-morpheme]
}

Depending on the configuration of the decompounding engine, this system can model:

1) a flat splitting scheme: one Compound with several splits inside, none of which has
further children
2) a bracketed scheme: one Compound at the root with a single child covering the
complete token, which in turn recursively has 2-5 children (S S, S L S, S S L S,
S L S S, S L S L S); this could be extended if the language being analyzed requires it.
The type feature indicates whether a split is a linking morpheme or a regular morpheme.

With the system above, there is also no restriction to a single linking morpheme per
split.
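
As a rough illustration of the bracketed scheme, the following plain-Java sketch models
the recursive Split and checks the child patterns listed above. SplitNode,
BracketingCheck, childPattern and isWellFormed are hypothetical names, and the classes
are stand-ins rather than the committed UIMA types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical plain-Java stand-in for the recursive Split type.
class SplitNode {
    static final String MORPHEME = "morpheme";
    static final String LINKING_MORPHEME = "linking-morpheme";

    final int begin;
    final int end;
    final String type;
    final List<SplitNode> splits = new ArrayList<>();

    SplitNode(int begin, int end, String type, SplitNode... children) {
        this.begin = begin;
        this.end = end;
        this.type = type;
        this.splits.addAll(Arrays.asList(children));
    }

    // Encodes the direct children as a pattern such as "SLS"
    // (S = morpheme split, L = linking morpheme).
    String childPattern() {
        StringBuilder pattern = new StringBuilder();
        for (SplitNode child : splits) {
            pattern.append(LINKING_MORPHEME.equals(child.type) ? 'L' : 'S');
        }
        return pattern.toString();
    }
}

class BracketingCheck {
    // The five child patterns named in the comment above.
    static final Set<String> ALLOWED =
            new HashSet<>(Arrays.asList("SS", "SLS", "SSLS", "SLSS", "SLSLS"));

    // A leaf is always fine; an inner node needs an allowed pattern
    // and well-formed children.
    static boolean isWellFormed(SplitNode node) {
        if (node.splits.isEmpty()) {
            return true;
        }
        if (!ALLOWED.contains(node.childPattern())) {
            return false;
        }
        for (SplitNode child : node.splits) {
            if (!isWellFormed(child)) {
                return false;
            }
        }
        return true;
    }
}
```

A flat result is simply a node whose children are all leaves; a bracketed result nests
further SplitNode instances inside the morpheme children.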

Original issue reported on code.google.com by richard.eckart on 2012-08-04 16:47:59

reckart commented 9 years ago
I have committed the types as described above for now.

Original issue reported on code.google.com by richard.eckart on 2012-08-04 18:56:21

reckart commented 9 years ago
(No text was entered with this change)

Original issue reported on code.google.com by richard.eckart on 2012-10-13 18:31:41

reckart commented 9 years ago
(No text was entered with this change)

Original issue reported on code.google.com by richard.eckart on 2012-10-13 18:33:40

reckart commented 9 years ago
(No text was entered with this change)

Original issue reported on code.google.com by richard.eckart on 2013-02-21 09:49:54

reckart commented 9 years ago
Issue 138 has been merged into this issue.

Original issue reported on code.google.com by richard.eckart on 2013-05-08 13:44:54

reckart commented 9 years ago
Pedro: Currently DKPro-Core has a Compound type and a Split type. However, there is
no type representing a linking morpheme in DKPro-Core.

Original issue reported on code.google.com by richard.eckart on 2013-05-08 13:45:37

reckart commented 9 years ago
In fact, after looking at the current types again (as explained earlier in this issue),
we have a Split with type "linking-morpheme" to represent that. 

One problem, though, may be that it's not easy to do select(jcas, LinkingMorpheme.class)
or select(jcas, Morpheme.class). If we want that, we may want to remove the "type"
feature and add subclasses instead.
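
A small sketch of why subclasses would help, in plain Java. The select method below is
only a stand-in that filters a list by class; it is not the real JCas-based select, and
SplitPart and SelectSketch are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; the real types would be UIMA annotations.
class SplitPart {
    final String text;

    SplitPart(String text) {
        this.text = text;
    }
}

class Morpheme extends SplitPart {
    Morpheme(String text) {
        super(text);
    }
}

class LinkingMorpheme extends SplitPart {
    LinkingMorpheme(String text) {
        super(text);
    }
}

class SelectSketch {
    // Stand-in for a type-based select: keeps only instances of the given class.
    static <T> List<T> select(List<? extends SplitPart> parts, Class<T> type) {
        List<T> result = new ArrayList<>();
        for (SplitPart part : parts) {
            if (type.isInstance(part)) {
                result.add(type.cast(part));
            }
        }
        return result;
    }
}
```

With a single Split type and a "type" string feature, the same query would need an extra
filtering step on the feature value after selecting all splits.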

However, we still have some time before the release and can change that bit later too.
Better get going with the integration of the new decompounding infrastructure ;)

Original issue reported on code.google.com by richard.eckart on 2013-05-08 14:09:44

reckart commented 9 years ago
Besides that, it is also complicated to select the splits if the compound is composed
of more than two words. E.g., for the compound "Doppelprozessormaschine", select(jcas,
Split.class) will return 4 instances: "Doppel", "prozessormaschine", "prozessor", "maschine".
So I really don't know if this Split type is the best way to represent a split.

Original issue reported on code.google.com by pedrobssantos on 2013-05-10 15:03:14

reckart commented 9 years ago
That depends on the strategy of mapping the split results to the CAS. 

* flat strategy: the annotator just produces a Compound with 4 Split elements
in it. select(jcas, Split.class) will return 4 splits for the compound.
* tree strategy: the annotator produces a fully nested split tree. In that case, you
wouldn't want to use select(jcas, Split.class) of course. You'd do a select(jcas, Compound.class)
and then follow the split tree manually.

There may either be two different annotators, one for the "flat" annotation and one
for the "tree" annotation style, or it may be a configuration parameter controlling
the style.
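
The difference can be sketched in plain Java for the "Doppelprozessormaschine" example.
TreeSplit and SplitTreeWalker are hypothetical stand-ins, not the actual annotation
types: allSplits mimics collecting every Split regardless of nesting, while leaves
follows the split tree down to the innermost parts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical plain-Java stand-in for a nested Split annotation.
class TreeSplit {
    final int begin;
    final int end;
    final List<TreeSplit> splits;

    TreeSplit(int begin, int end, TreeSplit... children) {
        this.begin = begin;
        this.end = end;
        this.splits = Arrays.asList(children);
    }
}

class SplitTreeWalker {
    // Every split in the tree, parents before children, regardless of nesting.
    static List<String> allSplits(String text, List<TreeSplit> splits) {
        List<String> result = new ArrayList<>();
        for (TreeSplit split : splits) {
            result.add(text.substring(split.begin, split.end));
            result.addAll(allSplits(text, split.splits));
        }
        return result;
    }

    // Only the innermost splits, found by following the tree down to its leaves.
    static List<String> leaves(String text, List<TreeSplit> splits) {
        List<String> result = new ArrayList<>();
        for (TreeSplit split : splits) {
            if (split.splits.isEmpty()) {
                result.add(text.substring(split.begin, split.end));
            } else {
                result.addAll(leaves(text, split.splits));
            }
        }
        return result;
    }

    // Tree for "Doppelprozessormaschine": Doppel + (prozessor + maschine).
    static List<TreeSplit> example() {
        return Arrays.asList(
                new TreeSplit(0, 6),
                new TreeSplit(6, 23, new TreeSplit(6, 15), new TreeSplit(15, 23)));
    }
}
```

On this tree, allSplits returns the four instances mentioned above, while leaves
returns only "Doppel", "prozessor" and "maschine".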

Original issue reported on code.google.com by richard.eckart on 2013-05-10 15:07:48

reckart commented 9 years ago
Yes, now it seems simpler. I haven't thought about the tree strategy. Good point.

Original issue reported on code.google.com by pedrobssantos on 2013-05-13 10:27:35

reckart commented 9 years ago
So, is this issue fixed? Is the morpheme selection problem still an issue?

Original issue reported on code.google.com by pedrobssantos on 2013-05-13 10:37:45

reckart commented 9 years ago
We have a different problem now: two types called "Morpheme". We should avoid having
two types with the same name in different packages. We should try to get some linguistically
motivated suggestions on what the type names should be. 

The "Morpheme" type for the splits is probably badly named (linking morpheme should
be ok). The other "Morpheme" type we have is likewise badly named. I think both should
be renamed.

Regarding renaming the "Morpheme" type we now have in the decompounding system:

[1] calls these parts "heads"
[2] calls them "lexemes"
[3] calls them just "compound parts".

[1] http://www.aclweb.org/anthology/P/P11/P11-1140.pdf
[2] http://diotavelli.net/files/tmarek-linkmorphemes.pdf
[3] http://www.aclweb.org/anthology-new/P/P08/P08-2064.pdf

Original issue reported on code.google.com by richard.eckart on 2013-05-13 18:24:39

reckart commented 9 years ago
I think CompoundPart would be a good one.

Original issue reported on code.google.com by pedrobssantos on 2013-05-13 21:17:50

reckart commented 9 years ago
(No text was entered with this change)

Original issue reported on code.google.com by pedrobssantos on 2013-05-16 14:03:25