Closed reckart closed 9 years ago
Currently the type "Compound" supports only two splits (part1 and part2) and cannot even
annotate their begin and end positions.
I propose a new type system for compounds with three types:
- Compound extends Annotation
-- splits: List<Split>
- Split extends Annotation
-- linkingMorpheme : LinkingMorpheme
- LinkingMorpheme extends Annotation
This assumes that each Split has at most one following linking morpheme. I'm not sure
if this can be assumed.
Additionally, any compound splitter should support two modes:
- Annotate a token with the types defined above
- Split a token into new tokens that represent the splits either with trailing linking
morphemes or without (controllable via an option)
Original issue reported on code.google.com by richard.eckart
on 2012-06-27 19:58:08
This looks good.
>>This assumes that each Split has at most one following linking morpheme. I'm not
sure if this can be assumed.
I think it is safe to assume this. It has also been assumed in this ACL paper on
language-independent compound splitting: http://www.aclweb.org/anthology/P/P11/P11-1140.pdf
Judith
Original issue reported on code.google.com by eckle.kohler
on 2012-06-29 20:12:58
The three types suggested above do not allow modeling bracketing. Maybe the
following system would be better:
type Compound extends Annotation {
    List<Split> splits;
}
type Split extends Annotation {
    List<Split> splits;
    String type; // one of: morpheme, linking-morpheme
}
Depending on the configuration of the decompounding engine, this system can model:
1) a flat splitting scheme: one Compound with several splits inside. Each split has no
further children.
2) a bracketed scheme: one Compound at the root with one child covering the complete
token, which in turn recursively has 2-5 children (S S, S L S, S S L S, S L S S, or
S L S L S, where S is a split and L a linking morpheme). This could be extended if the
language analyzed requires it.
The type feature indicates whether a split is a linking morpheme or a morpheme.
With the system above, there is also no restriction to a single linking morpheme per
split.
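To make the two schemes concrete, here is a minimal plain-Java sketch of the recursive structure. The class names (SplitSketch, CompoundSketch), the example word "Arbeitszimmer", and the choice of "morpheme" as the type of the covering node are assumptions made for illustration only; these are not the committed UIMA types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the proposed Split type (not the real UIMA type).
class SplitSketch {
    final int begin;
    final int end;
    final String type;              // "morpheme" or "linking-morpheme"
    final List<SplitSketch> splits; // empty for a leaf

    SplitSketch(int begin, int end, String type, SplitSketch... children) {
        this.begin = begin;
        this.end = end;
        this.type = type;
        this.splits = new ArrayList<>(Arrays.asList(children));
    }
}

// Hypothetical stand-in for the proposed Compound type.
class CompoundSketch {
    final String text;
    final List<SplitSketch> splits;

    CompoundSketch(String text, SplitSketch... splits) {
        this.text = text;
        this.splits = new ArrayList<>(Arrays.asList(splits));
    }

    String coveredText(SplitSketch s) {
        return text.substring(s.begin, s.end);
    }

    // 1) Flat scheme: all parts are direct children of the Compound.
    static CompoundSketch flat() {
        return new CompoundSketch("Arbeitszimmer",
                new SplitSketch(0, 6, "morpheme"),
                new SplitSketch(6, 7, "linking-morpheme"),
                new SplitSketch(7, 13, "morpheme"));
    }

    // 2) Bracketed scheme: one child covers the whole token and holds an
    //    S L S sequence. Typing the covering node as "morpheme" is an assumption.
    static CompoundSketch bracketed() {
        return new CompoundSketch("Arbeitszimmer",
                new SplitSketch(0, 13, "morpheme",
                        new SplitSketch(0, 6, "morpheme"),
                        new SplitSketch(6, 7, "linking-morpheme"),
                        new SplitSketch(7, 13, "morpheme")));
    }
}

class CompoundSketchDemo {
    public static void main(String[] args) {
        CompoundSketch flat = CompoundSketch.flat();
        for (SplitSketch s : flat.splits) {
            System.out.println(flat.coveredText(s) + " [" + s.type + "]");
        }
    }
}
```

Both variants use the same two types; only the nesting depth differs, which is why a configuration option on the decompounding engine could switch between them.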
Original issue reported on code.google.com by richard.eckart
on 2012-08-04 16:47:59
I have committed the types as described above for now.
Original issue reported on code.google.com by richard.eckart
on 2012-08-04 18:56:21
Issue 138 has been merged into this issue.
Original issue reported on code.google.com by richard.eckart
on 2013-05-08 13:44:54
Pedro: Currently DKPro-Core has a Compound type and a Split type. However, there is
no type representing a linking morpheme in DKPro-Core.
Original issue reported on code.google.com by richard.eckart
on 2013-05-08 13:45:37
In fact, after looking at the current types again (as explained earlier in this issue),
we have a Split with type "linking-morpheme" to represent that.
One problem, though, may be that it's not easy to select(jcas, LinkingMorpheme.class)
or select(jcas, Morpheme.class). If we want that, we may want to remove the "type"
feature and add subclasses instead.
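A rough plain-Java sketch of the subclass variant, to show why it would make selection by class possible. The classes SplitBase, MorphemeVariant, and LinkingMorphemeVariant and the select helper below are hypothetical stand-ins written for this illustration; the helper only imitates the idea of selecting by class and is not the uimaFIT implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base type replacing a Split that carries a "type" string feature.
class SplitBase {
    final String text;

    SplitBase(String text) {
        this.text = text;
    }
}

// Hypothetical subclasses that take over the role of the "type" feature.
class MorphemeVariant extends SplitBase {
    MorphemeVariant(String text) {
        super(text);
    }
}

class LinkingMorphemeVariant extends SplitBase {
    LinkingMorphemeVariant(String text) {
        super(text);
    }
}

class SelectSketch {
    // Imitates selection by class: keeps every element that is an instance of 'type'.
    static <T> List<T> select(List<? extends SplitBase> index, Class<T> type) {
        List<T> result = new ArrayList<>();
        for (SplitBase s : index) {
            if (type.isInstance(s)) {
                result.add(type.cast(s));
            }
        }
        return result;
    }

    // Sample data: "Arbeitszimmer" as morpheme, linking morpheme, morpheme.
    static List<SplitBase> sample() {
        List<SplitBase> index = new ArrayList<>();
        index.add(new MorphemeVariant("Arbeit"));
        index.add(new LinkingMorphemeVariant("s"));
        index.add(new MorphemeVariant("zimmer"));
        return index;
    }

    public static void main(String[] args) {
        List<SplitBase> index = sample();
        System.out.println(select(index, LinkingMorphemeVariant.class).size());
        System.out.println(select(index, MorphemeVariant.class).size());
        System.out.println(select(index, SplitBase.class).size());
    }
}
```

With the "type" feature, the same result requires selecting every Split and comparing the feature value by hand; with subclasses, selecting the base class still returns everything.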
However, we still have some time before the release and can change that bit later too.
Better get going with the integration of the new decompounding infrastructure ;)
Original issue reported on code.google.com by richard.eckart
on 2013-05-08 14:09:44
Besides that, it is also complicated to select the splits if the compound is composed
of more than two words. E.g., for the compound "Doppelprozessormaschine", select(jcas,
Split.class) will return 4 instances: "Doppel", "prozessormaschine", "prozessor", "maschine".
So I really don't know if this Split type is the best way to represent a split.
Original issue reported on code.google.com by pedrobssantos
on 2013-05-10 15:03:14
That depends on the strategy of mapping the split results to the CAS.
* flat strategy: the annotator just produces a Compound with 4 Split elements
in it. select(jcas, Split.class) will return 4 splits for the compound.
* tree strategy: the annotator produces a fully nested split tree. In that case, you
wouldn't want to use select(jcas, Split.class) of course. You'd do a select(jcas, Compound.class)
and then follow the split tree manually.
There may either be two different annotators, one for the "flat" annotation and one
for the "tree" annotation style, or it may be a configuration parameter controlling
the style.
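As a sketch of what "following the split tree manually" could look like, the plain-Java example below rebuilds the nested splits for "Doppelprozessormaschine" and walks them. SplitNode and SplitTreeWalk are hypothetical names used only here, and the tree shape is an assumption derived from the four instances listed earlier in this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a nested Split annotation.
class SplitNode {
    final String text;
    final List<SplitNode> splits; // empty for a leaf

    SplitNode(String text, SplitNode... children) {
        this.text = text;
        this.splits = new ArrayList<>(Arrays.asList(children));
    }
}

class SplitTreeWalk {
    // Splits directly below the Compound for "Doppelprozessormaschine" (assumed shape).
    static List<SplitNode> compoundSplits() {
        return Arrays.asList(
                new SplitNode("Doppel"),
                new SplitNode("prozessormaschine",
                        new SplitNode("prozessor"),
                        new SplitNode("maschine")));
    }

    // Every node in pre-order: roughly what a type-wide selection over Split sees.
    static List<String> allSplits(List<SplitNode> nodes) {
        List<String> out = new ArrayList<>();
        for (SplitNode n : nodes) {
            out.add(n.text);
            out.addAll(allSplits(n.splits));
        }
        return out;
    }

    // Only the leaves: the atomic parts reached by following the tree.
    static List<String> leaves(List<SplitNode> nodes) {
        List<String> out = new ArrayList<>();
        for (SplitNode n : nodes) {
            if (n.splits.isEmpty()) {
                out.add(n.text);
            } else {
                out.addAll(leaves(n.splits));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(allSplits(compoundSplits()));
        System.out.println(leaves(compoundSplits()));
    }
}
```

Starting from the Compound and recursing keeps intermediate nodes such as "prozessormaschine" out of the result when only the atomic parts are wanted.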
Original issue reported on code.google.com by richard.eckart
on 2013-05-10 15:07:48
Yes, now it seems simpler. I hadn't thought about the tree strategy. Good point.
Original issue reported on code.google.com by pedrobssantos
on 2013-05-13 10:27:35
So, is this issue fixed? Is the morpheme selection problem still an issue?
Original issue reported on code.google.com by pedrobssantos
on 2013-05-13 10:37:45
We have a different problem now: two types called "Morpheme". We should avoid having
two types with the same name in different packages. We should try to get some linguistically
motivated suggestions for what the type names should be.
The "Morpheme" type for the splits is probably badly named (linking morpheme should
be ok). The other "Morpheme" type we have is likewise badly named. I think both should
be renamed.
Regarding renaming the "Morpheme" type we now have in the decompounding system:
[1] calls these parts "heads"
[2] calls them "lexemes"
[3] calls them just "compound parts".
[1] http://www.aclweb.org/anthology/P/P11/P11-1140.pdf
[2] http://diotavelli.net/files/tmarek-linkmorphemes.pdf
[3] http://www.aclweb.org/anthology-new/P/P08/P08-2064.pdf
Original issue reported on code.google.com by richard.eckart
on 2013-05-13 18:24:39
I think CompoundPart would be a good one.
Original issue reported on code.google.com by pedrobssantos
on 2013-05-13 21:17:50
Original issue reported on code.google.com by richard.eckart
on 2012-06-27 19:51:33