Open carschno opened 8 years ago
I think using an expression language for this would be cool. Some options/pointers:
I think it is important that the mechanism supports flexible combination/concatenation of information obtained from different annotations, e.g. something like dep.governor.text+"-"+dep.governor.pos+"-"+dep.dependencyType+"-"+dep.dependent.text+"-"+dep.dependent.pos.
For this reason, path languages like UIMA feature paths or Commons JXPath (used e.g. in the DKPro Core TokenMerger) are not suitable for this case.
Hm, actually it might even be possible to use a Java 8 lambda function! Since lambda functions are classes, the parameter can simply be the class object of the lambda function (uimaFIT auto-converts Class to String and back).
createEngineDescription(SomeComponent.class,
    SomeComponent.PARAM_SELECTOR, Dependency.class,
    SomeComponent.PARAM_EXTRACTOR, (Dependency dep) -> { dep.getGovernor().getCoveredText() ... })
It is still more verbose, though, than languages like Groovy, where the get and set prefixes are optional.
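A minimal sketch of the Class-to-String round trip behind this idea, using plain Java only. It is an assumption-laden illustration, not uimaFIT code: `Dep`, `GovTypeDepExtractor` and `load` are hypothetical names, and `Dep` is a stand-in for a real annotation type. The sketch uses a named class implementing `Function` rather than an actual lambda, because I am not certain that the synthetic class behind a lambda can be looked up again by its name.

```java
import java.util.function.Function;

public class ExtractorByClassName {

    /** Hypothetical stand-in for a dependency annotation (not a DKPro Core type). */
    public static class Dep {
        final String governor;
        final String type;
        final String dependent;

        public Dep(String governor, String type, String dependent) {
            this.governor = governor;
            this.type = type;
            this.dependent = dependent;
        }
    }

    /** Named extractor; the implicit public no-arg constructor allows reflective creation. */
    public static class GovTypeDepExtractor implements Function<Dep, String> {
        @Override
        public String apply(Dep d) {
            // Concatenate information from several fields of the annotation.
            return d.governor + "-" + d.type + "-" + d.dependent;
        }
    }

    /** Turns a class name (as it would arrive in a String parameter) back into an extractor. */
    @SuppressWarnings("unchecked")
    public static Function<Dep, String> load(String className)
            throws ReflectiveOperationException {
        Class<?> cls = Class.forName(className);
        return (Function<Dep, String>) cls.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        // The parameter value is just the class name as a String.
        String param = GovTypeDepExtractor.class.getName();
        Function<Dep, String> extractor = load(param);
        System.out.println(extractor.apply(new Dep("discovers", "nsubj", "scientist")));
    }
}
```

The component would only ever see the class name, so the extractor has to be instantiable without arguments; any configuration of the extractor itself would need a separate mechanism.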
Another approach might be to create TokenSequences depending on annotation-defined contexts. Typically, a TokenSequence generated by TokenSequenceGenerator is a list of tokens in a document (or sentence etc.):
this is a sentence .
It could be extended to create TokenSequences e.g. from dependencies:
australian/amod scientist discovers/nsubj−1
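A rough sketch of how such a dependency-based sequence could be built for one target token. Everything here is hypothetical: `Tok`, `Dep` and `contextSequence` are illustrative stand-ins, not existing TokenSequenceGenerator or DKPro Core API. It assumes that dependents of the target are written as text/type, that the governor of the target is written as text/type with an inverse marker, and that entries are ordered by sentence position. The inverse marker is written as an ASCII "-1" suffix.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DependencyContexts {

    /** Hypothetical minimal token: surface form plus position in the sentence. */
    public static final class Tok {
        final String text;
        final int index;

        public Tok(String text, int index) {
            this.text = text;
            this.index = index;
        }
    }

    /** Hypothetical minimal dependency edge from governor to dependent. */
    public static final class Dep {
        final Tok governor;
        final String type;
        final Tok dependent;

        public Dep(Tok governor, String type, Tok dependent) {
            this.governor = governor;
            this.type = type;
            this.dependent = dependent;
        }
    }

    /** Builds the context sequence around one target token, ordered by sentence position. */
    public static String contextSequence(Tok target, List<Dep> deps) {
        List<Map.Entry<Integer, String>> items = new ArrayList<>();
        items.add(new AbstractMap.SimpleEntry<>(target.index, target.text));
        for (Dep d : deps) {
            if (d.governor == target) {
                // Target governs this token: plain relation label.
                items.add(new AbstractMap.SimpleEntry<>(
                        d.dependent.index, d.dependent.text + "/" + d.type));
            } else if (d.dependent == target) {
                // Target is governed by this token: inverse relation label.
                items.add(new AbstractMap.SimpleEntry<>(
                        d.governor.index, d.governor.text + "/" + d.type + "-1"));
            }
        }
        items.sort(Map.Entry.comparingByKey());
        return items.stream().map(Map.Entry::getValue).collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        Tok australian = new Tok("australian", 0);
        Tok scientist = new Tok("scientist", 1);
        Tok discovers = new Tok("discovers", 2);
        List<Dep> deps = Arrays.asList(
                new Dep(scientist, "amod", australian),
                new Dep(discovers, "nsubj", scientist));
        // prints "australian/amod scientist discovers/nsubj-1"
        System.out.println(contextSequence(scientist, deps));
    }
}
```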
The current WordEmbeddingsEstimator implementation (cf. #798) uses the feature path of any annotation to estimate word embeddings. This does not work (straightforwardly) with dependencies. However, dependency-based embeddings as proposed by Levy & Goldberg (2014) would be very nice to have, too.