dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core

Enable dependency-based embeddings #818

Open carschno opened 8 years ago

carschno commented 8 years ago

The current WordEmbeddingsEstimator implementation (cf. #798) uses the feature path of any annotation to estimate word embeddings. This does not work straightforwardly with dependencies. However, dependency-based embeddings as proposed by Levy & Goldberg (2014) would be very nice to have, too.

Levy, Omer, and Yoav Goldberg. “Dependency-Based Word Embeddings.” In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2:302–8, 2014. http://www.aclweb.org/anthology/P14-2050.pdf.

reckart commented 8 years ago

I think using an expression language for this would be cool. Some options/pointers:

reckart commented 8 years ago

I think it is important that the mechanism supports flexible combination/concatenation of information obtained from different annotations, e.g. something like dep.governor.text+"-"+dep.governor.pos+"-"+dep.dependencyType+"-"+dep.dependent.text+"-"+dep.dependent.pos.

For this reason, a path language like UIMA feature path or Commons JXPath (used e.g. in the DKPro Core TokenMerger) is not suitable for this case.
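
To make the kind of concatenation described above concrete, here is a minimal Java sketch. The `Tok` and `Dep` classes are hypothetical stand-ins invented for this example, not the DKPro Core annotation types, and the POS tags in the usage are made-up sample values.

```java
import java.util.function.Function;

final class DependencyKeySketch {
    /** Hypothetical stand-in for a token annotation (not a DKPro Core type). */
    static final class Tok {
        final String text;
        final String pos;

        Tok(String text, String pos) {
            this.text = text;
            this.pos = pos;
        }
    }

    /** Hypothetical stand-in for a dependency annotation (not a DKPro Core type). */
    static final class Dep {
        final Tok governor;
        final Tok dependent;
        final String dependencyType;

        Dep(Tok governor, Tok dependent, String dependencyType) {
            this.governor = governor;
            this.dependent = dependent;
            this.dependencyType = dependencyType;
        }
    }

    // Joins information from several annotations into one key, mirroring
    // governor.text-governor.pos-dependencyType-dependent.text-dependent.pos.
    static final Function<Dep, String> KEY = dep -> String.join("-",
            dep.governor.text, dep.governor.pos,
            dep.dependencyType,
            dep.dependent.text, dep.dependent.pos);

    public static void main(String[] args) {
        Dep dep = new Dep(new Tok("discovers", "VBZ"), new Tok("scientist", "NN"), "nsubj");
        System.out.println(KEY.apply(dep)); // prints discovers-VBZ-nsubj-scientist-NN
    }
}
```

A single-valued path expression can only select one of these five values, which is why the combination has to be expressed as a function or a richer expression.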

reckart commented 8 years ago

Hm, actually it might even be possible to use a Java 8 lambda function! Since lambda functions are classes, the parameter can simply be the class object of the lambda function (uimaFIT auto-converts Class to String and back).

createEngineDescription(SomeComponent.class, 
    SomeComponent.PARAM_SELECTOR, Dependency.class,
    SomeComponent.PARAM_EXTRACTOR, (Dependency dep) -> { dep.getGovernor().getCoveredText() ... })

It is still more verbose, though, than languages like Groovy, where the get and set prefixes are optional.
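
The Class-to-String-and-back round trip mentioned above can be sketched in plain Java without uimaFIT. This is only an illustration under assumptions: `Dep`, `GovernorTypeDependent` and `load` are hypothetical names invented for this example, and a named class with a no-argument constructor is used instead of a lambda, on the assumption that a lambda's runtime class cannot be looked up again by its name.

```java
import java.util.function.Function;

final class ExtractorByNameSketch {
    /** Hypothetical minimal dependency (not the DKPro Core Dependency type). */
    static final class Dep {
        final String governorText;
        final String dependentText;
        final String type;

        Dep(String governorText, String dependentText, String type) {
            this.governorText = governorText;
            this.dependentText = dependentText;
            this.type = type;
        }
    }

    /** Named extractor with a public no-arg constructor, so it can be created from its class name. */
    public static final class GovernorTypeDependent implements Function<Dep, String> {
        public GovernorTypeDependent() {
        }

        @Override
        public String apply(Dep dep) {
            return dep.governorText + "-" + dep.type + "-" + dep.dependentText;
        }
    }

    /** Turns a class name (as it would be stored in a String parameter) back into an extractor. */
    @SuppressWarnings("unchecked")
    static Function<Dep, String> load(String className) {
        try {
            Class<?> cls = Class.forName(className);
            return (Function<Dep, String>) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate extractor " + className, e);
        }
    }

    public static void main(String[] args) {
        // Class -> String, as the parameter value would be serialized.
        String param = GovernorTypeDependent.class.getName();
        // String -> instance, as the component would restore it.
        Function<Dep, String> extractor = load(param);
        System.out.println(extractor.apply(new Dep("discovers", "scientist", "nsubj")));
        // prints discovers-nsubj-scientist
    }
}
```

Whether a lambda passed directly as a parameter value would survive the same round trip would need to be checked against uimaFIT itself.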

carschno commented 8 years ago

Another approach might be to create TokenSequences depending on annotation-defined contexts. Typically, a TokenSequence generated by TokenSequenceGenerator is a list of tokens in a document (or sentence, etc.):

this is a sentence .

It could be extended to create TokenSequences e.g. from dependencies:

 australian/amod scientist discovers/nsubj⁻¹
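
A rough sketch of how such contexts could be derived from dependency arcs, following the scheme in Levy & Goldberg (2014): the head of an arc gets `modifier/relation` as a context, and the modifier gets `head/relation` with an inverse marker. The `Arc` class and `contexts` method are hypothetical names for this example and are not part of TokenSequenceGenerator; the inverse marker is written as a plain `-1` suffix, and words are keyed by their form for brevity, so repeated word forms would share one context list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DependencyContextSketch {
    /** Hypothetical dependency arc: relation(head, modifier). */
    static final class Arc {
        final String head;
        final String modifier;
        final String relation;

        Arc(String head, String modifier, String relation) {
            this.head = head;
            this.modifier = modifier;
            this.relation = relation;
        }
    }

    /** Suffix marking the inverse direction of a relation (assumed notation). */
    static final String INVERSE = "-1";

    /** Collects, for every word form, its dependency-based contexts in arc order. */
    static Map<String, List<String>> contexts(List<Arc> arcs) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Arc arc : arcs) {
            // The head sees its modifier, labelled with the relation.
            result.computeIfAbsent(arc.head, k -> new ArrayList<>())
                    .add(arc.modifier + "/" + arc.relation);
            // The modifier sees its head, labelled with the inverse relation.
            result.computeIfAbsent(arc.modifier, k -> new ArrayList<>())
                    .add(arc.head + "/" + arc.relation + INVERSE);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Arc> arcs = Arrays.asList(
                new Arc("scientist", "australian", "amod"),
                new Arc("discovers", "scientist", "nsubj"));
        System.out.println(contexts(arcs).get("scientist"));
        // prints [australian/amod, discovers/nsubj-1]
    }
}
```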