dkpro / dkpro-tc

UIMA-based text classification framework built on top of DKPro Core and DKPro Lab.
https://dkpro.github.io/dkpro-tc/

ModalVerbsFeatureExtractor for German #58

Closed daxenberger closed 8 years ago

daxenberger commented 9 years ago

Originally reported on Google Code with ID 58

It would be very nice to have a ModalVerbsFeatureExtractor for German as well.

The modal verbs that this FE looks up could be passed in as a wordlist parameter,
which would make the ModalVerbsFeatureExtractor easier to use across
languages.
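
A minimal sketch of the wordlist idea, in plain Java without any UIMA or DKPro TC classes. The class `WordListCounter` and its methods are hypothetical names chosen for illustration, and the one-entry-per-line list format is an assumption, not something the existing extractor defines.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of DKPro TC: counts tokens found in a word list.
final class WordListCounter {

    // Parses a word list with one entry per line (assumed format); blank lines are skipped.
    static Set<String> parseWordList(String content) {
        return Arrays.stream(content.split("\\R"))
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .map(line -> line.toLowerCase(Locale.ROOT))
                .collect(Collectors.toSet());
    }

    // Counts tokens whose lowercased form is in the word list.
    static long countMatches(List<String> tokens, Set<String> wordList) {
        return tokens.stream()
                .map(token -> token.toLowerCase(Locale.ROOT))
                .filter(wordList::contains)
                .count();
    }

    public static void main(String[] args) {
        // A few German modal verb forms; a real list would need all inflected forms.
        Set<String> germanModals = parseWordList("kann\nmuss\nsoll\nwill\ndarf\nmag\n");
        List<String> tokens = List.of("Er", "kann", "kommen", ",", "aber", "er", "muss", "nicht", ".");
        System.out.println(countMatches(tokens, germanModals)); // prints 2
    }
}
```

Switching languages then only means passing a different list; the counting code stays the same.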

Reported by eckle.kohler on 2013-11-10 13:17:35

daxenberger commented 9 years ago
Wouldn't that be something like a DictionaryFeatureExtractor?

Reported by richard.eckart on 2013-11-10 13:26:10

daxenberger commented 9 years ago
There's the de.tudarmstadt.ukp.dkpro.tc.features.content.TopicWordsFeatureExtractor,
which is basically a simple dictionary feature extractor (and should be renamed
accordingly, btw).

It adds a hard-coded prefix to each feature - I suggest we make this prefix configurable,
so you could set it to "Modal_" and then easily identify the modal features later on.
The mechanics of the extractor are pretty much the same for these cases.
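
To illustrate the configurable prefix, here is a rough sketch in plain Java. It is not the TopicWordsFeatureExtractor code; `PrefixedDictionaryFeatures` and `extract` are hypothetical names, and the dictionary is assumed to contain lowercase entries.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch, not the TopicWordsFeatureExtractor implementation.
final class PrefixedDictionaryFeatures {

    // Returns one count feature per dictionary entry, named prefix + entry.
    // Assumes the dictionary entries are already lowercase.
    static Map<String, Integer> extract(String prefix, Set<String> dictionary, List<String> tokens) {
        Map<String, Integer> features = new TreeMap<>();
        for (String entry : dictionary) {
            features.put(prefix + entry, 0); // every entry gets a feature, even if absent
        }
        for (String token : tokens) {
            String key = token.toLowerCase(Locale.ROOT);
            if (dictionary.contains(key)) {
                features.merge(prefix + key, 1, Integer::sum);
            }
        }
        return features;
    }

    public static void main(String[] args) {
        Set<String> modals = Set.of("can", "must", "should");
        List<String> tokens = List.of("You", "must", "go", "and", "you", "must", "hurry");
        System.out.println(extract("Modal_", modals, tokens));
        // prints {Modal_can=0, Modal_must=2, Modal_should=0}
    }
}
```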

Reported by oliver.ferschke on 2013-11-10 13:30:31

daxenberger commented 9 years ago
It also depends on what aspects of modal verbs you want to capture:

count in document?
text to modal verb ratio?
simple presence of any modal verbs in the text?
other?

These scenarios could also be implemented in a generic way for all use cases based
on dictionaries...
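
The first three scenarios could be sketched generically as follows. This is plain Java with hypothetical names (`DictionaryDocumentFeatures`, `count`, `ratio`, `present`), and the ratio is assumed here to mean dictionary hits divided by the total number of tokens.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: three document-level features over one dictionary.
final class DictionaryDocumentFeatures {

    // Number of tokens whose lowercased form is a dictionary entry.
    static int count(List<String> tokens, Set<String> dictionary) {
        int hits = 0;
        for (String token : tokens) {
            if (dictionary.contains(token.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        return hits;
    }

    // Share of tokens that are dictionary entries (assumed definition); 0.0 for an empty document.
    static double ratio(List<String> tokens, Set<String> dictionary) {
        return tokens.isEmpty() ? 0.0 : (double) count(tokens, dictionary) / tokens.size();
    }

    // True if at least one token is a dictionary entry.
    static boolean present(List<String> tokens, Set<String> dictionary) {
        return count(tokens, dictionary) > 0;
    }

    public static void main(String[] args) {
        Set<String> modals = Set.of("can", "must", "should");
        List<String> tokens = List.of("You", "must", "leave", "now");
        System.out.println(count(tokens, modals));   // prints 1
        System.out.println(ratio(tokens, modals));   // prints 0.25
        System.out.println(present(tokens, modals)); // prints true
    }
}
```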

Reported by oliver.ferschke on 2013-11-10 13:34:49

daxenberger commented 9 years ago
yes, I think it would be something *similar* to that:
the generalized version - a DictionaryFeatureExtractor - would count occurrences of
items from a Dictionary (e.g. wordlist) AND it would aggregate these counts:

so in the modal verbs example, not only the individual modal verbs (e.g. must, should)
are counted, but also "modals"

so it's more like a "DictionaryWordClassFeatureExtractor"
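
A possible shape for such an extractor, sketched in plain Java. All names are hypothetical, and the feature naming (class name for the aggregate, class name + "_" + word for the individual items) is just one assumed convention.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch of a "DictionaryWordClassFeatureExtractor" in plain Java.
final class WordClassFeatures {

    // For each word class, emits an aggregate count under the class name and
    // a per-item count under className + "_" + word for every word that occurs.
    // Assumes the word lists contain lowercase entries.
    static Map<String, Integer> extract(Map<String, Set<String>> wordClasses, List<String> tokens) {
        Map<String, Integer> features = new TreeMap<>();
        for (String className : wordClasses.keySet()) {
            features.put(className, 0); // aggregate feature exists even without hits
        }
        for (String token : tokens) {
            String key = token.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, Set<String>> wordClass : wordClasses.entrySet()) {
                if (wordClass.getValue().contains(key)) {
                    features.merge(wordClass.getKey(), 1, Integer::sum);
                    features.merge(wordClass.getKey() + "_" + key, 1, Integer::sum);
                }
            }
        }
        return features;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> classes = Map.of("modals", Set.of("can", "must", "should"));
        List<String> tokens = List.of("You", "must", "go", ",", "you", "should", "go", ",", "you", "must");
        System.out.println(extract(classes, tokens));
        // prints {modals=3, modals_must=2, modals_should=1}
    }
}
```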

Reported by eckle.kohler on 2013-11-10 13:35:24

daxenberger commented 9 years ago
Another option would be to rely on POS tagging instead of a dictionary.
STTS has specific categories for modals.
I see two advantages: (i) you don't need to add all forms, and (ii) you don't wrongly
count surface forms that are not used as a modal in a certain context.
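
A minimal sketch of the POS-based variant, assuming the tokens have already been tagged with STTS by some upstream tagger. The STTS modal tags are VMFIN, VMINF and VMPP; the classes `SttsModalCounter` and `TaggedToken` are hypothetical and do not correspond to DKPro types.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch: counts modal verbs by STTS tag instead of by surface form.
final class SttsModalCounter {

    // STTS tags for modal verbs: finite, infinitive, past participle.
    static final Set<String> STTS_MODAL_TAGS = Set.of("VMFIN", "VMINF", "VMPP");

    // Minimal stand-in for a tagged token; the tag is assumed to come from a POS tagger.
    static final class TaggedToken {
        final String text;
        final String pos;

        TaggedToken(String text, String pos) {
            this.text = text;
            this.pos = pos;
        }
    }

    // Counts tokens whose POS tag is one of the STTS modal tags.
    static long countModals(List<TaggedToken> tokens) {
        return tokens.stream().filter(t -> STTS_MODAL_TAGS.contains(t.pos)).count();
    }

    public static void main(String[] args) {
        // "Er muss gehen ." - "muss" is a finite modal verb.
        List<TaggedToken> modalUse = List.of(
                new TaggedToken("Er", "PPER"), new TaggedToken("muss", "VMFIN"),
                new TaggedToken("gehen", "VVINF"), new TaggedToken(".", "$."));
        // "Das ist ein Muss ." - "Muss" is a noun here and should not be counted.
        List<TaggedToken> nounUse = List.of(
                new TaggedToken("Das", "PDS"), new TaggedToken("ist", "VAFIN"),
                new TaggedToken("ein", "ART"), new TaggedToken("Muss", "NN"),
                new TaggedToken(".", "$."));
        System.out.println(countModals(modalUse)); // prints 1
        System.out.println(countModals(nounUse));  // prints 0
    }
}
```

A case-insensitive wordlist lookup would count "Muss" in the second sentence, which is the kind of false hit advantage (ii) refers to.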

Reported by torsten.zesch on 2013-11-10 17:19:57

daxenberger commented 9 years ago
>>Another option would be to rely on POS tagging instead of a dictionary.
>>STTS has specific categories for modals.

I agree that this would in theory be preferable to word forms - however, only if
the POS tagger is able to tag modal verbs accurately. This would have to be looked
into. From my past experience with the STTS tagset / TreeTagger, I recall that some
of these smaller word classes are tagged wrongly and therefore counting the lexical
items was less noisy.

Reported by eckle.kohler on 2013-11-11 06:35:35

daxenberger commented 9 years ago
@Judith: do you still have plans to solve this issue?

Reported by daxenberger.j on 2014-06-04 11:55:05

daxenberger commented 9 years ago
yes, but not now - I need to first get an overview of the current state of TC which
I will do after the upcoming release

Can you move it to the milestone after the upcoming release, please?

Reported by eckle.kohler on 2014-06-04 12:08:15

daxenberger commented 9 years ago

Reported by daxenberger.j on 2014-06-04 12:32:36

daxenberger commented 9 years ago

Reported by daxenberger.j on 2014-08-29 10:50:13

daxenberger commented 9 years ago
In order to determine word difficulty, I added some functions to determine adjective
endings, help verbs, modal verbs and auxiliary words for English, German and French
to de.tudarmstadt.ukp.dkpro.tc.features.readability.util. I then noticed the AdjectiveEndingFeatureExtractor
and the ModalVerbsFeatureExtractor for English and this discussion.  
I also added WordListExtractors that check if a word occurs in a list. 

In both cases, I am not yet very happy with the solution, but maybe they can revive
this discussion to generalize FEs to other languages.  

Reported by lisa.beinborn on 2015-03-12 11:29:53

Horsmann commented 8 years ago

This issue was reported in 2013 and no one seems to care about it anymore - I am closing this one.