Open GoogleCodeExporter opened 9 years ago
Wouldn't that be something like a DictionaryFeatureExtractor?
Original comment by richard.eckart
on 10 Nov 2013 at 1:26
There's the
de.tudarmstadt.ukp.dkpro.tc.features.content.TopicWordsFeatureExtractor
which is basically a simple dictionary feature extractor (and should be
renamed accordingly, btw).
It adds a hard-coded prefix to each feature. I suggest we make this prefix
configurable, so you could set it to "Modal_" and then easily identify Modal
features later on.
The mechanics of the extractor are pretty much the same for these cases.
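As a rough illustration of the configurable prefix, here is a minimal stand-alone sketch in plain Java. The class `PrefixedDictionaryCounter` and its method names are hypothetical and chosen only for this example. It does not use the DKPro TC feature extractor API and does not reproduce TopicWordsFeatureExtractor.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-alone class, not a DKPro TC feature extractor.
class PrefixedDictionaryCounter {
    private final String prefix;          // configurable, e.g. "Modal_"
    private final Set<String> dictionary; // entries are expected in lower case

    PrefixedDictionaryCounter(String prefix, Set<String> dictionary) {
        this.prefix = prefix;
        this.dictionary = dictionary;
    }

    // One feature per dictionary entry, named prefix + entry, holding its token count.
    Map<String, Integer> extract(List<String> tokens) {
        Map<String, Integer> features = new TreeMap<>();
        for (String entry : dictionary) {
            features.put(prefix + entry, 0);
        }
        for (String token : tokens) {
            String key = token.toLowerCase(Locale.ROOT);
            if (dictionary.contains(key)) {
                features.merge(prefix + key, 1, Integer::sum);
            }
        }
        return features;
    }

    public static void main(String[] args) {
        Set<String> modals = new HashSet<>(Arrays.asList("must", "should"));
        PrefixedDictionaryCounter counter = new PrefixedDictionaryCounter("Modal_", modals);
        System.out.println(counter.extract(Arrays.asList("You", "must", "go", "now")));
    }
}
```

With the prefix passed in from outside, the same class could serve a second word list (say, with the prefix "Aux_") without touching the counting logic.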
Original comment by oliver.ferschke
on 10 Nov 2013 at 1:30
It also depends on what aspects of modal verbs you want to capture:
count in document?
text to modal verb ratio?
simple presence of any modal verbs in the text?
other?
These scenarios could also be implemented in a generic way for all use cases
based on dictionaries...
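The three scenarios listed above can be sketched generically over any dictionary. This is plain Java with hypothetical names (`DictionaryAggregates`, `count`, `ratio`, `present`), not DKPro TC code. Note the assumption in `ratio`: it is read here as dictionary hits divided by all tokens, which is only one possible reading of "text to modal verb ratio".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class for illustration only.
class DictionaryAggregates {

    // Number of tokens that match a (lower-cased) dictionary entry.
    static int count(List<String> tokens, Set<String> dictionary) {
        int n = 0;
        for (String token : tokens) {
            if (dictionary.contains(token.toLowerCase(Locale.ROOT))) {
                n++;
            }
        }
        return n;
    }

    // Assumption: ratio = dictionary hits / all tokens; 0.0 for an empty text.
    static double ratio(List<String> tokens, Set<String> dictionary) {
        return tokens.isEmpty() ? 0.0 : (double) count(tokens, dictionary) / tokens.size();
    }

    // True if at least one token matches a dictionary entry.
    static boolean present(List<String> tokens, Set<String> dictionary) {
        return count(tokens, dictionary) > 0;
    }

    public static void main(String[] args) {
        Set<String> modals = new HashSet<>(Arrays.asList("must", "should"));
        List<String> tokens = Arrays.asList("We", "should", "leave", "now");
        System.out.println(count(tokens, modals));
        System.out.println(ratio(tokens, modals));
        System.out.println(present(tokens, modals));
    }
}
```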
Original comment by oliver.ferschke
on 10 Nov 2013 at 1:34
Yes, I think it would be something *similar* to that:
the generalized version - a DictionaryFeatureExtractor - would count
occurrences of items from a Dictionary (e.g. wordlist) AND it would aggregate
these counts:
so in the modal verbs example, not only the individual modal verbs (e.g. must,
should) are counted, but also "modals"
so it's more like a "DictionaryWordClassFeatureExtractor"
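To make the aggregation idea concrete, here is a minimal sketch in plain Java that emits one count per dictionary item plus one summed count per word class. `WordClassCounter` and the feature naming scheme (`<class>_<item>` for items, `<class>` for the total) are assumptions made for this example, not an existing DKPro TC class or convention.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical class for illustration only.
class WordClassCounter {

    // wordClasses maps a class name (e.g. "modals") to its lower-cased word list.
    static Map<String, Integer> extract(List<String> tokens,
                                        Map<String, Set<String>> wordClasses) {
        Map<String, Integer> features = new TreeMap<>();
        for (Map.Entry<String, Set<String>> wordClass : wordClasses.entrySet()) {
            int classTotal = 0;
            for (String entry : wordClass.getValue()) {
                int n = 0;
                for (String token : tokens) {
                    if (token.toLowerCase(Locale.ROOT).equals(entry)) {
                        n++;
                    }
                }
                // Individual item count, e.g. "modals_must".
                features.put(wordClass.getKey() + "_" + entry, n);
                classTotal += n;
            }
            // Aggregated count for the whole class, e.g. "modals".
            features.put(wordClass.getKey(), classTotal);
        }
        return features;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> classes = new HashMap<>();
        classes.put("modals", new HashSet<>(Arrays.asList("must", "should")));
        List<String> tokens = Arrays.asList("You", "must", "go", "and", "you", "should", "hurry");
        System.out.println(extract(tokens, classes));
    }
}
```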
Original comment by eckle.kohler
on 10 Nov 2013 at 1:35
Another option would be to rely on POS tagging instead of a dictionary.
STTS has specific categories for modals.
I see two advantages: (i) you don't need to add all forms, and (ii) you don't
wrongly count surface forms that are not used as a modal in a certain context.
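A small sketch of the difference, in plain Java: the STTS modal tags are VMFIN, VMINF and VMPP, so a tag-based counter only needs those three tags, while a dictionary-based counter also hits the noun "Muss". The class `PosModalCounter` is hypothetical, and the POS tags in the example are assigned by hand to stand in for tagger output. No real tagger or DKPro component is called here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical class for illustration only.
class PosModalCounter {
    // STTS tags for modal verbs (finite, infinitive, past participle).
    static final Set<String> STTS_MODAL_TAGS =
            new HashSet<>(Arrays.asList("VMFIN", "VMINF", "VMPP"));

    // Each element of tagged is a {token, posTag} pair.
    static int countByTag(List<String[]> tagged) {
        int n = 0;
        for (String[] pair : tagged) {
            if (STTS_MODAL_TAGS.contains(pair[1])) {
                n++;
            }
        }
        return n;
    }

    // Dictionary lookup on the lower-cased surface form, ignoring the tag.
    static int countBySurface(List<String[]> tagged, Set<String> dictionary) {
        int n = 0;
        for (String[] pair : tagged) {
            if (dictionary.contains(pair[0].toLowerCase(Locale.ROOT))) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // "Das ist ein Muss" / "Er muss gehen", tags assigned by hand.
        List<String[]> tagged = Arrays.asList(
                new String[] {"Das", "PDS"}, new String[] {"ist", "VAFIN"},
                new String[] {"ein", "ART"}, new String[] {"Muss", "NN"},
                new String[] {"Er", "PPER"}, new String[] {"muss", "VMFIN"},
                new String[] {"gehen", "VVINF"});
        Set<String> dictionary = new HashSet<>(Arrays.asList("muss"));
        System.out.println(countBySurface(tagged, dictionary));
        System.out.println(countByTag(tagged));
    }
}
```

In this example the surface count includes the noun "Muss", whereas the tag-based count does not. That only holds if the tagger gets the tags right.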
Original comment by torsten....@gmail.com
on 10 Nov 2013 at 5:19
>>Another option would be to rely on POS tagging instead of a dictionary.
>>STTS has specific categories for modals.
I agree that this would in theory be preferable to word forms, but only if
the POS tagger tags modal verbs accurately. This would have to be looked
into. From my past experience with the STTS tagset / TreeTagger, I recall
that some of these smaller word classes were tagged wrongly, so counting
the lexical items was less noisy.
Original comment by eckle.kohler
on 11 Nov 2013 at 6:35
@Judith: do you still have plans to solve this issue?
Original comment by daxenber...@gmail.com
on 4 Jun 2014 at 11:55
Yes, but not now. I first need to get an overview of the current state of TC,
which I will do after the upcoming release.
Can you move it to the milestone after the upcoming release, please?
Original comment by eckle.kohler
on 4 Jun 2014 at 12:08
Original comment by daxenber...@gmail.com
on 4 Jun 2014 at 12:32
Original comment by daxenber...@gmail.com
on 29 Aug 2014 at 10:50
In order to determine word difficulty, I added some functions to determine
adjective endings, help verbs, modal verbs and auxiliary words for English,
German and French to de.tudarmstadt.ukp.dkpro.tc.features.readability.util. I
then noticed the AdjectiveEndingFeatureExtractor and the
ModalVerbsFeatureExtractor for English and this discussion.
I also added WordListExtractors that check if a word occurs in a list.
In both cases, I am not yet very happy with the solution, but maybe they can
revive this discussion to generalize FEs to other languages.
Original comment by lisa.bei...@gmail.com
on 12 Mar 2015 at 11:29
Original issue reported on code.google.com by
eckle.kohler
on 10 Nov 2013 at 1:17