AnantLabs / dkpro-tc

Automatically exported from code.google.com/p/dkpro-tc

Potential severe problem with ngram meta collectors and extractors #207

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
To ensure that the same ngram features are always generated, we use 
meta collectors that collect the frequencies of all ngrams and then select the 
top k as features.

The ngram annotators allow selecting a range of ngram levels, e.g. uni-, bi- and 
trigrams.
As all ngrams are stored in the same frequency list, the top of the list will 
always be biased towards lower-order ngrams.
I wonder whether we actually want that.

Example: if "kick the bucket" appears 20 times in the corpus, then by 
definition "kick", "the", "bucket", "kick the" and "the bucket" must also 
appear at least 20 times.
Thus, the frequency list will always be dominated by lower-order ngrams.
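
To make the effect concrete, here is a minimal sketch in Python (not the project's Java, and not the actual DKPro TC meta collector code). The function names `ngrams` and `pooled_counts` and the toy corpus are made up for illustration; the only thing taken from the description above is that all ngram levels share one frequency list from which the top k are selected.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def pooled_counts(documents, min_n=1, max_n=3):
    """Count ngrams of every level in ONE shared frequency list."""
    counts = Counter()
    for tokens in documents:
        for n in range(min_n, max_n + 1):
            counts.update(ngrams(tokens, n))
    return counts


# Hypothetical toy corpus: the trigram occurs 20 times, plus some filler.
documents = (
    [["kick", "the", "bucket"]] * 20
    + [["the", "ball"]] * 5
    + [["kick", "it"]] * 3
    + [["bucket", "list"]] * 2
)

counts = pooled_counts(documents)
trigram = ("kick", "the", "bucket")

# Every ngram contained in the trigram occurs at least as often as the trigram.
parts = [("kick",), ("the",), ("bucket",), ("kick", "the"), ("the", "bucket")]
dominated = all(counts[part] >= counts[trigram] for part in parts)

# Selecting the top k from the pooled list picks only unigrams here.
top_k = [gram for gram, _ in counts.most_common(3)]
```

With this corpus the three most frequent entries are all unigrams, even though the trigram itself occurs 20 times.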

Suggestion:
We should probably keep separate frequency lists for different ngram levels and 
then take k/n from each list.
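
A rough sketch of what that could look like, again in Python and not based on the existing ngram classes. It assumes that n in "k/n" means the number of ngram levels and that k is split by integer division with any remainder dropped; both are my reading, not settled behaviour. The names `per_level_counts` and `select_top_k_per_level` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def per_level_counts(documents, min_n=1, max_n=3):
    """Keep a separate frequency list (Counter) for each ngram level."""
    levels = {n: Counter() for n in range(min_n, max_n + 1)}
    for tokens in documents:
        for n, counter in levels.items():
            counter.update(ngrams(tokens, n))
    return levels


def select_top_k_per_level(documents, k, min_n=1, max_n=3):
    """Take the top k // (number of levels) ngrams from each level's list.

    Assumption: the budget k is split evenly and the remainder is dropped.
    """
    levels = per_level_counts(documents, min_n, max_n)
    per_level = k // len(levels)
    selected = []
    for n in sorted(levels):
        selected.extend(gram for gram, _ in levels[n].most_common(per_level))
    return selected


# Hypothetical toy corpus, same shape as in the example above.
documents = (
    [["kick", "the", "bucket"]] * 20
    + [["the", "ball"]] * 5
    + [["kick", "it"]] * 3
    + [["bucket", "list"]] * 2
)

selected = select_top_k_per_level(documents, k=3)
```

Here each level contributes one feature, so the trigram is selected regardless of how frequent the unigrams are.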

Danger:
This will change almost all experimental results generated so far.

Original issue reported on code.google.com by torsten....@gmail.com on 29 Oct 2014 at 3:25

GoogleCodeExporter commented 9 years ago
Are you suggesting that if a 1-gram is part of a 2-gram, then the 1-gram should 
not be listed at all?

Original comment by richard.eckart on 29 Oct 2014 at 3:38

GoogleCodeExporter commented 9 years ago
No.
I am suggesting that, due to the strong dependencies between counts, counts 
should be kept independently for each ngram level.

Otherwise, if you want e.g. 100 trigrams in your features, you will also have 
something like 800 bigrams and 5000 unigrams as features, which is not what 
most people would expect here.

Original comment by torsten....@gmail.com on 29 Oct 2014 at 3:43

GoogleCodeExporter commented 9 years ago
This is an understood design feature. I think the idea was that higher-N ngrams 
are so sparse that we don't want them to occupy feature space unless they are 
actually frequent enough to compete with lower-N ngrams. If someone wants lots 
of trigrams, they can constrain to minN=3, maxN=3.

Of course, it's always nice to have more control.

We could (1) rewrite all the ngram classes, or (2) tackle Issue 39, allowing 
multiple copies of the same FE in a single experiment, which would also solve 
the problem and actually permit much finer control (and would not change 
previous experiments, and would not risk overengineering the ngram FE classes). 
I vote for #2.

Original comment by EmilyKJa...@gmail.com on 29 Oct 2014 at 4:55

GoogleCodeExporter commented 9 years ago
I agree with Emily's proposal: solution #2 would allow what Torsten suggested 
while keeping old experiments reproducible.

Original comment by daxenber...@gmail.com on 29 Oct 2014 at 5:23

GoogleCodeExporter commented 9 years ago

Original comment by daxenber...@gmail.com on 11 Dec 2014 at 3:48