AnantLabs / dkpro-tc

Automatically exported from code.google.com/p/dkpro-tc

Inconsistent features (training vs. test) with sparse features #210

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
Currently, the FeatureStore implementations (SimpleFeatureStore and 
SparseFeatureStore) create their feature space (the set of feature names) 
dynamically. When an instance is added, all of the instance's features are added 
to the feature space. At the end, the feature store outputs all feature names 
(sorted), and these are then written to the feature vector files (for Weka, SVM, 
etc.).

If the instance's features are dense (i.e., if a feature is present, emit its 
value; if it is absent, emit zero), this works fine. But in a scenario with 
sparse features, the feature extractors might only want to emit the features that 
are present and ignore all others.

Unfortunately, this happens in two places independently. During training, one 
FeatureStore instance is created and all training instances are added to it; at 
test time, another FeatureStore instance is created and the test instances are 
added to that one. BUT if the test instances lack certain features, or contain 
unseen ones, the feature space in the test feature store differs from the feature 
space of the training phase. The problem is that FeatureStore.getFeatureNames() 
then produces two different results.

I think this is a conceptual issue, because it effectively forces one to use 
dense features, even where they are naturally sparse, just to keep the feature 
vector mapping consistent.

So what would be the best way to transfer the feature names from the training 
feature store to the test feature store?
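
To make the mismatch concrete, here is a minimal toy sketch in plain Java. It is 
deliberately NOT the real FeatureStore API, just an illustration of how two 
dynamically built stores end up with different feature spaces:

    import java.util.Map;
    import java.util.SortedSet;
    import java.util.TreeSet;

    // Toy stand-in for a dynamically growing feature store (not the DKPro TC classes).
    class ToyFeatureStore {
        private final SortedSet<String> featureNames = new TreeSet<>();

        void addInstance(Map<String, Double> sparseFeatures) {
            // the feature space grows with every instance that is added
            featureNames.addAll(sparseFeatures.keySet());
        }

        SortedSet<String> getFeatureNames() {
            return featureNames;
        }
    }

    class FeatureSpaceMismatchDemo {
        public static void main(String[] args) {
            ToyFeatureStore train = new ToyFeatureStore();
            train.addInstance(Map.of("ngram_cat", 1.0, "ngram_dog", 1.0));

            ToyFeatureStore test = new ToyFeatureStore();
            test.addInstance(Map.of("ngram_cat", 1.0, "ngram_bird", 1.0));

            // [ngram_cat, ngram_dog] vs. [ngram_bird, ngram_cat]: the sorted name lists
            // (and hence the columns of the written feature vectors) no longer line up.
            System.out.println(train.getFeatureNames());
            System.out.println(test.getFeatureNames());
        }
    }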

Original issue reported on code.google.com by ivan.hab...@gmail.com on 3 Nov 2014 at 11:11

GoogleCodeExporter commented 9 years ago
We have a similar issue with class labels in the dense FeatureStore setting. 
Class labels are collected dynamically during training and testing, and hence the 
label sets might differ at prediction time. This causes trouble in e.g. Weka. We 
therefore manually "balance" the train and test class labels before prediction, 
taking the training data labels as the reference: class labels not found in the 
training data are deleted from the test data, and labels found in the training 
data but not in the test data are added to the test data. Maybe that's the way to 
go in the sparse FeatureStore case as well.
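
A rough sketch of that balancing step in plain Java (not the actual Weka/DKPro TC 
code); the training labels serve as the reference set:

    import java.util.Set;
    import java.util.SortedSet;
    import java.util.TreeSet;

    // Sketch of class label "balancing": align the test label set with the training labels.
    class LabelBalancer {
        static SortedSet<String> balancedTestLabels(Set<String> trainLabels, Set<String> testLabels) {
            SortedSet<String> balanced = new TreeSet<>(testLabels);
            balanced.retainAll(trainLabels); // labels not found in the training data are deleted
            balanced.addAll(trainLabels);    // labels found in training but not in test data are added
            return balanced;                 // ends up equal to the training labels, which is the point;
                                             // the real code also has to rewrite the ARFF class attribute
        }
    }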

Original comment by daxenber...@gmail.com on 3 Nov 2014 at 11:28

GoogleCodeExporter commented 9 years ago
Well, this sounds like stratification; but if stratification is not desired, I 
think "balancing" the data labels is a workaround, and I'm not sure it is 
actually correct. You train the classifier with whatever data (features + labels) 
you have, and then the same features + labels should be used on the test data, 
regardless of what is there.

Balancing the data using their features seems very non-standard... The intuition 
is that the features should generalize well, but I cannot assume anything about 
the test data, e.g. in a cross-domain setting. So I can only use the features 
that I have seen during training and must ignore the others (in case they are 
generated dynamically). This means the train and test tasks must somehow depend 
on each other.

Original comment by ivan.hab...@gmail.com on 3 Nov 2014 at 11:45

GoogleCodeExporter commented 9 years ago
With regard to class label "balancing" (maybe this term is not the best), we do 
exactly that: we learn from the training data whatever we can, and apply that to 
the test data. Labels unseen during training cannot be predicted, and at the same 
time we cannot prevent the classifier from predicting valid labels that were seen 
in the training data but are not part of the true test data labels.
How to model this in the feature set might be a different question. But as long 
as we are doing supervised learning, we can only learn from the training data.

Original comment by daxenber...@gmail.com on 3 Nov 2014 at 11:56

GoogleCodeExporter commented 9 years ago
We could do feature name balancing, but this would also mean dropping some extra 
features that the test phase might have introduced.
If this is not a problem, the training task should write an additional file with 
the feature names. The test task could then load it, add the missing features, 
and drop the additional ones.
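
A minimal sketch of that hand-over, one feature name per line; the file name 
"feature.names" and the directory layout are only illustrative here:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Set;
    import java.util.SortedSet;
    import java.util.TreeSet;

    // Sketch: the training task persists its feature space, the test task loads it again.
    class FeatureNameHandover {

        // training side: write the (sorted) feature names next to the other training outputs
        static void writeFeatureNames(Set<String> featureNames, Path trainOutputDir) throws IOException {
            Files.write(trainOutputDir.resolve("feature.names"), new TreeSet<>(featureNames));
        }

        // test side: load the training feature space as the reference for adding
        // missing features and dropping additional ones
        static SortedSet<String> readFeatureNames(Path trainOutputDir) throws IOException {
            return new TreeSet<>(Files.readAllLines(trainOutputDir.resolve("feature.names")));
        }
    }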

Original comment by torsten....@gmail.com on 3 Nov 2014 at 3:29

GoogleCodeExporter commented 9 years ago
> If this is not a problem, the training task should write an additional file 
with the feature names. The test task could then load it, add the missing 
features, and drop the additional ones.

I see it the same way. Not sure what the best way is to hack this into TC, though.

> dropping some extra features that the test phase might have introduced

Not sure if I get it right, but isn't this usually the case anyway? I.e., if you 
do feature selection, you keep only the top k features (to train the classifier) 
and ignore the rest, even if they are in the data...

Original comment by ivan.hab...@gmail.com on 3 Nov 2014 at 3:36

GoogleCodeExporter commented 9 years ago
>> If this is not a problem, the training task should write an additional file 
with the feature names. The test task could then load it, add the missing 
features, and drop the additional ones.

> I see it the same way. Not sure what the best way is to hack this into TC, though.

It shouldn't be too hard if you know where to look :)
We should keep the issue open, maybe someone (e.g. me :) will find time to 
tackle this.
And we have TC hack day pretty soon ...

>> dropping some extra features that the test phase might have introduced

> Not sure if I get it right, but isn't this usually the case anyway? I.e., if 
you do feature selection, you keep only the top k features (to train the 
classifier) and ignore the rest, even if they are in the data...

I am not sure how feature selection is implemented in the different ML 
frameworks. I thought that Weka applies feature selection as a filter to both the 
training and the test files, so that the filtering is already taken care of.

At least without feature selection, having additional features in an ARFF file 
results in an error.
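
For reference, the Weka pattern alluded to above is batch filtering: the filter 
is configured on the training data only, and the same filter instance is then 
applied to both sets, so the attribute mapping stays consistent. A minimal sketch 
using the standard Weka API (independent of TC):

    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.supervised.attribute.AttributeSelection;

    // Batch filtering: learn the attribute selection on the training set, apply it to both sets.
    class BatchFeatureSelection {
        static Instances[] selectTopK(Instances train, Instances test, int k) throws Exception {
            AttributeSelection filter = new AttributeSelection();
            Ranker ranker = new Ranker();
            ranker.setNumToSelect(k);              // keep only the top k attributes
            filter.setEvaluator(new InfoGainAttributeEval());
            filter.setSearch(ranker);
            filter.setInputFormat(train);          // the filter learns its mapping from the training set
            Instances filteredTrain = Filter.useFilter(train, filter);
            Instances filteredTest = Filter.useFilter(test, filter); // same mapping applied to the test set
            return new Instances[] { filteredTrain, filteredTest };
        }
    }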

Original comment by torsten....@gmail.com on 3 Nov 2014 at 3:54

GoogleCodeExporter commented 9 years ago
Feature Selection is one way to "balance" train and test data (e.g. via Filters 
in Weka), but in cases where feature selection is not applied, you're in 
trouble. Anyways, manually adapting the features in the test data to those in 
the training data shouldn't be too hard.

Original comment by daxenber...@gmail.com on 3 Nov 2014 at 4:03

GoogleCodeExporter commented 9 years ago
> It shouldn't be too hard if you know where to look :)
> We should keep the issue open, maybe someone (e.g. me :) will find time to 
tackle this.
> And we have TC hack day pretty soon ...

Well, in that case I'd put a higher priority on it - it's kind of blocking me 
from fast prototyping of features; the only way to do that at the moment is to 
write a MetaCollector for each feature type, which is not very handy... Maybe you 
can point me to the place where this could be solved.

Original comment by ivan.hab...@gmail.com on 4 Nov 2014 at 1:30

GoogleCodeExporter commented 9 years ago
I've added this issue to the agenda of the hackday.

Original comment by daxenber...@gmail.com on 4 Nov 2014 at 1:41

GoogleCodeExporter commented 9 years ago
I have already added code that writes the feature names.

The code that "harmonizes" the feature spaces is a bit more complicated, but it 
could be written as a feature store filter, relatively independently of the TC 
inner workings:

input: feature store + list of feature names
output: feature store containing only the features in the list

@Ivan: if you could write such a filter, we can try to integrate it into your 
example as a prototype.
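
For illustration, a rough sketch of what such a filter boils down to, written 
against a toy representation (each instance as a sparse name->value map) rather 
than the real TC FeatureStore classes:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.TreeMap;

    // Keep only the feature names that were seen during training.
    class FeatureSpaceFilter {
        static List<Map<String, Double>> retainTrainingFeatures(List<Map<String, Double>> instances,
                                                                Set<String> trainingFeatureNames) {
            List<Map<String, Double>> filtered = new ArrayList<>();
            for (Map<String, Double> instance : instances) {
                Map<String, Double> kept = new TreeMap<>(instance);
                kept.keySet().retainAll(trainingFeatureNames); // drop features the model has never seen
                filtered.add(kept);
                // training features missing here need no explicit zero: in a sparse
                // representation, "absent" already means zero when the vectors are written
            }
            return filtered;
        }
    }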

Original comment by torsten....@gmail.com on 4 Nov 2014 at 3:32

GoogleCodeExporter commented 9 years ago
Thanks a lot, Torsten, I'll have a look at it!

Original comment by ivan.hab...@gmail.com on 4 Nov 2014 at 3:33

GoogleCodeExporter commented 9 years ago
@Torsten: OK, ExtractFeatureConnector.collectionProcessComplete() now writes the 
feature names to the xxx-Train-xxx/output/feature.names file in training mode. 
But how can I later access this file from ExtractFeatureConnector.initialize() in 
a test task? I have the UimaContext in that method, but no idea how to reach the 
data stored during the training task.

GoogleCodeExporter commented 9 years ago
Do you need it in initialize()?

I thought we could use the new filter capabilities that I have added to 
collectionProcessComplete() (as described in comment 10).

But we will need the task context in the connector, so this should be passed as a 
parameter. I will make the necessary changes.
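
One possible shape of that parameter hand-over, sketched with a uimaFIT-style 
configuration parameter; the class and parameter names here are made up, the real 
ExtractFeatureConnector would get an analogous parameter:

    import java.io.File;

    import org.apache.uima.UimaContext;
    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
    import org.apache.uima.fit.descriptor.ConfigurationParameter;
    import org.apache.uima.jcas.JCas;
    import org.apache.uima.resource.ResourceInitializationException;

    // Sketch only: the test-time connector receives the training output directory
    // as an ordinary configuration parameter.
    public class FeatureNameAwareConnector extends JCasAnnotator_ImplBase {

        public static final String PARAM_TRAIN_OUTPUT_DIR = "trainOutputDirectory";
        @ConfigurationParameter(name = PARAM_TRAIN_OUTPUT_DIR, mandatory = false)
        private File trainOutputDirectory;

        @Override
        public void initialize(UimaContext context) throws ResourceInitializationException {
            super.initialize(context);
            if (trainOutputDirectory != null) {
                // test mode: load trainOutputDirectory/feature.names here and either inject
                // it into the feature store or keep it for the filtering step at the end
            }
        }

        @Override
        public void process(JCas jcas) throws AnalysisEngineProcessException {
            // feature extraction as before
        }
    }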

Original comment by torsten....@gmail.com on 5 Nov 2014 at 8:55

GoogleCodeExporter commented 9 years ago
> Do you need it in initialize()?

The intention is to "inject" all feature names seen during training into the 
feature store used during testing. Otherwise, the feature space has to be created 
ad hoc while adding instances to the feature store. I tried that with a 
MetaCollector, which ensures that the same feature space is used for training and 
test, but it was too slow for sparse features.

> I thought we could use the new filter capabilities that I have added to 
collectionProcessComplete() (as described in comment 10).

This would be an option too. It means that the test feature store accepts 
instances with any features, and at the end it simply retains only the features 
known from training. In this case, one would again need the saved feature names 
from the training step. (And again, I have no idea how to make that "transfer" 
from training to test.)

Original comment by ivan.hab...@gmail.com on 5 Nov 2014 at 9:10

GoogleCodeExporter commented 9 years ago
>> I thought we could use the new filter capabilities that I have added to 
collectionProcessComplete() (as described in comment 10).

> This would be an option too. It means that the test feature store accepts 
instances with any features, and at the end it simply retains only the features 
known from training. In this case, one would again need the saved feature names 
from the training step. (And again, I have no idea how to make that "transfer" 
from training to test.)

This would be my preferred solution, as the feature extractors then just extract 
whatever they find, and we decide later which features to retain.

Regarding the "transfer" from train to test: I am currently looking into that ...

Original comment by torsten....@gmail.com on 5 Nov 2014 at 9:39

GoogleCodeExporter commented 9 years ago
Good; looking at your last commit (r1213), I can take it from here and finish 
it... thx!

Original comment by ivan.hab...@gmail.com on 5 Nov 2014 at 11:10

GoogleCodeExporter commented 9 years ago
This issue was updated by revision r1215.

Implemented the filter; extended the FeatureStore interface (with a check whether 
injecting feature names is allowed); updated the implementation in 
SparseFeatureStore; commented out one test in TwentyNewsgroupsDemoTest (will open 
a separate issue).
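
Roughly, the shape of that interface extension (illustrative only; the method 
names are guesses for the sake of the example, not necessarily those committed in 
r1215):

    import java.util.SortedSet;

    interface FeatureStoreSketch {

        SortedSet<String> getFeatureNames();

        // a dense store derives its feature space from the added instances and may refuse injection
        boolean isSettingFeatureNamesAllowed();

        // the sparse store accepts the training-time feature names so that getFeatureNames()
        // afterwards matches the training feature space
        void setFeatureNames(SortedSet<String> featureNames);
    }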

Original comment by ivan.hab...@gmail.com on 5 Nov 2014 at 1:14

GoogleCodeExporter commented 9 years ago
This seems to be solved now; it works in my application with a large and sparse 
feature space. The standard SimpleFeatureStore was unaffected; should other bugs 
turn up, new issues will be opened.

Original comment by ivan.hab...@gmail.com on 5 Nov 2014 at 1:53