We have a similar issue with class labels in the dense FeatureStore setting. Class labels are created dynamically during training and testing, and hence they might differ at prediction time. This causes trouble in e.g. Weka. We therefore manually "balance" train and test class labels before prediction. As a reference, we take the training data labels (i.e. class labels not found in the training data will be deleted from the test data, and labels found in the training data but not in the test data will be added to the test data). Maybe that's the way to go in the sparse FeatureStore case as well.
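A minimal sketch of that "balancing" step, with illustrative names (this is not the actual TC code, just the idea):

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class ClassLabelBalancer {

    /**
     * The training labels are the reference: class labels that occur only in
     * the test data are dropped (and reported), labels seen in training but
     * missing from the test data are added. The returned set is what both
     * the train and the test data should declare as class values.
     */
    public static Set<String> balance(Set<String> trainLabels, Set<String> testLabels) {
        Set<String> testOnly = new LinkedHashSet<>(testLabels);
        testOnly.removeAll(trainLabels);
        if (!testOnly.isEmpty()) {
            System.out.println("Dropping test-only class labels: " + testOnly);
        }
        // after dropping the test-only labels and adding the missing training
        // labels, the test label set equals the training label set
        return new LinkedHashSet<>(trainLabels);
    }
}
```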
Original comment by daxenber...@gmail.com
on 3 Nov 2014 at 11:28
Well, this sounds like stratification; but if stratification is not desired, I think "balancing" the data labels is a workaround, and I'm not sure it is actually correct. You train the classifier with whatever data (features + labels) you have, and then the same features + labels should be used on the test data, regardless of what's there.
Balancing the data using their features seems very non-standard... The intuition is that the features would generalize well, but I cannot assume anything about the test data, e.g. in a cross-domain setting. So then I can only use the features that I've seen during training and ignore the others (in case they are dynamically generated). So the train and test task must somehow be dependent.
Original comment by ivan.hab...@gmail.com
on 3 Nov 2014 at 11:45
With regard to class label "balancing" (maybe this term is not the best), we do exactly that: we learn from the training data whatever we can, and apply that to the test data. Labels unseen during training cannot be predicted, and at the same time we cannot prevent the classifier from predicting valid labels that were seen in the training data but are not part of the true test data labels.
How to model this in the feature set might be a different question. But as long as we are doing supervised learning, we can only learn from the training data.
Original comment by daxenber...@gmail.com
on 3 Nov 2014 at 11:56
We could do feature name balancing, but this would also mean dropping some extra features that the test phase might have introduced.
If this is not a problem, the training task should write an additional file with feature names. The test task could then load that and add missing features and drop additional ones.
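A rough sketch of that handshake, assuming a plain text file with one feature name per line (the class and method names are made up for illustration):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class FeatureNameFile {

    // training task: write one feature name per line
    public static void write(Set<String> featureNames, Path target) throws IOException {
        Files.write(target, featureNames, StandardCharsets.UTF_8);
    }

    // test task: load the reference feature names written during training
    public static Set<String> read(Path source) throws IOException {
        return new LinkedHashSet<>(Files.readAllLines(source, StandardCharsets.UTF_8));
    }

    // test task: keep only the training features, add missing ones with a default value
    public static Map<String, Object> harmonize(Map<String, Object> testFeatures,
            Set<String> trainFeatureNames, Object defaultValue) {
        Map<String, Object> harmonized = new LinkedHashMap<>();
        for (String name : trainFeatureNames) {
            harmonized.put(name, testFeatures.getOrDefault(name, defaultValue));
        }
        return harmonized; // features introduced only at test time are not copied
    }
}
```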
Original comment by torsten....@gmail.com
on 3 Nov 2014 at 3:29
> If this is not a problem, the training task should write an additional file
with feature names. The test task could then load that and add missing features
and drop additional ones.
I see it the same way. Not sure what's the best way to hack the TC regarding that.
> dropping some extra features that the test phase might have introduced.
Not sure if I get it right - this is usually the case, isn't it? I.e. if you do feature selection, you keep only the top k features (train the classifier) and ignore the rest anyway, even if it's in the data...
Original comment by ivan.hab...@gmail.com
on 3 Nov 2014 at 3:36
>> If this is not a problem, the training task should write an additional file
with feature names. The test task could then load that and add missing features
and drop additional ones.
>I see it the same way. Not sure what's the best way to hack the TC regarding that.
It shouldn't be too hard if you know where to look :)
We should keep the issue open, maybe someone (e.g. me :) will find time to
tackle this.
And we have TC hack day pretty soon ...
>> dropping some extra features that the test phase might have introduced.
>Not sure if I get it right - this is usually the case, isn't it? I.e. if you do feature selection, you keep only the top k features (train the classifier) and ignore the rest anyway, even if it's in the data...
I am not sure how feature selection is implemented in the different ML frameworks. I thought that Weka applies feature selection as a filter to the training and test files, so that the filtering is already done.
At least without feature selection, having additional features in an ARFF results in an error.
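For reference, a sketch of Weka's batch-filtering pattern for attribute selection: the filter is configured on the training data only and then applied to both sets, so train and test end up with the same attributes. File names and the number of selected features are illustrative, and this assumes the two ARFFs already share a compatible header - which is exactly what the harmonization discussed above would have to guarantee.

```java
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class BatchFeatureSelection {

    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff"); // illustrative file names
        Instances test = DataSource.read("test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        AttributeSelection filter = new AttributeSelection();
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(100);            // keep only the top k features
        filter.setEvaluator(new InfoGainAttributeEval());
        filter.setSearch(ranker);
        filter.setInputFormat(train);          // the selection is determined on the training data

        Instances filteredTrain = Filter.useFilter(train, filter);
        Instances filteredTest = Filter.useFilter(test, filter); // same attributes as the filtered train set
    }
}
```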
Original comment by torsten....@gmail.com
on 3 Nov 2014 at 3:54
Feature selection is one way to "balance" train and test data (e.g. via filters in Weka), but in cases where feature selection is not applied, you're in trouble. Anyway, manually adapting the features in the test data to those in the training data shouldn't be too hard.
Original comment by daxenber...@gmail.com
on 3 Nov 2014 at 4:03
> It shouldn't be too hard if you know where to look :)
> We should keep the issue open, maybe someone (e.g. me :) will find time to
tackle this.
> And we have TC hack day pretty soon ...
Well, in that case I'd give it a higher priority - it's kind of blocking me from fast-prototyping features; the only way to do that is to write a MetaCollector for each feature type, which is not very handy... Maybe you can point me to the place where this could be solved.
Original comment by ivan.hab...@gmail.com
on 4 Nov 2014 at 1:30
I've added this issue to the agenda of the hack day.
Original comment by daxenber...@gmail.com
on 4 Nov 2014 at 1:41
I have already added code that writes the feature names.
The code that "harmonizes" the feature spaces is a bit more complicated, but it could be written as a feature store filter, relatively independent of the TC inner workings:
input: FStore + list of feature names
output: FStore with only the features in the list
@Ivan: if you could write such a filter, we can try to integrate it into your example as a prototype.
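Something along these lines, perhaps (the Feature/Instance types below are simplified stand-ins, not the actual TC FeatureStore API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Simplified stand-ins for the TC types; the real FeatureStore/Instance
// interfaces differ, this only shows the shape of such a filter.
class Feature {
    String name;
    Object value;
    Feature(String name, Object value) { this.name = name; this.value = value; }
}

class Instance {
    List<Feature> features = new ArrayList<>();
    String outcome;
}

public class FeatureSpaceFilter {

    /**
     * input:  the feature store's instances + the feature names seen in training
     * output: the same instances, retaining only the features in that list
     */
    public static void retainOnly(List<Instance> featureStore, Set<String> trainFeatureNames) {
        for (Instance instance : featureStore) {
            instance.features.removeIf(f -> !trainFeatureNames.contains(f.name));
        }
    }
}
```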
Original comment by torsten....@gmail.com
on 4 Nov 2014 at 3:32
Thanks a lot, Torsten, I'll have a look at it!
Original comment by ivan.hab...@gmail.com
on 4 Nov 2014 at 3:33
@Torsten: Ok, now ExtractFeatureConnector.collectionProcessComplete() writes the feature names to the xxx-Train-xxx/output/feature.names file in training mode. But how can I later access this file from ExtractFeatureConnector.initialize() in a test task? I have the UimaContext in that method, but no idea how to reach the data stored during the training task.
Original comment by ivan.hab...@gmail.com
on 5 Nov 2014 at 8:46
Do you need it in initialize()?
I thought we could use the new filter capabilities that I have added to collection process complete (as described in comment 10).
But we will need the task context in the connector, so this should be a parameter. I will make the necessary changes.
Original comment by torsten....@gmail.com
on 5 Nov 2014 at 8:55
> Do you need it in initialize()?
The intention is to "inject" all known feature names into the feature store during testing, i.e. the ones that have been seen during training. Otherwise, the feature space must be created ad hoc when adding instances to the feature store. I tried that with a MetaCollector, which ensures the same feature space is used for training and testing, but this was too slow for sparse features.
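In other words, the test-time feature space would be fixed up front from the names seen during training, roughly like this illustrative sparse store (not the actual TC SparseFeatureStore):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only, not the actual TC SparseFeatureStore: the feature space
// is fixed up front from the names seen during training, so adding test
// instances never has to grow it ad hoc.
public class FixedSpaceSparseStore {

    private final Map<String, Integer> featureIndex = new LinkedHashMap<>();
    private final List<Map<Integer, Double>> instances = new ArrayList<>();

    public FixedSpaceSparseStore(Set<String> trainFeatureNames) {
        for (String name : trainFeatureNames) {
            featureIndex.put(name, featureIndex.size());
        }
    }

    /** Features whose names were not seen during training are silently ignored. */
    public void addInstance(Map<String, Double> features) {
        Map<Integer, Double> sparse = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : features.entrySet()) {
            Integer index = featureIndex.get(entry.getKey());
            if (index != null) {
                sparse.put(index, entry.getValue());
            }
        }
        instances.add(sparse);
    }
}
```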
>I thought we could use the new filter capabilities that I have added to
collection process complete (as described in comment 10)
This would be an option too, so it means: the test feature store accepts instances with any possible features, and at the end it just retains the features known from training. In this case, one would again need the saved features from the training step. (And again, no idea how to make the "transfer" from training to test.)
Original comment by ivan.hab...@gmail.com
on 5 Nov 2014 at 9:10
>>I thought we could use the new filter capabilities that I have added to
collection process complete (as described in comment 10)
>This would be an option too, so it means: the test feature store accepts instances with any possible features, and at the end it just retains the features known from training. In this case, one would again need the saved features from the training step. (And again, no idea how to make the "transfer" from training to test.)
This would be my preferred solution, as the feature extractors just extract whatever they find and we decide later which features to retain.
Regarding the "transfer" from train to test: I am currently looking into that
...
Original comment by torsten....@gmail.com
on 5 Nov 2014 at 9:39
Good, looking at your last commit (commit 1213), I can take it from here and finish it... thx!
Original comment by ivan.hab...@gmail.com
on 5 Nov 2014 at 11:10
This issue was updated by revision r1215.
Implemented the filter; extended the FeatureStore interface (checking whether injecting feature names is allowed); updated the implementation in SparseFeatureStore; commented out one test in TwentyNewsgroupsDemoTest (will open a separate issue).
Original comment by ivan.hab...@gmail.com
on 5 Nov 2014 at 1:14
Seems to be solved by now; it works in my application with a large and sparse feature space. The standard SimpleFeatureStore was unaffected; if other bugs show up, we can open new issues.
Original comment by ivan.hab...@gmail.com
on 5 Nov 2014 at 1:53
Original issue reported on code.google.com by
ivan.hab...@gmail.com
on 3 Nov 2014 at 11:11