See branch issue403 for the prototype.
Alright, sounds reasonable. Thanks for tackling this!
@daxenberger I had a look at the feature filter. As far as I can see, at the moment only one filter is provided in TC, for achieving a uniform distribution of the outcomes. Here are my thoughts on revisiting how the filters are used. Maybe you can comment on this matter :)
I was wondering if a feature filter should remove entire instances including all features. The way the feature filter can currently be used makes it necessary to write to disc whenever a filter is provided, since the filter might have to know all features/outcomes to perform its operation. This makes a streaming implementation difficult.
What I find better (at the moment) is if altering the data, i.e. removing instances or creating a uniform distribution (in short, bulk operations that need all the data), is performed on the user side: either prepare the input data in a uniform way, or annotate targets in a way that achieves a uniform distribution.
The feature filter should only filter single feature instances, either by name or by value. This limits how the feature filter can be used, but appears conceptually cleaner to me. Furthermore, it makes streaming easier to implement: I can iterate the filters and drop features by name or value while streaming the data and writing it to disc.
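For illustration, such on-the-fly filtering could look roughly like this (a sketch only; `InstanceWriter` is a made-up stand-in for whatever writes the classifier format, and I assume `getFeatures()` returns the mutable feature list):

```java
import java.util.Set;

import org.dkpro.tc.api.features.Instance;

public class StreamingNameFilterSketch {

    /** Made-up stand-in for a classifier-format writer. */
    public interface InstanceWriter {
        void write(Instance instance);
    }

    /**
     * Drops all features whose name is in namesToDrop and hands each
     * instance straight on to the writer, so nothing accumulates in
     * memory and no global pass over the data is required.
     */
    public static void filterAndWrite(Iterable<Instance> instances,
            Set<String> namesToDrop, InstanceWriter writer)
    {
        for (Instance instance : instances) {
            instance.getFeatures()
                    .removeIf(f -> namesToDrop.contains(f.getName()));
            writer.write(instance);
        }
    }
}
```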
Is anyone emotionally attached to the current filtering concept? Also, which machine learning classifiers use filtering? This is probably a Weka-targeted feature? The current functionality probably originates from the early TC days, with Weka being the gold standard for what TC should support?
You can find more information on why and how the filters were introduced in the first place in https://github.com/dkpro/dkpro-tc/issues/210. I guess, before changing anything here, you should first go through that thread. A uniform class/label distribution, which is a requirement for most Weka classifiers, could be achieved before or after creating the feature store. We could ask the user to make sure there is a uniform class distribution in training and test data (but mind that the user has no or little control over that in cross-validation settings. And the problem gets worse for multi-label scenarios, where we sometimes have a large and sparse label space (>100), so it might easily happen that some labels in the test data have not been seen during training). We could also balance classes right before classification in the TestTask.
Thanks.
> Uniform class/label distribution, which is a requirement for most Weka classifiers,

This is more something that the user should guarantee, otherwise the classifier will be biased; but Weka wouldn't care if it gets 100 instances from A and only 10 from B?
> but mind that the user has no or little control over that in cross-validation settings

Would using a filter that does that really make things better in this case? When the CV data is forcefully made uniform, some instances are just dropped. It would be more like CV minus arbitrarily many instances?
So, at the moment it is enforced that only feature_names and outcomes from training can occur in the testing phase?
This filtering/balancing is less problematic, I think. I am just bothered by the UniformityFilter, because it requires loading all data for counting (this could also be streamed) and then re-reading everything for selective writing with the desired distribution. If this kind of filter had its own interface, e.g. BulkFilter, I could distinguish it from a filter which says "drop all ngram_the features". The latter can easily be applied on the fly; it does not need to know anything about the overall distribution.
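Roughly the split I have in mind (a sketch only, not existing TC API; the names are just suggestions and the import paths may need adjusting):

```java
import org.dkpro.tc.api.features.FeatureStore;
import org.dkpro.tc.api.features.Instance;

// Can be applied on the fly while streaming: sees one instance at a
// time and never needs global knowledge, e.g. "drop all ngram_the features".
interface StreamingFilter {
    void apply(Instance instance);
}

// Needs the complete data to do its work, e.g. counting outcomes to
// enforce a uniform distribution; implies a counting pass plus a
// second, selective writing pass.
interface BulkFilter {
    void applyFilter(FeatureStore store);
}
```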
Ok, I had another look. Three things:
1) ExperimentTrainTest does not set the isTesting flag in the test task. Consequently, in ExtractFeaturesConnector, applyFeatureNameFilter() is never called in the collectionProcessComplete() method. This is probably a bug without consequences? I went back as far as 0.7.0; this flag has been missing for quite some time.
2) The name filtering is done by this snippet in the same connector class:
```java
AdaptTestToTrainingFeaturesFilter filter = new AdaptTestToTrainingFeaturesFilter();
// if feature space from training set and test set differs, apply the filter
// to keep only features seen during training
if (!trainFeatureNames.equals(featureStore.getFeatureNames())) {
    filter.setFeatureNames(trainFeatureNames);
    filter.applyFilter(featureStore);
}
```
The filter's applyFilter() implementation is:

```java
@Override
public void applyFilter(FeatureStore store)
{
    if (store.isSettingFeatureNamesAllowed()) {
        store.setFeatureNames(this.trainingFeatureNames);
    }
}
```
The problem, in particular for Weka, is that the feature store in use is the DenseFeatureStore, which unfortunately does not allow this operation. Hence, even without the issue mentioned before, this operation should never have had an effect?
3) It seems this feature-name thing is rather unimportant and Weka can cope with this issue by itself?
I am emotionally very attached to the current filtering concept. I also know that it is being used for some specific purposes in experiments that would not be easy to implement without filters.
-Torsten
So, currently my biggest problem is ensuring that only features which occurred during training are used in the testing phase. If I recall correctly, this is what TC is supposed to do, right?
I am talking about this code block here https://github.com/dkpro/dkpro-tc/blob/master/dkpro-tc-core/src/main/java/org/dkpro/tc/core/task/uima/ExtractFeaturesConnector.java#L188
And the implementation of the DenseFeatureStore, which does not allow setting feature names, shows that this functionality has never worked? (https://github.com/dkpro/dkpro-tc/blob/master/dkpro-tc-fstore-simple/src/main/java/org/dkpro/tc/fstore/simple/DenseFeatureStore.java#L159) Is this needed at all? It seems like it is not. The SparseFeatureStore does support setting feature names, but it is not used by Weka by default.
Since I am touching this matter anyway, I would fix this. The question is: do we need it? Do we really have to ensure that no unseen features occur during testing? I tend to think it is the classifier's problem to deal with unknown features. Seemingly, all our classifiers do that already. Or do you expect the classifier to perform better when this is ensured?
I have implemented streaming for all classifiers, writing directly into the classifier format. Weka is the only adapter which has to make a detour and first write everything to disc (json) to collect all the information the API requires.
@daxenberger Is there a way to use Weka without having to know the total number of instances in advance? The API seems to always want to know everything up front; this makes things a bit difficult when working with CASes, as I don't know what and how much information is still coming.
I am seriously considering abandoning the Weka API and creating the ARFF file by hand. I don't see much of a problem in doing things FIFO and creating the Weka header at the very end, when I have processed everything (and have all the information). This declaration-in-advance that the Weka API requires is a huge pain.
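To sketch what I mean (made-up class; only numeric attributes plus a nominal outcome, no escaping or sparse format):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Set;

public class FifoArffWriterSketch {

    private final Path dataTmp;
    private final BufferedWriter dataOut;
    private final Set<String> attributes = new LinkedHashSet<>();
    private final Set<String> outcomes = new LinkedHashSet<>();

    public FifoArffWriterSketch() throws IOException {
        // @data rows are streamed to a temporary file first
        dataTmp = Files.createTempFile("arff-data", ".tmp");
        dataOut = Files.newBufferedWriter(dataTmp, StandardCharsets.UTF_8);
    }

    /** One call per instance; no global knowledge is needed at this point. */
    public void writeRow(String row, Set<String> usedAttributes, String outcome)
            throws IOException {
        attributes.addAll(usedAttributes);
        outcomes.add(outcome);
        dataOut.write(row);
        dataOut.newLine();
    }

    /** At the very end all attributes/labels are known: header first, then data. */
    public void finish(Path target) throws IOException {
        dataOut.close();
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write("@relation dkpro-tc\n");
            for (String att : attributes) {
                out.write("@attribute " + att + " numeric\n");
            }
            out.write("@attribute outcome {" + String.join(",", outcomes) + "}\n");
            out.write("@data\n");
            for (String line : Files.readAllLines(dataTmp, StandardCharsets.UTF_8)) {
                out.write(line);
                out.newLine();
            }
        }
        Files.delete(dataTmp);
    }
}
```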
Nonetheless, @daxenberger, do you have any setups at hand which you could run on the issue403 branch to see if/how things go for actual setups? Correctness-wise, but also speed-wise.
> Is there a way to use Weka without having to know the total number of instances in advance?

You don't need to know the total number of instances in advance. There might be other pieces of necessary information - not sure. However, abandoning the Weka API altogether and creating ARFFs manually has the huge problem that we cannot really make sure that the DataWriter produces valid output (or at least throws useful exceptions), which is a serious drawback IMHO.
> do you have any setups at hand which you could use on the issue403 branch

I can try a few older things and see how it goes, will report back.
Maybe there is a way to check the validity of the written ARFF via Weka?
On the other hand, I guess that we understand the format well enough to write our subset of the format that is always valid.
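As for checking validity via Weka: simply re-reading the generated file with Weka's own ArffLoader should already catch malformed output. A minimal sketch:

```java
import java.io.File;
import java.io.IOException;

import weka.core.Instances;
import weka.core.converters.ArffLoader;

public class ArffSanityCheck {

    /**
     * Re-parses the manually written ARFF with Weka itself;
     * a malformed file makes getDataSet() throw.
     */
    public static Instances validate(File arff) throws IOException {
        ArffLoader loader = new ArffLoader();
        loader.setFile(arff);
        return loader.getDataSet();
    }
}
```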
@daxenberger @zesch Can someone make the ExperimentCrossValidation import the InitTask output folder? I can't solve the problem that the imports are not found.
The whole idea is to collect all outcomes already during the InitTask, while it iterates over the data anyway. This works for TrainTest, but cross-validation fails because it doesn't know the InitTask.
I gave up. I reverted the changes; it should build again.
I am not fully convinced that the ArffSaver can be re-initialized without overwriting the old file. The process is CAS-based, as far as I see. I would need to re-initialize the ArffSaver without deleting the already-written information. As far as I can tell, this is not supported.
You lost me. AFAIK ArffSaver is a built-in Weka class, so it should not be CAS-based. It is also not clear why you want to re-initialize. I thought the whole idea was about not using the Weka stuff but building a workaround...
> Can someone make the ExperimentCrossValidation import the InitTask output folder? I can't solve the problem that the imports are not found.

I don't think you can import them directly. But you could add a dimension for this - as we do for the list of files (l. 170 in ExperimentCrossValidation).
I don't really understand what such a dimension would look like.
@reckart I need to hack into a cross-validation setup so that a file created by an outer task is made available to one (or all) nested tasks. Is there some way to pass information through from the outer Lab task to the inner, nested ones?
I assume you have this structure? I thought that should work.
```
OuterBatchTask {
    SubTaskA { produces: X }
    InnerBatchTask {
        SubTaskB { import: SubTaskA.X }
    }
}
```
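In code, the wiring would look roughly like this (a sketch using DefaultBatchTask; the task bodies and the "X" key are placeholders):

```java
import org.dkpro.lab.engine.TaskContext;
import org.dkpro.lab.task.impl.DefaultBatchTask;
import org.dkpro.lab.task.impl.ExecutableTaskBase;

public class NestedImportSketch {

    public static DefaultBatchTask build() {
        // SubTaskA produces some output under the key "X"
        ExecutableTaskBase subTaskA = new ExecutableTaskBase() {
            @Override
            public void execute(TaskContext ctx) throws Exception {
                // ... write output to the context under the key "X"
            }
        };

        // SubTaskB, one nesting level deeper, imports that output
        ExecutableTaskBase subTaskB = new ExecutableTaskBase() {
            @Override
            public void execute(TaskContext ctx) throws Exception {
                // ... read the imported folder
            }
        };
        subTaskB.addImport(subTaskA, "X");

        DefaultBatchTask inner = new DefaultBatchTask();
        inner.addTask(subTaskB);

        DefaultBatchTask outer = new DefaultBatchTask();
        outer.addTask(subTaskA);
        outer.addTask(inner);
        return outer;
    }
}
```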
I think this is what I did. Take a look here please: https://github.com/dkpro/dkpro-tc/blob/163cdfee628a053d644262bf02bda19e7f62d5d9/dkpro-tc-ml/src/main/java/org/dkpro/tc/ml/ExperimentCrossValidation.java#L224
I added this line to import the outer init task into the inner extract-features task (analogously for the extract-test task).
When I run CV with this modification, I get an import-not-found exception:
```
Caused by: org.dkpro.lab.storage.UnresolvedImportException:
-Unable to resolve import of task [org.dkpro.tc.core.task.ExtractFeaturesTask-Train-TwentyNewsgroupsCV] pointing to [task-latest://org.dkpro.tc.core.task.InitTask-TwentyNewsgroupsCV/preprocessorOutputTrain]: Resolved context [InitTask-TwentyNewsgroupsCV-20170516155643982] not in scope [MetaInfoTask-TwentyNewsgroupsCV-20170516155647285]
-Unable to resolve import of task [org.dkpro.tc.core.task.ExtractFeaturesTask-Test-TwentyNewsgroupsCV] pointing to [task-latest://org.dkpro.tc.core.task.ExtractFeaturesTask-Train-TwentyNewsgroupsCV/output]; nested exception is org.dkpro.lab.storage.TaskContextNotFoundException: Task [org.dkpro.tc.core.task.ExtractFeaturesTask-Train-TwentyNewsgroupsCV] has never been executed.
-Unable to resolve import of task [org.dkpro.tc.ml.weka.task.WekaTestTask-TwentyNewsgroupsCV] pointing to [task-latest://org.dkpro.tc.core.task.ExtractFeaturesTask-Test-TwentyNewsgroupsCV/output]; nested exception is org.dkpro.lab.storage.TaskContextNotFoundException: Task [org.dkpro.tc.core.task.ExtractFeaturesTask-Test-TwentyNewsgroupsCV] has never been executed.; nested exception is org.dkpro.lab.storage.UnresolvedImportException: Unable to resolve import of task [org.dkpro.tc.core.task.ExtractFeaturesTask-Train-TwentyNewsgroupsCV] pointing to [task-latest://org.dkpro.tc.core.task.InitTask-TwentyNewsgroupsCV/preprocessorOutputTrain]: Resolved context [InitTask-TwentyNewsgroupsCV-20170516155643982] not in scope [MetaInfoTask-TwentyNewsgroupsCV-20170516155647285]
    at org.dkpro.lab.engine.impl.BatchTaskEngine.executeConfiguration(BatchTaskEngine.java:263)
    at org.dkpro.lab.engine.impl.BatchTaskEngine.run(BatchTaskEngine.java:133)
    at org.dkpro.lab.engine.impl.BatchTaskEngine.runNewExecution(BatchTaskEngine.java:341)
    at org.dkpro.lab.engine.impl.BatchTaskEngine.executeConfiguration(BatchTaskEngine.java:235)
    ... 5 more
Caused by: org.dkpro.lab.storage.UnresolvedImportException: Unable to resolve import of task [org.dkpro.tc.core.task.ExtractFeaturesTask-Train-TwentyNewsgroupsCV] pointing to [task-latest://org.dkpro.tc.core.task.InitTask-TwentyNewsgroupsCV/preprocessorOutputTrain]: Resolved context [InitTask-TwentyNewsgroupsCV-20170516155643982] not in scope [MetaInfoTask-TwentyNewsgroupsCV-20170516155647285]
    at org.dkpro.lab.engine.impl.BatchTaskEngine$ScopedTaskContext.resolve(BatchTaskEngine.java:563)
    at org.dkpro.lab.engine.impl.DefaultTaskContextFactory.resolveImports(DefaultTaskContextFactory.java:149)
    at org.dkpro.lab.engine.impl.DefaultTaskContextFactory.createContext(DefaultTaskContextFactory.java:103)
    at org.dkpro.lab.uima.engine.simple.SimpleExecutionEngine.run(SimpleExecutionEngine.java:77)
    at org.dkpro.lab.engine.impl.BatchTaskEngine.runNewExecution(BatchTaskEngine.java:341)
    at org.dkpro.lab.engine.impl.BatchTaskEngine.executeConfiguration(BatchTaskEngine.java:235)
    ... 8 more
```
I remember having encountered this issue a while ago: https://github.com/dkpro/dkpro-lab/issues/42 Maybe the fix does not apply here?
Why are there two MetaTask executions?
> Resolved context [InitTask-TwentyNewsgroupsCV-20170516155643982] not in scope [MetaInfoTask-TwentyNewsgroupsCV-20170516155647285]
Looks like you are using timestamps now - so 20170516155647285 is later than 20170516155643982 - I wonder why the import did not resolve to the later one...
@reckart @daxenberger The meta task is in the inner task. Each CV fold has its own one, as it seems, so each fold should execute a meta task. The issue Johannes linked looks like the problem I am running into here? I am not really sure how to proceed :/
I am not sure I have the time atm to look into this. Is there some kind of "check out and run" unit test which provokes this?
@reckart Yes. I prepared the setup that causes this error with the commit before this post.
Check out branch issue403 and run WekaTwentyNewsgroupsDemo (an example, not a JUnit test), located in dkpro-tc-examples/org.dkpro.tc.examples.single.document/WekaTwentyNewsgroupsDemo. It should fail quickly with the aforementioned import exception.
I think I found the problem in DKPro Lab. Cf. https://github.com/dkpro/dkpro-lab/issues/101.
Ok, this should work now.
I added a new task to the pipeline: CollectionTask - which is a MetaCollection in a sense, but that name is already in use. The collection task runs over all data, train and test, to collect the outcomes.
Other MLAs (machine learning adapters) perform more or less expensive operations to fish out the outcomes from the training files for mapping purposes. It seems overdue that this information is centrally provided. I am not very attached to the name of the task - if you have ideas for better names :) ...
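Conceptually, the connector behind the task does little more than this (a sketch; how the result is persisted to the output folder is omitted):

```java
import java.util.Set;
import java.util.TreeSet;

import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;
import org.dkpro.tc.api.type.TextClassificationOutcome;

/**
 * Runs over all CASes (train and test) and records every outcome that
 * occurs, so the complete label set is available centrally before
 * feature extraction starts.
 */
public class OutcomeCollectorSketch extends JCasAnnotator_ImplBase {

    private final Set<String> outcomes = new TreeSet<>();

    @Override
    public void process(JCas jcas) throws AnalysisEngineProcessException {
        for (TextClassificationOutcome o : JCasUtil.select(jcas,
                TextClassificationOutcome.class)) {
            outcomes.add(o.getOutcome());
        }
    }

    @Override
    public void collectionProcessComplete() throws AnalysisEngineProcessException {
        // ... write the collected outcomes to the task's output folder
    }
}
```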
Merged. Our Jenkins build succeeded.
@daxenberger can you test a bit with the current master? I think things look good from my side so far. Btw, are there any (larger) multi-label classification data sets you could give away? The multi-label part has only one test case at the moment.
@Horsmann any particular thing you want me to test?
> are there any (larger) multi-label classification data sets you could give away?

See http://meka.sourceforge.net (bottom).
@daxenberger At the moment, a general test of functionality. The tests are passing, so I assume everything should work, but since the test cases do not cover everything, it would be good if you could run 0.9.0 experiments on the snapshot and see what happens.
Speed-wise, because we now have a new task to collect the outcomes, we did not gain much, but the scalability has certainly increased.
I started experimenting with how to avoid using the feature store.
The feature store is probably the biggest (performance) bottleneck at the moment: (i) the store grows until it holds all features, which will be plenty for large data sets; (ii) it becomes extremely slow for large data sets when the gigantic store has to reallocate memory.
The sparse feature store certainly avoids this problem to some extent. Still, in almost any case it would be more reasonable to transform an Instance immediately into the format of the classifier. This avoids the potentially expensive job of holding the store in memory, i.e. first writing everything into memory just to write it to disc once the last feature has been added to the store.

**Some classifiers need to know all labels in advance**

Weka, and maybe other classifiers, are a bit more tricky, since they have a header which requires knowing all labels in advance. In this case, one could write to disc instead of to memory and keep track of all occurring labels/features during writing. This file would have to be read in later and then transformed into the respective classifier format. This should (i) be equally fast for small data sets and (ii) still make the whole process a lot more resource-friendly.
I added a branch where I dirty-hacked such a processing for Weka document/unit mode (hard-coded; other classifiers will crash on this branch at the moment). I used Weka as the reference example because it seems the most difficult to tackle. Weka now writes the instances to gson and reads them back from this format later on; a simplified sketch of this detour is shown below. I think this should in general speed up TC even further.
The feature store would vanish eventually, with no parallel maintenance of both versions: if streaming works for all classifiers, the store(s) should vanish.
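The detour on the branch is essentially newline-delimited JSON; conceptually something like this (a simplified sketch using Gson, not the actual branch code):

```java
import java.io.IOException;
import java.io.Writer;
import java.util.LinkedHashSet;
import java.util.Set;

import org.dkpro.tc.api.features.Instance;

import com.google.gson.Gson;

/**
 * Each instance is serialized to one JSON line as soon as it arrives,
 * while the information needed later for the Weka header (here: the
 * outcomes) is tracked on the side. After the last CAS, the file is
 * read back once and converted to ARFF with the then-complete header.
 */
public class JsonInstanceStreamSketch {

    private final Gson gson = new Gson();
    private final Writer out;
    private final Set<String> seenOutcomes = new LinkedHashSet<>();

    public JsonInstanceStreamSketch(Writer out) {
        this.out = out;
    }

    public void write(Instance instance) throws IOException {
        seenOutcomes.add(instance.getOutcome());
        out.write(gson.toJson(instance));
        out.write('\n');
    }

    public Set<String> getSeenOutcomes() {
        return seenOutcomes;
    }
}
```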