dkpro / dkpro-tc

UIMA-based text classification framework built on top of DKPro Core and DKPro Lab.
https://dkpro.github.io/dkpro-tc/

Replacing the feature store #403

Closed Horsmann closed 7 years ago

Horsmann commented 7 years ago

I started experimenting with how to avoid using the feature store.

The feature store is probably the biggest (performance) bottleneck at the moment: (i) the store grows until it holds all features, which will be plenty for large datasets, and (ii) it becomes extremely slow for large data sets when the gigantic store has to reallocate memory.

The sparse feature store certainly avoids this problem to some extent. Still, in almost any case it would be more reasonable to transform each Instance immediately into the format of the classifier. This avoids the potentially expensive job of holding the store in memory, i.e. first writing everything into memory just to dump it to disc once the last feature has been written into the store.

Some classifiers need to know all labels in advance. Weka (and maybe other classifiers) is a bit more tricky because its header file requires knowing all labels up front. In this case, one could write to disc instead of to memory and keep track of all occurring labels/features during writing. This file would have to be read in later on and then transformed into the respective classifier format. This should be (i) equally fast for small data sets and (ii) still make the whole process a lot more resource friendly.
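To make that idea concrete, here is a minimal sketch (not actual TC code; all names are made up, and an in-memory buffer stands in for the intermediate file on disc): instances are written out immediately while the label set is tracked on the side, and the header is only rendered once the last instance has been seen.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.StringJoiner;

// Hypothetical sketch: stream instances to a body buffer while collecting
// the label set, then emit the header once all labels are known.
public class StreamingLabelCollector {
    private final Set<String> seenLabels = new LinkedHashSet<>();
    private final StringBuilder body = new StringBuilder();

    // In a real implementation this would append to a file on disc
    // instead of an in-memory buffer.
    public void writeInstance(List<Double> features, String outcome) {
        seenLabels.add(outcome);
        StringJoiner row = new StringJoiner(",");
        for (Double f : features) {
            row.add(f.toString());
        }
        body.append(row).append(",").append(outcome).append("\n");
    }

    // Called once the last instance has been written; only now is the
    // full label set known, so the header can be produced.
    public String finish() {
        return "labels: " + String.join(",", seenLabels) + "\n" + body;
    }
}
```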

I added a branch where I dirty-hacked such a processing for Weka document/unit (hard-coded; other classifiers will crash on this branch at the moment). I used Weka as the reference example because it seems the most difficult to tackle. Weka now writes the instances to JSON (via Gson) and reads them back from this format later on.

I think this should in general speed up TC even further, and the feature store would vanish eventually, with no parallel maintenance of both versions. If streaming works for all classifiers, the store(s) should go away.

Horsmann commented 7 years ago

See branch issue403 for the prototype

daxenberger commented 7 years ago

Alright, sounds reasonable. Thanks for tackling this!

Horsmann commented 7 years ago

@daxenberger I had a look at the feature filter. As far as I can see, only one filter is provided in TC at the moment, for achieving a uniform distribution of the outcomes. Here are my thoughts on revisiting how the filters are used. Maybe you can comment on this matter :)

I was wondering whether a feature filter should remove entire instances including all their features. The way the feature filter can be used at the moment makes it necessary to write to disc whenever a filter is provided, since the filter might have to know all features/outcomes to perform its operation. This makes providing a filter expensive.

What I find better (at the moment) is if altering the data, i.e. removing instances or creating a uniform distribution - in short, bulk operations that need all data - is performed on the user side. Either prepare the input data in a uniform way, or annotate targets in a way that achieves a uniform distribution.

The feature filter should only filter single features within an instance, either by name or by value? This limits how the feature filter can be used, but appears conceptually cleaner to me. Furthermore, it makes it easier to implement streaming: I can iterate over the filters and kick out features with names or values that should be removed, while streaming the data and writing it to disc.

Is anyone emotionally attached to the current filtering concept? And which machine learning classifiers use filtering? This is probably a Weka-targeted feature? The current functionality probably originates from the early TC days, when Weka was the gold standard for what TC should support?

daxenberger commented 7 years ago

You can find more information on why and how the filters were introduced in the first place in https://github.com/dkpro/dkpro-tc/issues/210. I guess, before changing anything here, you should first go through that thread. Uniform class/label distribution, which is a requirement for most Weka classifiers, could be achieved before or after creating the feature store. We could ask the user to make sure there is a uniform class distribution in training and test data (but mind that the user has little or no control over that in cross-validation settings. And the problem gets worse for multi-label scenarios, where we sometimes have a large and sparse label space (>100), so it might easily happen that some labels in the test data have not been seen during training). We could also balance classes right before classification in the TestTask.

Horsmann commented 7 years ago

Thanks.

Uniform class/label distribution, which is a requirement for most Weka classifiers,

This is more something that the user should guarantee, otherwise the classifier will be biased; but Weka wouldn't care if it gets 100 instances for A and only 10 for B?

but mind that the user has no or little control over that in cross-validation settings

using a filter that does that would not really make things better in this case? When the CV data is forcefully made uniform, some instances are just kicked out. It would be more like CV minus arbitrarily many instances?

So, at the moment it is enforced that only feature names and outcomes from training can occur in the testing phase? This filtering/balancing is less problematic, I think. I am just bothered by the UniformityFilter, because it requires loading all data for counting (this could also be streamed) and then re-reading everything for selective writing with the desired distribution. If this filter had its own interface, e.g. BulkFilter or so, I could distinguish this kind of filtering from a filter which says "drop all ngram_the features". The latter can easily be applied on the fly; it does not need to know anything about the overall distribution.
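To illustrate the distinction, here is a rough sketch of how such an interface split could look. The names (StreamFilter, BulkFilter, SimpleInstance) are purely illustrative and not existing TC API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of the proposed interface split between filters
// that can run on the fly and filters that need the whole data set.
public class FilterSketch {

    // Minimal stand-in for a TC Instance: feature name -> value plus outcome.
    public static class SimpleInstance {
        public final Map<String, Double> features;
        public final String outcome;

        public SimpleInstance(Map<String, Double> features, String outcome) {
            this.features = features;
            this.outcome = outcome;
        }
    }

    // Sees one instance at a time; can be applied while streaming to disc.
    public interface StreamFilter {
        SimpleInstance apply(SimpleInstance instance);
    }

    // Needs the whole data set at once, e.g. to enforce a uniform
    // outcome distribution by dropping instances.
    public interface BulkFilter {
        List<SimpleInstance> apply(List<SimpleInstance> all);
    }

    // Example stream filter: drop every feature whose name matches the
    // predicate, e.g. all "ngram_the" features.
    public static StreamFilter dropFeaturesNamed(Predicate<String> badName) {
        return instance -> {
            Map<String, Double> kept = new HashMap<>();
            for (Map.Entry<String, Double> e : instance.features.entrySet()) {
                if (!badName.test(e.getKey())) {
                    kept.put(e.getKey(), e.getValue());
                }
            }
            return new SimpleInstance(kept, instance.outcome);
        };
    }
}
```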

Horsmann commented 7 years ago

Ok. I had another look. Three things:

1) ExperimentTrainTest does not set the isTesting flag in the test task. Consequently, in ExtractFeaturesConnector, applyFeatureNameFilter() is never called in the collectionProcessComplete() method. This is probably a bug without consequences? I went back as far as 0.7.0; this flag has been missing for quite some time.

2) The name filtering is done by this snippet in the same connector class

        AdaptTestToTrainingFeaturesFilter filter = new AdaptTestToTrainingFeaturesFilter();
        // if feature space from training set and test set differs, apply the filter
        // to keep only features seen during training
        if (!trainFeatureNames.equals(featureStore.getFeatureNames())) {
            filter.setFeatureNames(trainFeatureNames);
            filter.applyFilter(featureStore);
        }

The filter implementation:

    @Override
    public void applyFilter(FeatureStore store)
    {
        if (store.isSettingFeatureNamesAllowed()) {
            store.setFeatureNames(this.trainingFeatureNames);
        }
    }

The problem, in particular for Weka, is that the feature store in use is the DenseFeatureStore, which unfortunately does not allow this operation. Hence, even without the issue mentioned before, this operation should never have had an effect?

3) It seems this feature-name thing is rather unimportant, and Weka can cope with this issue by itself?
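For reference, what the name filter is apparently meant to achieve needs no feature-store support at all. A hypothetical sketch (names made up, not TC code) that simply projects a test instance onto the training feature names:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: restrict a test instance to the feature names
// seen during training; unseen features are simply dropped.
public class TestToTrainProjection {

    public static Map<String, Double> project(Map<String, Double> testFeatures,
            Set<String> trainFeatureNames) {
        Map<String, Double> projected = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : testFeatures.entrySet()) {
            // keep only features that occurred in training
            if (trainFeatureNames.contains(e.getKey())) {
                projected.put(e.getKey(), e.getValue());
            }
        }
        return projected;
    }
}
```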

zesch commented 7 years ago

I am emotionally very attached to the current filtering concept. I also know that it is being used for some specific purposes in experiments that would not be easy to implement without filters.

-Torsten


Horsmann commented 7 years ago

So, currently my biggest problem is ensuring that only features which occurred during training are used in the testing phase. If I recall correctly, this is what TC is supposed to do, right?

I am talking about this code block here https://github.com/dkpro/dkpro-tc/blob/master/dkpro-tc-core/src/main/java/org/dkpro/tc/core/task/uima/ExtractFeaturesConnector.java#L188

And the implementation of the DenseFeatureStore, which does not allow setting feature names, shows that this functionality has never worked? (https://github.com/dkpro/dkpro-tc/blob/master/dkpro-tc-fstore-simple/src/main/java/org/dkpro/tc/fstore/simple/DenseFeatureStore.java#L159) Is this needed at all? It seems like it is not. The SparseFeatureStore does support setting the feature names, but it is not used as default by Weka.

Since I am touching this matter anyway, I would fix this. The question is: do we need it? Do we really have to ensure that no unseen features occur in testing? I tend to think that dealing with unknown features is the classifier's problem, and seemingly all our classifiers do that already. Or do you expect the classifier to perform better when this is ensured?

Horsmann commented 7 years ago

I have implemented streaming for all classifiers to write directly into the classifier format. Weka is the only adapter which has to take a detour and first write everything to disc (JSON) to collect all the information the API requires.

@daxenberger Is there a way to use Weka without having to know the total number of instances in advance? The API seems to always want to know everything in advance, which makes things a bit difficult when working with CASes, as I don't know what and how much information is still coming.

I am seriously considering abandoning the Weka API and creating the ARFF file by hand. I don't see much of a problem in doing things FIFO and creating the Weka header at the very end, when I have processed everything (and have all information). This declaration in advance, as the Weka API requires it, is a huge pain.
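A rough sketch of that hand-rolled ARFF idea (hypothetical code, not on the branch; rows are buffered in memory here where a temporary file would be used in practice): collect data rows and class labels FIFO, and only render the @relation/@attribute header once everything has been seen.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of writing ARFF without the Weka API: buffer the
// @data rows first, collect the class labels on the way, and emit the
// header only after the last instance.
public class ManualArffWriter {
    private final List<String> attributes;
    private final Set<String> classLabels = new LinkedHashSet<>();
    private final List<String> dataRows = new ArrayList<>();

    public ManualArffWriter(List<String> attributes) {
        this.attributes = attributes;
    }

    public void addInstance(double[] values, String outcome) {
        classLabels.add(outcome);
        StringBuilder row = new StringBuilder();
        for (double v : values) {
            row.append(v).append(',');
        }
        row.append(outcome);
        dataRows.add(row.toString());
    }

    // Only now is the full label set known, so the header can be written.
    public String toArff(String relation) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append('\n');
        for (String att : attributes) {
            sb.append("@attribute ").append(att).append(" numeric\n");
        }
        sb.append("@attribute class {").append(String.join(",", classLabels)).append("}\n");
        sb.append("@data\n");
        for (String row : dataRows) {
            sb.append(row).append('\n');
        }
        return sb.toString();
    }
}
```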

Nonetheless, @daxenberger, do you have any setups at hand which you could run on the issue403 branch to see if/how things go for actual setups? Correctness-wise, but also speed-wise.

daxenberger commented 7 years ago

Is there a way to use Weka without having to know the total number of instances in advance?

You don't need to know the total number of instances in advance. There might be other pieces of necessary information - not sure. However, abandoning the Weka API altogether and creating ARFFs manually has the huge problem that we cannot really make sure that the DataWriter produces valid output (or at least throws useful exceptions), which is a serious drawback IMHO.

do you have any setups at hand which you could use on the issue403 branch

I can try a few older things and see how it goes, will report back.

zesch commented 7 years ago

Maybe there is a way to check the validity of the written ARFF via Weka?

On the other hand, I guess that we understand the format well enough to write our subset of the format that is always valid.

Horsmann commented 7 years ago

@daxenberger @zesch Can someone make the ExperimentCrossValidation import the InitTask output folder? I can't figure out why the imports are not found.

The whole idea is to collect all outcomes already during the InitTask, while the data is iterated anyway. This works for TrainTest, but CrossValidation fails because it doesn't know the InitTask.

Horsmann commented 7 years ago

I gave up and reverted the changes; it should build again.

I am not fully convinced that the ArffSaver can be re-initialized without overwriting the old file. The process is CAS-based, as far as I can see, so I would need to re-init the ArffSaver without deleting the already written information. As far as I can tell, this is not supported.

zesch commented 7 years ago

You lost me. AFAIK ArffSaver is a built-in Weka class, so it should not be CAS-based. It's also not clear why you want to re-initialize. I thought the whole idea was about not using the Weka stuff but building a workaround ...

daxenberger commented 7 years ago

Can someone make the ExperimentCrossValidation import the InitTask output folder? I don't get the problem the imports are not found.

I don't think you can import them directly. But you could add a dimension for this - as we do for the list of files (l. 170 in ExperimentCrossValidation).

Horsmann commented 7 years ago

I don't really understand what such a dimension would look like.

@reckart I need to hack into a CrossValidation setup so that a file created by an outer task is made available to one (or all) nested tasks. Is there some way to pass information through from the outer Lab task to the inner, nested ones?

reckart commented 7 years ago

I assume you have this structure? I thought that should work.


OuterBatchTask {
   SubTaskA { produces: X }
   InnerBatchTask {
     SubTaskB { import: SubTaskA.X }
   }
}
Horsmann commented 7 years ago

I think this is what I did. Take a look here please: https://github.com/dkpro/dkpro-tc/blob/163cdfee628a053d644262bf02bda19e7f62d5d9/dkpro-tc-ml/src/main/java/org/dkpro/tc/ml/ExperimentCrossValidation.java#L224

I added this line to import the outer init task into the inner extract-features task (analogously for the extract-test task).

When I run CV with this modification, I get an import-not-found exception:

Caused by: org.dkpro.lab.storage.UnresolvedImportException: 
 -Unable to resolve import of task [org.dkpro.tc.core.task.ExtractFeaturesTask-Train-TwentyNewsgroupsCV] pointing to [task-latest://org.dkpro.tc.core.task.InitTask-TwentyNewsgroupsCV/preprocessorOutputTrain]: Resolved context [InitTask-TwentyNewsgroupsCV-20170516155643982] not in scope [MetaInfoTask-TwentyNewsgroupsCV-20170516155647285]
 -Unable to resolve import of task [org.dkpro.tc.core.task.ExtractFeaturesTask-Test-TwentyNewsgroupsCV] pointing to [task-latest://org.dkpro.tc.core.task.ExtractFeaturesTask-Train-TwentyNewsgroupsCV/output]; nested exception is org.dkpro.lab.storage.TaskContextNotFoundException: Task [org.dkpro.tc.core.task.ExtractFeaturesTask-Train-TwentyNewsgroupsCV] has never been executed.
 -Unable to resolve import of task [org.dkpro.tc.ml.weka.task.WekaTestTask-TwentyNewsgroupsCV] pointing to [task-latest://org.dkpro.tc.core.task.ExtractFeaturesTask-Test-TwentyNewsgroupsCV/output]; nested exception is org.dkpro.lab.storage.TaskContextNotFoundException: Task [org.dkpro.tc.core.task.ExtractFeaturesTask-Test-TwentyNewsgroupsCV] has never been executed.; nested exception is org.dkpro.lab.storage.UnresolvedImportException: Unable to resolve import of task [org.dkpro.tc.core.task.ExtractFeaturesTask-Train-TwentyNewsgroupsCV] pointing to [task-latest://org.dkpro.tc.core.task.InitTask-TwentyNewsgroupsCV/preprocessorOutputTrain]: Resolved context [InitTask-TwentyNewsgroupsCV-20170516155643982] not in scope [MetaInfoTask-TwentyNewsgroupsCV-20170516155647285]
    at org.dkpro.lab.engine.impl.BatchTaskEngine.executeConfiguration(BatchTaskEngine.java:263)
    at org.dkpro.lab.engine.impl.BatchTaskEngine.run(BatchTaskEngine.java:133)
    at org.dkpro.lab.engine.impl.BatchTaskEngine.runNewExecution(BatchTaskEngine.java:341)
    at org.dkpro.lab.engine.impl.BatchTaskEngine.executeConfiguration(BatchTaskEngine.java:235)
    ... 5 more
Caused by: org.dkpro.lab.storage.UnresolvedImportException: Unable to resolve import of task [org.dkpro.tc.core.task.ExtractFeaturesTask-Train-TwentyNewsgroupsCV] pointing to [task-latest://org.dkpro.tc.core.task.InitTask-TwentyNewsgroupsCV/preprocessorOutputTrain]: Resolved context [InitTask-TwentyNewsgroupsCV-20170516155643982] not in scope [MetaInfoTask-TwentyNewsgroupsCV-20170516155647285]
    at org.dkpro.lab.engine.impl.BatchTaskEngine$ScopedTaskContext.resolve(BatchTaskEngine.java:563)
    at org.dkpro.lab.engine.impl.DefaultTaskContextFactory.resolveImports(DefaultTaskContextFactory.java:149)
    at org.dkpro.lab.engine.impl.DefaultTaskContextFactory.createContext(DefaultTaskContextFactory.java:103)
    at org.dkpro.lab.uima.engine.simple.SimpleExecutionEngine.run(SimpleExecutionEngine.java:77)
    at org.dkpro.lab.engine.impl.BatchTaskEngine.runNewExecution(BatchTaskEngine.java:341)
    at org.dkpro.lab.engine.impl.BatchTaskEngine.executeConfiguration(BatchTaskEngine.java:235)
    ... 8 more
daxenberger commented 7 years ago

I remember having encountered this issue a while ago: https://github.com/dkpro/dkpro-lab/issues/42 Maybe the fix does not apply here?


reckart commented 7 years ago

Why are there two MetaTask executions?

Resolved context [InitTask-TwentyNewsgroupsCV-20170516155643982] not in scope [MetaInfoTask-TwentyNewsgroupsCV-20170516155647285]

Looks like you are using timestamps now - so 20170516155647285 is later than 20170516155643982 - I wonder why the import did not resolve to the later one...

Horsmann commented 7 years ago

@reckart @daxenberger The meta task is in the inner task. Each CV fold has its own one, as it seems, so each fold should execute a meta task. The issue Johannes linked looks like the problem I am running into here? I am not really sure how to proceed :/

reckart commented 7 years ago

I am not sure I have the time atm to look into this. Is there some kind of "check out and run" unit test which provokes this?

Horsmann commented 7 years ago

@reckart Yes. I prepared the setup that causes this error with the commit before this post.

Check out branch issue403 and run WekaTwentyNewsgroupsDemo (an example, not a JUnit test), located in dkpro-tc-examples/org.dkpro.tc.examples.single.document/WekaTwentyNewsgroupsDemo.

It should fail shortly afterwards with the aforementioned import exception.

reckart commented 7 years ago

I think I found the problem in DKPro Lab. Cf. https://github.com/dkpro/dkpro-lab/issues/101.

reckart commented 7 years ago

Ok, this should work now.

Horsmann commented 7 years ago

I added a new task into the pipeline: CollectionTask, which is a MetaCollection in a sense, but that name is already in use. The collection task runs over all data, train and test, to collect the outcomes.

Other MLAs perform more or less expensive operations to fish out the outcomes from the training files for mapping things. It seems overdue that this information is provided centrally. I am not very attached to the name of the task - if you have ideas for better names :) ...

Horsmann commented 7 years ago

Merged. Our Jenkins build passed.

Horsmann commented 7 years ago

@daxenberger Can you test a bit with the current master? I think things look good from my side so far. Btw., are there any (larger) multi-label classification data sets you could give away? The multi-label part has only one test case at the moment.

daxenberger commented 7 years ago

@Horsmann any particular thing you want me to test?

are there any (larger) multi-label classification data sets you could give away?

See http://meka.sourceforge.net (bottom)

Horsmann commented 7 years ago

@daxenberger At the moment, a general test of functionality. The tests are passing, so I assume everything should work, but since the test cases do not cover everything, it would be good if you could run 0.9.0 experiments on the snapshot and see what happens.

Speed-wise we did not gain much, because we now have an additional task to collect the outcomes, but scalability has certainly increased.