dkpro / dkpro-tc

UIMA-based text classification framework built on top of DKPro Core and DKPro Lab.
https://dkpro.github.io/dkpro-tc/

Add a learning curve experiment #516

Closed. Horsmann closed this issue 5 years ago.

Horsmann commented 6 years ago

Introduce a new experiment type which runs a learning curve experiment.

The experiment should support both large and small-sized experiments, i.e. averaging over all combinations at a learning-curve stage might become extremely expensive for large N, so a small-sized version that skips some steps is necessary. The smaller version leads to a less steady curve but runs a lot faster for larger datasets.

reckart commented 6 years ago

@Horsmann I'm curious: at what granularity does the report work? CAS-by-CAS or is the control more fine-grained, e.g. sentence-by-sentence?

Horsmann commented 6 years ago

@reckart This is based on the CAS bucket system that is also used for the regular cross-validation. It should be possible to use the CAS multiplier to force this down to sentence-by-sentence, but this is not yet supported. Alternatively, if your reader creates a separate CAS for each sentence, this should be usable as-is once fully implemented.

Horsmann commented 6 years ago

@reckart In the fold dimension, Lab creates buckets containing the training data. Is it somehow possible to provide this information (which bucket ids the current training set is based on) in the tasks? I currently have the problem that I lose track of which run is based on which set of training buckets. This makes it impossible to tell which learning curve runs belong together. This information exists in the Lab dimension, but I am not sure how to access it later on in a report~

reckart commented 6 years ago

The FoldDimensionBundle seems to write the information into the parameter space, so you should be able to pick it up from there:

org.dkpro.lab.task.impl.FoldDimensionBundle.current()

        data.put(getName()+"_training", trainingData);
        data.put(getName()+"_validation", buckets[validationBucket]);

Actually, if the task subscribes to the value as a Discriminator, then it should also be written to the corresponding file in the run folder.
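For illustration, the subscription could be a field like this (a sketch; "folds" stands for whatever name the bundle was actually created with):

    // Sketch: subscribing in the task to the training-bucket info from the bundle
    @Discriminator(name = "folds_training")
    private Collection<String> trainingData;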

Horsmann commented 6 years ago

The bucketing is resolved at the point in the code you pasted. I have all .bin files later on, but I do not know how many buckets the files correspond to. I think I will have to introduce a new piece of information at this point in the code and import a new dimension in the MLA adapters, i.e.

    Map<String, Collection<String>> data = new HashMap<String, Collection<String>>();
    data.put(getName() + "_training", trainingData);
    data.put(getName() + "_validation", buckets[learningCurveRun.test]);
    data.put(getName() + "_numTrainingFolds", bucketCount);

This requires an (unused) import in each MLA TestTask, just to make this bucketing information available during reporting. This would probably work, but it is also a pretty ugly hack imo.

reckart commented 6 years ago

Hm. Why do you need an import? I'd say an @Attribute (i.e. not even a @Discriminator) should be enough, no? After all, this is metadata and not a file that you would need to import.

Also, trainingData and buckets[...] both return lists with the details about what goes into what. Is that not sufficient? If you need a temporal ordering of the buckets, I suppose you could e.g. sort the runs by their folder creation time. That said, it probably wouldn't hurt to add something like getName()+"_maxCount" and getName()+"_currentCount" - I suppose getName()+"_maxCount" would be equal to the bucket count.
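As a sketch (since current() fills a Map&lt;String, Collection&lt;String&gt;&gt;, as in your snippet above, the counts would have to be wrapped as string collections; "fold" stands for a hypothetical index of the current fold):

    data.put(getName() + "_currentCount", Collections.singleton(Integer.toString(fold)));
    data.put(getName() + "_maxCount", Collections.singleton(Integer.toString(buckets.length)));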

reckart commented 6 years ago

@Horsmann another option would be to allow @Attribute annotations in reports and have the TaskBase check for attributes in reports in its analyze() method.

Horsmann commented 6 years ago

@reckart This sounds a lot cleaner than hacking this information into each MLA task. Where/how do I make this option available?

reckart commented 6 years ago

@Horsmann probably you just loop over the reports and call

        analyze(report.getClass(), Property.class, analyzedAttributes);

on them in TaskBase.analyze() (sorry, I said @Attribute before, but it seems to be @Property).

Horsmann commented 6 years ago

Hm, the information is in the DISCRIMINATORS.txt, so I added the Discriminators; but the type-cast magic fails for some reason.

In TaskBase I have:

    @Override
    public final void analyze()
    {
        analyzedAttributes = new HashMap<String, String>();
        analyzedDiscriminators = new HashMap<String, String>();
        analyze(getClass(), Property.class, analyzedAttributes);
        analyze(getClass(), Discriminator.class, analyzedDiscriminators);

        if (reports != null) {
            for (Report r : reports) {
                analyze(r.getClass(), Discriminator.class, analyzedDiscriminators);
            }
        }
    }

In my report it's now:

    @Discriminator(name = DIM_NUM_TRAINING_FOLDS)
    private Collection<String> trainFolds;

Exception in thread "main" java.lang.IllegalArgumentException: Can not set java.util.Collection field org.dkpro.tc.ml.report.LearningCurveReport.trainFolds to org.dkpro.tc.ml.ExperimentLearningCurve
    at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:167)
    at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:171)
    at sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:58)
    at sun.reflect.UnsafeObjectFieldAccessorImpl.get(UnsafeObjectFieldAccessorImpl.java:36)
    at java.lang.reflect.Field.get(Field.java:393)
    at org.dkpro.lab.task.impl.TaskBase.analyze(TaskBase.java:461)
    at org.dkpro.lab.task.impl.TaskBase.analyze(TaskBase.java:124)
    at org.dkpro.lab.engine.impl.DefaultLifeCycleManager.initialize(DefaultLifeCycleManager.java:94)
    at org.dkpro.lab.engine.impl.BatchTaskEngine.run(BatchTaskEngine.java:92)
    at org.dkpro.lab.engine.impl.DefaultTaskExecutionService.run(DefaultTaskExecutionService.java:52)
    at org.dkpro.lab.Lab.run(Lab.java:113)

I am not sure what is causing the problem.

Horsmann commented 6 years ago

@reckart

I am not sure why retrieving the value of the discriminator fails

    protected void analyze(Class<?> aClazz, Class<? extends Annotation> aAnnotation, Map<String, String> props)
    {
        if (aClazz.getSuperclass() != null) {
            analyze(aClazz.getSuperclass(), aAnnotation, props);
        }

        for (Field field : aClazz.getDeclaredFields()) {
            field.setAccessible(true);
            try {
                if (field.isAnnotationPresent(aAnnotation)) {
                    String name;

                    Annotation annotation = field.getAnnotation(aAnnotation);

                    if (StringUtils.isNotBlank(ParameterUtil.getName(annotation))) {
                        name = getClass().getName() + "|" + ParameterUtil.getName(annotation);
                    }
                    else {
                        name = getClass().getName() + "|" + field.getName();
                    }

                    String value = Util.toString(field.get(this)); // <-- causes the exception

The name is resolved to the ExperimentLearningCurve, which is the class that wires the tasks together. The problem could be that this information does not yet exist at init time; it is created afterwards and not yet available when the analyze code is executed? Or is it that field.get(this) is called with the task instance (the ExperimentLearningCurve) while the field is declared on the report class, which would match the exception message?

reckart commented 6 years ago

What is the value of these fields when they are written to the DISCRIMINATORS.txt/ATTRIBUTES.txt?

reckart commented 6 years ago

And what type is bucketCount? I don't see that currently in the FoldDimensionBundle. Is it simply buckets.length?

Horsmann commented 6 years ago

I am using a new implementation of the FoldDimensionBundle, which is just based on the fold dimension and does something similar. https://github.com/dkpro/dkpro-tc/blob/5c8245b9f253fde1c8dd67d975141046d58f4ee2/dkpro-tc-ml/src/main/java/org/dkpro/tc/ml/LearningCurveDimensionBundle.java lines 187-190 hack in the bucket information. This class is in TC at the moment but should probably move into Lab in the future.

At the moment, since I thought it might come in handy later on, I also write the bucket ids used in a training set. The size determines the number of folds and the names identify the folds. So it's a collection of Strings at this point in time, like the other information.

> What is the value of these fields when they are written to the DISCRIMINATORS.txt/ATTRIBUTES.txt?

It doesn't even get to the point of creating a file. The error happens while the DefaultLifeCycleManager is in its initialization phase; there are no files at this point. It seems to be a severe problem that the report has a discriminator annotation :/

    @Override
    public void initialize(TaskContext aContext, Task aConfiguration)
        throws LifeCycleException
    {
        // Preparation hook for batch task in case it wants to do anything to itself
        // before the subtasks are executed (e.g. adding subtasks or a parameter space)
        aConfiguration.initialize(aContext);

        aContext.message("Initialized task ["+aConfiguration.getType()+"]");

        aConfiguration.analyze();

        aContext.message("Analyzed task configuration ["+aConfiguration.getType()+"]");

        try {
            aConfiguration.persist(aContext);
            aContext.message("Persisted task configuration ["+aConfiguration.getType()+"]");
        }
        catch (IOException e) {
            throw new LifeCycleException(e);
        }
    }

reckart commented 6 years ago

Ok, good you found a working solution :)

Horsmann commented 6 years ago

No, I haven't ^^. Using the annotation in a report leads to an early crash of the entire experiment execution~

Is there a way to fix the crashing report problem?

reckart commented 6 years ago

What I can say is that I'm pretty sure we shouldn't use @Discriminator but @Property instead - the reason being that reports should not influence the dependencies between tasks.

Then, I don't understand why the Lab would store a value of type ExperimentLearningCurve as the value of a parameter called DIM_NUM_TRAINING_FOLDS. From the stack trace, it seems to me that it is rather a problem happening when the value is set in the parameter space, not when the value is retrieved from the parameter space and injected into the task. But the problem is probably related to the below:

Also, when we aggregate the attributes from the task and the report, then during the configuration of the task (org.dkpro.lab.engine.impl.DefaultLifeCycleManager.configure(TaskContext, Task, Map&lt;String, Object&gt;)), the Lab may try to inject attributes defined by the report into the task (which obviously doesn't work). So that code would need to be made more resilient: if a field does not exist in the task, it should simply be ignored instead of generating an error. Also, the injection code would need to loop over the reports and try injecting there.
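The skipping part could look roughly like this (a sketch, not the actual Lab code; findField is a hypothetical helper that resolves a parameter name to a field, or returns null):

    private void inject(Object aTarget, Map<String, Object> aConfiguration)
        throws IllegalAccessException
    {
        for (Map.Entry<String, Object> e : aConfiguration.entrySet()) {
            Field field = findField(aTarget.getClass(), e.getKey());
            if (field == null) {
                continue; // no matching field declared here -> ignore instead of failing
            }
            field.setAccessible(true);
            field.set(aTarget, e.getValue());
        }
    }

The same loop would then be run once more per report, with the report as the target.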

Horsmann commented 6 years ago

Hm, I am not sure how to inject the information into the report. Avoiding the exception during analyze() is rather easy, although then no property is added, which is probably fine since the report defines no new values and is only supposed to import existing ones.

How/where do I get access to the outer reports in a way that lets me inject the properties? I am probably looking at the wrong place. The configure() method seems to only work with the reports directly attached to a task. The report in which I import the Property is an outer report for aggregation over every run.

reckart commented 6 years ago

You mean the report in question is attached to the top-level batch task?

Horsmann commented 6 years ago

Yes, I import the property in a report that is attached to the outer/top-level batch task - the task that would be the CrossValidationExperiment task in a normal CV setup. Does this change the way to access this information?

reckart commented 6 years ago

A top-level task cannot have any attributes or discriminators. If you have a top-level BatchTask X with a parameter space P, then X iterates over P and configures all its subtasks accordingly. But the top-level task itself is parameterless.

Horsmann commented 6 years ago

Hm, so the original way of making sure that the MLA imports the discriminator I need is the only way for this to work? Then I can simply load this information from the MLA discriminator directly. Hm, ugly :(. Alternatively, I could add the dimension to the FeatureExtraction task, but then I have to back-track which FeatureExtraction task belongs to the MLA I am currently processing. All information is stored in Sets, so I can't really trust the order in which I retrieve the tasks from disk?!

reckart commented 6 years ago

> Hm, so the original way of making sure that the MLA imports the discriminator I need is the only way for this to work? Then I can simply load this information from the MLA discriminator directly.

Well, at least you cannot expect injections to happen at the level of a batch task which itself does not live in a parameter space. There is simply nothing to inject here. If you add a task under that batch task and attach your report there, that would be another thing. I think I still haven't entirely understood what you want to do. What is an MLA?

If you want to do an aggregating report, it needs to be at the level of a batch task, but of course that report cannot depend on any of the parameters defined within the parameter space of the same batch task - it could only depend on parameters defined by a higher-level batch task (although even this is currently not implemented).

An aggregating report should iterate over the list/set of executions that it receives from the batch task and then inspect each execution to obtain the results it wants to aggregate. The order in which subtasks are returned by BatchReportBase.getSubtasks() should correspond to the execution order.
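The skeleton of such a report could look like this (a sketch; the class name is a placeholder, and it assumes Lab's StorageService and PropertiesAdapter as used elsewhere in this thread):

    public class LearningCurveAggregateReport extends BatchReportBase {
        @Override
        public void execute() throws Exception {
            StorageService store = getContext().getStorageService();
            for (TaskContextMetadata subtask : getSubtasks()) {
                // read the metadata persisted for each executed subtask
                Map<String, String> discriminators = store.retrieveBinary(subtask.getId(),
                        Task.DISCRIMINATORS_KEY, new PropertiesAdapter()).getMap();
                // ... inspect the values and collect whatever needs to be aggregated
            }
        }
    }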

Horsmann commented 6 years ago

Ok, I'll try to explain:

A full-scale learning curve that splits the data into 10 blocks/folds could look like this:

1st Level - 1 train block, 1 test block:

    Train    Test
    [1]      [0]
    [2]      [0]
    [3]      [0]
    [4]      [0]
    ...
    [0]      [1]
    [2]      [1]
    [3]      [1]
    ...

(The curve would then average over all of the above runs for an average one-train-fold performance.)

2nd Level - 2 train blocks, 1 test block:

    Train     Test
    [1, 2]    [0]
    [2, 3]    [0]
    [3, 4]    [0]
    ...
    [0, 2]    [1]
    [2, 3]    [1]
    [3, 4]    [1]
    ...

(The curve would then average over all of the above runs for an average two-train-fold performance.)

And so on... This is the most expensive setup possible, I know, but as a proof of concept I would like to implement it this way; versions with less averaging can come later. The underlying problem I have is the following: I have to track which of all the executions used exactly 1 fold for training (to find all 1-fold runs), which used 2, 3, 4, ... N folds for training. Otherwise I cannot correctly average the results. I only have the actual *.bin files used, but I lost the information about how many folds these M .bin files correspond to.
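Once the fold count per run were available, the averaging itself would be straightforward, roughly like this (a sketch; Run and accuracyOf are placeholders):

    Map<Integer, List<Double>> byFoldCount = new HashMap<>();
    for (Run run : runs) {
        // group each run's score by the number of folds used for training
        byFoldCount.computeIfAbsent(run.numTrainingFolds, k -> new ArrayList<>())
                .add(accuracyOf(run));
    }
    byFoldCount.forEach((folds, scores) -> {
        double avg = scores.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        // one point of the learning curve: x = folds, y = avg
    });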

I am trying to hack Lab in a way that lets me pipe this information through. In the new dimension I built, where the buckets are created, this information is available. As-is, the bucketing is lost, but it is needed at the very end in the report that aggregates and averages the data. In other words, I need information that exists very early (in the dimension) at the very end, when the outer task's reports are executed~

Do you have a better idea how/where to get this information?

reckart commented 6 years ago

Problem 1: the "bucket index" and "max bucket index" are not accessible

I think your solution to have the dimension add these to the parameter space is good. We should integrate this into the default fold dimension.

Problem 2: even if the "bucket index" and "max bucket index" are in the parameter space, they do not end up in the ATTRIBUTES.txt or DISCRIMINATORS.txt on disk

This is because only parameter space values actually injected as @Property or @Discriminator get stored in these files. We could (as we tried) change the code such that e.g. reports can have @Property annotations if they need these values. However, this doesn't seem to help in this case, because the report is an aggregating report and thus not governed by the parameter space.

We could try to change BatchReportBase and BatchTaskEngine such that aggregating reports can require subtasks to request certain attributes. That might be an elegant solution, but probably quite a bit of work.

A very simple (but overkill) option would be to change TaskBase.persist(TaskContext) such that it stores not only the properties/attributes and discriminators to disk, but also the entire configuration from the parameter space. However, normally a task does not have access to the entire configuration, so that is not a generic solution.

What you could do in your particular task is implement the ConfigurationAware interface and implement setConfiguration(...) such that it dumps the whole parameter space into the task's execution, e.g. as CONFIGURATION.txt. That way your reports could access the data, although there would still be a coupling between the report and the task via the CONFIGURATION.txt. Would that help?
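A rough sketch of that last idea (class and key names are placeholders; the PropertiesAdapter usage mirrors what Lab does for ATTRIBUTES.txt):

    public class MySubTask extends ExecutableTaskBase implements ConfigurationAware {
        private Map<String, Object> configuration;

        @Override
        public void setConfiguration(Map<String, Object> aConfig) {
            // Lab hands in the full parameter-space configuration for this run
            configuration = aConfig;
        }

        @Override
        public void execute(TaskContext aContext) throws Exception {
            Map<String, String> cfg = new HashMap<>();
            configuration.forEach((k, v) -> cfg.put(k, String.valueOf(v)));
            aContext.storeBinary("CONFIGURATION.txt",
                    new PropertiesAdapter(cfg, "Parameter space"));
            // ... actual work of the task
        }
    }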

Horsmann commented 6 years ago

Hi Richard, this sounds like it could work. The information contains the bucket info I need, and the implementation of the dimension is already ConfigurationAware. However, I am not sure how, at this point, I get a file pointer to store the new file at a reasonable location. getContext() is not available within the dimension. I somehow have to place this file in the tasks; how do I do that?

This doesn't work, no context:

    aContext.storeBinary(ATTRIBUTES_KEY, new PropertiesAdapter(getAttributes(), "Task properties"));

reckart commented 6 years ago

The dimension isn't responsible for storing stuff. It's only responsible for generating parameters. The configuration would need to be saved in the relevant subtasks of your batch task such that the aggregating report on the batch task can pick it up from there.

Horsmann commented 6 years ago

Thanks. This seems to work :). Should this customized new dimension https://github.com/dkpro/dkpro-tc/blob/5c8245b9f253fde1c8dd67d975141046d58f4ee2/dkpro-tc-ml/src/main/java/org/dkpro/tc/ml/LearningCurveDimensionBundle.java be moved to Lab? Right now it's in TC. I am not sure whether it is general enough for Lab?

reckart commented 6 years ago

@Horsmann I think the bucket ID and the bucket count should be added to the FoldDimensionBundle so that you don't need your custom bundle. IMHO it's a pretty generic and useful addition.

Horsmann commented 6 years ago

@reckart I can add this change to the FoldDimensionBundle, but the actual LearningCurveDimensionBundle does a bit more extra stuff which I need. It's not just adding the bucket count info but also doing the learning curve splits.

reckart commented 6 years ago

Ok - counting the splits is probably rather specific to the learning curve thing. I thought the number of splits was equal to the number of buckets.