dkpro / dkpro-tc

UIMA-based text classification framework built on top of DKPro Core and DKPro Lab.
https://dkpro.github.io/dkpro-tc/

Machine Learning Adapter should be defined in a dimension #435

Closed - Horsmann closed this 6 years ago

Horsmann commented 6 years ago

The MLA to be used should be defined as a dimension. This would allow running several classifiers, not just parametrizations of the same classifier.

This is essentially already possible by wrapping the TC experiment in an outer for-loop, but why so complicated?! :)

Furthermore, it would be preferable to find a more elegant solution than these .class.getName() calls, not just for the MLA but also for the classifier type (SVM, RandomTree, etc.). Simply instantiating an object would be more Java-ish and probably easier to use than referencing class names.

i.e.

 Dimension<List<String>> dimClassificationArgs =
         Dimension.create(DIM_CLASSIFICATION_ARGS,
                 asList(new String[] { WekaAdapter.class.getName(), SMO.class.getName(), "-C", "1.0" }),
                 asList(new String[] { LiblinearAdapter.class.getName(), "-s", "1", "-C", "1.0" }),
                 asList(new String[] { LibsvmAdapter.class.getName(), "-s", "1", "-C", "10.0" }),
                 ....
         );
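
A hypothetical object-based variant of the same declaration, just to illustrate the "more Java-ish" idea (this is not an existing API; it assumes the adapters and classifiers could simply be instantiated, e.g. via no-arg constructors):

 // Hypothetical sketch: adapter and classifier passed as objects instead of class names
 Dimension<List<Object>> dimClassificationArgs =
         Dimension.create(DIM_CLASSIFICATION_ARGS,
                 asList(new WekaAdapter(), new SMO(), "-C", "1.0"),
                 asList(new LiblinearAdapter(), "-s", "1", "-C", "1.0"),
                 asList(new LibsvmAdapter(), "-s", "1", "-C", "10.0"));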
Horsmann commented 6 years ago

@reckart Is it possible in Lab to initialize a new Lab-Task when the experiment is already running?

I am trying to create a facade task which then executes the actual machine learning task in its execute() method. This requires that the actual machine learning task is instantiated rather late, during the execution of the facade task. I want to provide all discriminators the facade task has to the instantiated task.

I am trying to do this at the moment, but the @Discriminator space is not available in the newly created task, i.e. everything is null:

public class DkProTcShallowTestTask extends ExecutableTaskBase implements Constants {

    @Discriminator(name = DIM_CLASSIFICATION_ARGS)
    protected List<Object> classArgs;

    List<ReportBase> reports = new ArrayList<>();

    public DkProTcShallowTestTask() {

    }

    @Override
    public void execute(TaskContext aContext) throws Exception {

        if (classArgs == null || classArgs.isEmpty()) {
            throw new IllegalArgumentException(
                    "Dimension [" + DIM_CLASSIFICATION_ARGS + "] expected but was not found");
        }

        // The first entry of the classification args is expected to be the adapter; it
        // provides the actual machine learning test task to be executed.
        TcShallowLearningAdapter adapter = (TcShallowLearningAdapter) classArgs.get(0);
        ExecutableTaskBase testTask = adapter.getTestTask();
        testTask.initialize(aContext);

        // Copy all imports of the facade task to the wrapped task
        Map<String, String> imports = getImports();
        for (String k : imports.keySet()) {
            testTask.addImport(k, imports.get(k));
        }

        // Copy all discriminators of the facade task to the wrapped task
        Map<String, String> descriminators = getDescriminators();
        for (String k : descriminators.keySet()) {
            testTask.setDescriminator(k, descriminators.get(k));
        }

        Map<String, String> resolvedDescriminators = testTask.getResolvedDescriminators(aContext);

        testTask.addReport(adapter.getOutcomeIdReportClass());

        testTask.execute(aContext);
    }

}
reckart commented 6 years ago

The configuration of tasks happens e.g. here in the BatchTaskEngine:

org.dkpro.lab.engine.impl.BatchTaskEngine.executeConfiguration(BatchTask, TaskContext, Map<String, Object>, Set)

        // Configure subtasks
        for (Task task : aConfiguration.getTasks()) {
            aContext.getLifeCycleManager().configure(aContext, task, aConfig);
        }
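
Applied to the facade task sketched above, this would amount to roughly the following (a sketch only; parameterConfig stands in for the current parameter-space configuration as a Map<String, Object>, which the facade task would somehow need access to):

        // Let the life-cycle manager inject the current configuration into the nested task
        // before running it (parameterConfig is a hypothetical placeholder).
        aContext.getLifeCycleManager().configure(aContext, testTask, parameterConfig);
        testTask.execute(aContext);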
Horsmann commented 6 years ago

Thanks. The discriminators are known now. Next problem:

I start and execute the new task like this:

    TaskExecutionService execService = aContext.getExecutionService();
    TaskExecutionEngine engine = execService.createEngine(testTask);
    String run = engine.run(testTask);

This works so far. My problem comes later, when the reports are executed and I want to iterate over all executed tasks:

for (TaskContextMetadata subcontext : getSubtasks()) {
   //do something
}

This getSubtasks() list does not contain my newly created and executed task. How do I inject my new task so that this method is aware of it?

Horsmann commented 6 years ago

@reckart So my problem is rather that the Discriminator.txt written by the EvaluationTask, which holds all subtasks, is not aware of my newly created task. I am looking for a way to inject the name of the additional task I am running so that it is written to disk together with all the other task names.

This happens imo at the moment in BatchTaskEngine.run(Task t) at this line: cfg.setAttribute(SUBTASKS_KEY, executedSubtasks.toString());

This area of the code is fully encapsulated and I don't really see a way to sneak my new task name into this executedSubtasks set.

reckart commented 6 years ago

BatchTasks maintain a "scope" (which is essentially a list of executed task IDs):

                    // set scope here so that the inherited scopes are considered
                    // set scope here so that tasks added to scope in this loop are considered
                    if (task instanceof BatchTask) {
                        ((BatchTask) task).setScope(scope);
                    }

... but the SUBTASKS_KEY is afaik not percolated upwards - it is maintained and written per BatchTask only.

Why do you want to sneak it in?

Horsmann commented 6 years ago

The reports need to have access to the executed tasks, i.e. know their names. This is done by reading the Discriminators.txt. The new task I executed is not contained in the list that is written to the Discriminators.txt, which means the reports don't find it and eventually crash.

The whole effort comes from trying to make the machine learning adapter part of the dimension declaration, i.e. asList(new String[] { LiblinearAdapter.class.getName(), "-s", "1", "-C", "1.0" }) instead of asList(new String[] { "-s", "1", "-C", "1.0" }). The former would allow specifying X experiments with different classifiers, which is not possible at the moment.

At the moment, the adapter name is provided at start-up, which limits an experiment to executing only one machine learning adapter at a time. This limitation comes from not having the Lab discriminators available in the ExperimentTrainTest task which wires the experiment together, i.e. I have to provide there the machine learning TestTask I want to execute later; the adapter has to be known at wiring time. This led to the idea of a "dummy" task that serves as a facade task. The facade task later executes the actual task, which I did not know at experiment start-up. This worked so far, but now the reports bite me once again, because the task executed by the facade task slips through the recording mechanism and does not show up in the Discriminators.txt. The reports have to locate this task.

I think I could rewrite all reports to make it work again, but this would certainly lead to a maintenance catastrophe if I introduced a ghost task that is there and gets executed, but never shows up in the Discriminators.txt of the controlling task...

reckart commented 6 years ago

I think your facade task needs to implement BatchTask/extend DefaultBatchTask - because it has child tasks - even if it is only one child task. That should also save you from managing all the metadata manually.
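
In code, this could look roughly as follows (a sketch with a placeholder class name, assuming DefaultBatchTask's addTask(Task) and reusing the adapter lookup from the facade task above; imports omitted as in the other snippets). Whether classArgs is already populated when initialize runs depends on when the life-cycle manager configures the task, as discussed earlier:

public class DkProTcShallowTestTaskFacade extends DefaultBatchTask implements Constants {

    @Discriminator(name = DIM_CLASSIFICATION_ARGS)
    protected List<Object> classArgs;

    @Override
    public void initialize(TaskContext aContext) {
        super.initialize(aContext);

        // Register the adapter's test task as a child task, so the batch machinery records
        // it like any other subtask and no imports/discriminators/metadata have to be
        // copied by hand.
        TcShallowLearningAdapter adapter = (TcShallowLearningAdapter) classArgs.get(0);
        ExecutableTaskBase testTask = adapter.getTestTask();
        testTask.addReport(adapter.getOutcomeIdReportClass());
        addTask(testTask);
    }
}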

reckart commented 6 years ago

Your reports can recursively inspect the subtasks executed by nested batch tasks in order to get a full overview.

Horsmann commented 6 years ago

Ah, thanks. This works :)

Horsmann commented 6 years ago

@reckart I had to game DKPro Lab a bit to make it do what I want. The new task I created essentially forces a re-initialization every time it is executed and deletes the list of sub-tasks at each re-init, in order to push in the new machine learning task I get from the dimension: https://github.com/dkpro/dkpro-tc/blob/master/dkpro-tc-core/src/main/java/org/dkpro/tc/core/task/DKProTcShallowTestTask.java

Any thoughts on this, anything that might cause problems later on which is not immediately visible? It's probably not Lab-like to do what I am doing there?

reckart commented 6 years ago

As far as I can see, isInitialized() is only called by DefaultLifeCycleManager.destroy(...), basically to check whether destroy() has already been called. I would expect your code to work even without overriding it.

reckart commented 6 years ago

I think this approach is ok. It is a "feature" of DKPro Lab that the task structure is not predefined but can change dynamically while the experiment is executed. This does not allow pre-calculating the execution graph, but it enables the kind of trick you are pulling off here.

Horsmann commented 6 years ago

splendid :D.