dkpro / dkpro-tc

UIMA-based text classification framework built on top of DKPro Core and DKPro Lab.
https://dkpro.github.io/dkpro-tc/

Enable passing CollectionReaderDescriptions instead of defining reader dimensions #353

Closed Horsmann closed 8 years ago

Horsmann commented 8 years ago

At the moment a user has to define two dimensions for a reader. In the backend, TC builds a collection reader description from this information.

For example:

dimReaders.put(DIM_READER_TRAIN, Reader.class);
dimReaders.put(DIM_READER_TRAIN_PARAMS, Arrays.asList(Reader.class,
        languageCode, Reader.PARAM_SOURCE_LOCATION, trainFolder,
        Reader.PARAM_POS_MAPPING_LOCATION, posMapping,
        Reader.PARAM_PATTERNS, "*.txt"));

It would be good if a user could just initialise the collection reader and pass it as a parameter to TC/Lab. This would make using TC more similar to using Core, at least in the way readers are specified.

Horsmann commented 8 years ago

OK. It is not urgent. Things are working, and you don't notice the issue unless you look into the text files.

Horsmann commented 8 years ago

@reckart Did you have time to think about this issue?

reckart commented 8 years ago

@Horsmann not really. But I'm happy to accept pull requests or even provide commit rights ;)

Horsmann commented 8 years ago

@reckart I think I will need a few more pointers to address this issue. Which class currently deals with the discriminables? I can't really find a spot where I could start. Any suggestions?

reckart commented 8 years ago

The method that turns an object into a string which is used in the discriminators file is this one: org.dkpro.lab.Util.toString(Object)

Horsmann commented 8 years ago

How is your idea with the conversionService supposed to work? Would all expected types be bound to a fixed mapping function (something that calls getDiscriminatorValue) in the initialization method of the Task(?), or is this something the user is supposed to do by hand?

As for timing, all the information has to be available in Lab by the time analyze(Class<?> aClazz, Class<? extends Annotation> aAnnotation, Map<String, String> props) is called. My current hack is called way too late. Roughly speaking, I would want to move the content of the analyze method somewhere into the initialization phase. Is that where it should happen?

reckart commented 8 years ago

(Trying to remember)... I think my idea was that the conversion service would be part of the Lab instance, not of a task, e.g.

Lab lab = Lab.getInstance();
lab.getConversionService().registerDiscriminable(...);

But instead of always having to go through the static Lab.getInstance(), within a task the conversion service should be obtained from the task context. Registering a conversion should probably happen in the same place where the Lab instance is initially obtained and where Lab.run(...) is called. The service would be created through the context.xml - just as the other services (e.g. lifecycleservice, etc.).

For a more sophisticated solution, I suppose it would be possible to instantiate a Spring Conversion service to handle that... or at least be inspired by how it works. Btw. that is what uimaFIT is currently using for parameter value conversion.
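The registry idea described above can be sketched with the JDK alone. This is only an illustration under assumptions: the class name ConversionServiceSketch, the signature of registerDiscriminable, and the method getDiscriminableValue are hypothetical and are not taken from DKPro Lab. The sketch maps a type to a function that produces a short string, and falls back to String.valueOf() when nothing is registered.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a conversion registry; not the DKPro Lab API.
class ConversionServiceSketch {
    // Registered converters, checked in registration order.
    private final Map<Class<?>, Function<Object, String>> converters = new LinkedHashMap<>();

    // Register a function that turns instances of 'type' into a short discriminator string.
    <T> void registerDiscriminable(Class<T> type, Function<? super T, String> converter) {
        converters.put(type, value -> converter.apply(type.cast(value)));
    }

    // Return the registered string form, or fall back to String.valueOf().
    String getDiscriminableValue(Object value) {
        if (value != null) {
            for (Map.Entry<Class<?>, Function<Object, String>> entry : converters.entrySet()) {
                if (entry.getKey().isInstance(value)) {
                    return entry.getValue().apply(value);
                }
            }
        }
        return String.valueOf(value);
    }

    public static void main(String[] args) {
        ConversionServiceSketch service = new ConversionServiceSketch();
        // StringBuilder stands in for a verbose object such as a reader description.
        service.registerDiscriminable(StringBuilder.class,
                sb -> "StringBuilder[length=" + sb.length() + "]");
        System.out.println(service.getDiscriminableValue(new StringBuilder("very long text")));
        System.out.println(service.getDiscriminableValue(42));
    }
}
```

In this sketch the registration would happen once, before any task runs, which matches the suggestion to register conversions where the Lab instance is first obtained.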

Horsmann commented 8 years ago

Do I understand it right that I would have to manually call this registerDiscriminable for every project/experiment? Essentially before I call Lab...run() I would have to know that I have to set those global-overrides for the train/test reader?!

This is even uglier than wrapping the CollectionReaderFactory with a TcCollectionReaderFactory which returns only CRDs where the DynamicProxy is already set. This would still allow a user to ignore the TcCollectionReaderFactory and use the normal CollectionReaderFactory methods, resulting in the status quo.

reckart commented 8 years ago

The discriminables would have to be registered before the first batch task runs. So I would see three options:

I agree that the last option is ugly, but IMHO it would be the first step on top of which one or both others could be implemented.

Horsmann commented 8 years ago

@reckart I am drafting such a service atm, defining it in the context.xml and loading it as a service as you suggested. As far as I understand things, accessing this conversionService then requires a TaskContext.

Lab does not have such a context at the point where I would need it to access the textual information. To fix the problem in this issue, I need access to this service in the method protected void analyze(Class<?> aClazz, Class<? extends Annotation> aAnnotation, Map<String, String> props), which is located in TaskBase.

How do I access such a service from within the TaskBase?

Horsmann commented 8 years ago

@reckart I tried to hack something and bumped into yet another problem. The discriminators are stored in a Map<String,String>, which becomes a problem for the CollectionReaderDescriptions. When I hack in my conversion, the endlessly verbose description text of the CRDs causes problems with some regex checks in ImportUtil, in the method matchConstraints(Map<String, String> aDiscriminators, Map<String, String> aConstraints, boolean aStrict):

name = org.dkpro.tc.api.type.TextClassificationTarget
supertypeName = uima.tcas.Annotation

}

vendor = NULL
version = NULL

vendor = DKPro Core Project
version = 1.9.0-SNAPSHOT

resourceManagerConfiguration = NULL
$
                                                                                                              ^
    at java.util.regex.Pattern.error(Pattern.java:1955)
    at java.util.regex.Pattern.closure(Pattern.java:3141)
    at java.util.regex.Pattern.sequence(Pattern.java:2134)
    at java.util.regex.Pattern.expr(Pattern.java:1996)
    at java.util.regex.Pattern.compile(Pattern.java:1696)
    at java.util.regex.Pattern.<init>(Pattern.java:1351)
    at java.util.regex.Pattern.compile(Pattern.java:1028)
    at java.util.regex.Pattern.matches(Pattern.java:1133)
    at org.dkpro.lab.engine.impl.ImportUtil.matchConstraints(ImportUtil.java:56)
    at org.dkpro.lab.storage.filesystem.FileSystemStorageService.getContexts(FileSystemStorageService.java:141)
    at org.dkpro.lab.engine.impl.BatchTaskEngine.getLatestExecution(BatchTaskEngine.java:297)
    at org.dkpro.lab.engine.impl.BatchTaskEngine.getExistingExecution(BatchTaskEngine.java:360)
    at org.dkpro.lab.engine.impl.BatchTaskEngine.executeConfiguration(BatchTaskEngine.java:223)
    at org.dkpro.lab.engine.impl.BatchTaskEngine.run(BatchTaskEngine.java:134)
    ... 4 more

A large part of the whole Lab constraint checking seems to be string based. Any thoughts on how to deal with that?
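The failure can be reproduced in isolation with the JDK. The sketch below assumes, based only on the stack trace (Pattern.matches calling Pattern.compile), that the description text ends up being compiled as a regular expression. The class RegexConstraintSketch and its methods are hypothetical, and whether Lab should quote such values or store a shorter string instead is left open.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical sketch; does not reproduce the actual ImportUtil code.
class RegexConstraintSketch {
    // True if 'value' can be compiled as a regular expression.
    static boolean isValidRegex(String value) {
        try {
            Pattern.compile(value);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Match 'actual' against 'expected' treated as a literal string, not as a regex.
    static boolean matchesLiterally(String expected, String actual) {
        return Pattern.matches(Pattern.quote(expected), actual);
    }

    public static void main(String[] args) {
        // Stand-in for the verbose description text; the unquoted '{' is not a valid quantifier.
        String description = "typeDescription = {\n  name = Example\n}";
        System.out.println("valid as regex: " + isValidRegex(description));
        System.out.println("literal match:  " + matchesLiterally(description, description));
    }
}
```

Quoting keeps the comparison string based but gives up regex semantics for that value. Converting the description to a short, regex-safe string before it reaches the discriminator map would avoid the problem at the source.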

Horsmann commented 8 years ago

@reckart I have issues with initializing the conversion service that I defined in the context.xml. I am missing an init step somewhere, but I can't figure out where. When I launch a CrossValidationExperiment, the ExperimentCrossValidation task is properly initialised with a non-null ConversionService. Once the first InitTask runs, the ConversionService is null, so it seems that the initialization is not performed for the subtasks. Where is that supposed to happen?

reckart commented 8 years ago

@Horsmann is https://github.com/dkpro/dkpro-tc/issues/353#issuecomment-246180878 still an issue for you?

Horsmann commented 8 years ago

No, this should be OK now.