dkpro / dkpro-tc

UIMA-based text classification framework built on top of DKPro Core and DKPro Lab.
https://dkpro.github.io/dkpro-tc/
Other
34 stars 19 forks source link

Enable passing CollectionReaderDescriptions instead of defining reader dimensions #353

Closed Horsmann closed 8 years ago

Horsmann commented 8 years ago

At the moment a user have to define 2 dimensions for a reader. Tc builds in the backend a collection reader description from those information.

i.e.

 dimReaders.put(DIM_READER_TRAIN, Reader.class);
        dimReaders.put(DIM_READER_TRAIN_PARAMS, Arrays.asList(Reader.class,
                languageCode, Reader.PARAM_SOURCE_LOCATION, trainFolder,
                Reader.PARAM_POS_MAPPING_LOCATION, posMapping,
                Reader.PARAM_PATTERNS, "*.txt"));

it would be good if a user could just initialise the collection reader and pass it as parameter to TC/Lab. This would make using TC more similar to using Core at least in the sense of specifying the readers.

Horsmann commented 8 years ago

Change is surprisingly easy just adopting the demos is tedious work ~

@reckart Do you have a code snippet how a collection reader description initialisation has to look like in Groovy?

reckart commented 8 years ago

Puh... the problem is how lab handles discriminator values.

If you do this "naively", then you'd probably in the best case end up storing the whole XML descriptor in a discriminator and thus in the properties file. In the worst case, you'd just store something like a object instance identifier like SomeReader@81def3ce which is not stable across runs. That's both not what you want.

So I was always suggesting to use Discriminables. Cf.

reckart commented 8 years ago

@Horsmann rel. to reader description in Groovy: the same as in Java, no?

Horsmann commented 8 years ago

Is it? I don't know I am not sure if I ever saw a Groovy program that uses DKPro components so I assumed there are some subtle differences? I have to adapt the Groovy test cases to using a the reader description - so any solution would work for me that make the Groovy tests pass again.

reckart commented 8 years ago

See e.g. https://dkpro.github.io/dkpro-core/groovy/recipes/convert-conll2006-to-conll2009/

reckart commented 8 years ago

@Horsmann for curiosity: what is actually stored in the DISCRIMINATORS.txt files for the readers now?

Horsmann commented 8 years ago

Hmm and endlessly long text string (72k characters) that makes scrolling in the document really slow. Is there a way to get a prettier human-redable text?

reckart commented 8 years ago

Well, for instance by naming the reader configurations using a Discriminable... cf. the classes I mentioned above.

If you are sure that the endlessly long string is at least reproducible and consistent (and not subject to potentially random variations, e.g. because iteration order in HashMaps is not defined), you could also consider extending Lab to allow storing a hash of the string instead of the string itself (maybe storing the string outside the DISCRIMINATORS.txt in a separate file for debugging/recording purposes.

reckart commented 8 years ago

That is a proper hash (SHA1 or better), not like Java hashCode() ;) I only have a very rough idea that this should work, it may take a bit working out the details.

Horsmann commented 8 years ago

So the easiest solution would be to make a class that implements Discriminable which gets in the Constructor a CollectionReaderDescription?

Is there an easy way to implement the CollectionReaderDescription it has like thousands methods? Ideally I wont want something like this:

Implement CRD and Discriminable and do:

 CollectionReaderDescriptionExtendedByDiscriminableInterface readerTrain = CollectionReaderFactory.createReaderDescription(
                NERDemoReader.class,
zesch commented 8 years ago

If I understand that correctly, you could subclass CRD and implement Discriminable there. You then only need to return a DiscriminatorValue that is unique and readable.

reckart commented 8 years ago

I think it should be possible using a dynamic proxy in a decorator pattern-like style. I used something similar for the DiscriminableClosure.

I guess it should be possible to implement a generic DiscriminableObject class. It would implement the Discriminable interface and would be used to proxy an instance of any other object that implements interfaces (e.g. CollectionReaderDescription). The result should be a dynamic proxy instance that is both: Discriminable and a CollectionReaderDescription.

reckart commented 8 years ago

Some maybe useful reference:

reckart commented 8 years ago

@zesch I don't think you'd want to try subclassing CRD - UIMA/uimaFIT creates these internally, and you'd want to leave it that way I am pretty sure.

Horsmann commented 8 years ago

@zesch CRD is just an interface you would have to implement it entirely.

@reckart ok thanks I take a look into it. As is I don't think it is much of an actual problem its more a matter of style ~ its pretty ugly.

zesch commented 8 years ago

ok, I probably should not comment on those issues anymore ;)

reckart commented 8 years ago

@Horsmann working with dynamic proxies is something I picked up by working with Spring / JPA / etc. It is not infrequent there.

reckart commented 8 years ago

I think you can make this somewhat less ugly by introducing a HashingDimension which applies this proxy internally to the values it produces (cf. ClosureDimension).

Horsmann commented 8 years ago

@reckart I added a dynamic proxy which seems to work.

Following issues which are no real problems either: I still want some human-readable infos to be written to the Discriminator.txt I tried to iterate the meta of a collection reader to get the names of the parameters to put them also into the string that is written to the text-file but I do not get the values I actually use:

When I create a reader like this:

DiscriminableReaderCollectionFactory.newInstance(BrownCorpusReader.class, BrownCorpusReader.PARAM_LANGUAGE, "en",
                BrownCorpusReader.PARAM_SOURCE_LOCATION, corpusFilePathTrain,
                BrownCorpusReader.PARAM_PATTERNS,
                asList(INCLUDE_PREFIX + "*.xml", INCLUDE_PREFIX + "*.xml.gz")

and print the reader meta data like this:

    ConfigurationParameterDeclarations configurationParameterDeclarations = crd.getCollectionReaderMetaData().getConfigurationParameterDeclarations();
        ConfigurationParameter[] configurationParameters = configurationParameterDeclarations.getConfigurationParameters();
        for(ConfigurationParameter p : configurationParameters){
            System.out.println(p.getName() + "/" + p.getAttributeValue(p.getName()));
        }

I get this output:

useCoarseGrained/null
readToken/null
readPOS/null
readLemma/null
readSentence/null
useXmlId/null
useFilenameId/null
omitIgnorableWhitespace/null
POSMappingLocation/null
POSTagSet/null
sourceLocation/null
patterns/null
useDefaultExcludes/null
includeHidden/null
language/null

any ideas why?

reckart commented 8 years ago

Try using org.apache.uima.fit.factory.ConfigurationParameterFactory.getParameterSettings(ResourceSpecifier).

reckart commented 8 years ago

I think the proxy approach should allow not having to change the type of the variables from CollectionReaderDescription to DiscriminableReaderDescription. The dynamic proxy should continue to exhibit the interfaces implement by the wrapped object - the idea was just to add another interface via the dynamic proxy, namely Discriminable.

Horsmann commented 8 years ago

@reckart That does work but I changed it in a way that you have to use the CollectionReaderDescription that has Discriminable i.e.. calling

  Object readerTrain = DiscriminableReaderCollectionFactory.createReaderDescription(
                TextReader.class, TextReader.PARAM_SOURCE_LOCATION, corpusFilePathTrain,
                TextReader.PARAM_LANGUAGE, LANGUAGE_CODE, TextReader.PARAM_PATTERNS,
                TextReader.INCLUDE_PREFIX + "*/*.txt");
        dimReaders.put(DIM_READER_TRAIN, readerTrain);
Horsmann commented 8 years ago

I didn't see much sense in making it optionally?

reckart commented 8 years ago

I was imagining something like this there the dimension handles all the wrapping and making things discriminable:

Dimension<CollectionReaderDescription> dimReader = new ReaderDimension(
   createReaderDescription(
                TextReader.class, TextReader.PARAM_SOURCE_LOCATION, corpusFilePathTrain,
                TextReader.PARAM_LANGUAGE, LANGUAGE_CODE, TextReader.PARAM_PATTERNS,
                TextReader.INCLUDE_PREFIX + "*/*.txt"),
   createReaderDescription(
                TextReader.class, TextReader.PARAM_SOURCE_LOCATION, corpusFilePathTrain,
                TextReader.PARAM_LANGUAGE, LANGUAGE_CODE, TextReader.PARAM_PATTERNS,
                TextReader.INCLUDE_PREFIX + "*/*.txt"),
   ...);
reckart commented 8 years ago

The way you presently do it is basically using encapsulation - for that case, you wouldn't even need a dynamic proxy I believe.

reckart commented 8 years ago

What I mean is that your present approach requires introducing and calling the getReaderDescription() method whereas I was suggesting to use a dynamic proxy which directly exhibits the CollectionReaderDescription interface such that you can call all the normal methods on it and in addition it would be an instance of Discriminable such DKPro Lab recognizes it.

Horsmann commented 8 years ago

I had that - but this makes using the "Discriminable" Collection Reader Description optional - you could still throw in a normal CollectionReaderDescription and it would work. I thought this shouldn't be possible. The whole effort was made to get a readable string in the Discriminator.txt?

reckart commented 8 years ago

If you want the compiler to ensure that the reader is discriminable, then your approach is certainly better. I approached this thinking from another direction where I have some object and want to give it a name without changing all the code that actually uses this object.

Horsmann commented 8 years ago

And I am not sure if I understand how to implement the ReaderDimension you suggested ... I inherit a class from Dimension and my proxy catches this dimension then i.e. every time Dimension is used?

reckart commented 8 years ago

The way I did it in ClosureDimension is to wrap the objects being passed to the dimension internally in the proxy, such that the objects returned from the dimension would all be both, Closures and Discriminable. I did it in the constructor, but it would also be possible to do it e.g. in next().

    /**
     * Create a new dimension of closures. These closures directly injected into the parameter
     * space - they are not called (cf. {@link ClosureDimension#ClosureDimension(String, Closure...)}).
     * The key associated with each closure is the value used when the closure used as a 
     * discriminator.
     */
    public ClosureDimension(String aName, Map<String, Closure> aClosures)
    {
        super(aName);
        closures = new Closure[aClosures.size()];
        int i = 0;
        for (Entry<String, Closure> e : aClosures.entrySet()) {
            closures[i] = new DiscriminableClosure(e.getValue(),e.getKey());
            i++;
        }
    }
reckart commented 8 years ago

In your case, you'd not use a map as parameter, but instead generate the key from the reader description (in the way that you have already implemented).

Horsmann commented 8 years ago

I still don't get how to implement this ReaderDimensions(). A dimension has a name that is passed to its super class where are the parameters coming from?

reckart commented 8 years ago

Consider something like

public ReaderDimension(String aName, CollectionReaderDescription... aReaders)
    {
        super(aName);
        readers = new CollectionReaderDescription[aReaders()];
        int i = 0;
        for (CollectionReaderDescription r : aReaders) {
            readers[i] = Proxy.newProxyInstance(classloader, new Class[] { CollectionReaderDescription, Discriminable }, new ReaderInvokationHandler(aReader) );
            i++;
        }
    }
reckart commented 8 years ago

Updated last comment to use an actual proper proxy...

Horsmann commented 8 years ago

Thanks. I still don't understand why you would treat all ReaderDescription together?

I still have to assign the readers to a dimension i.e. TRAIN_READER or TEST_READER

So I would want to do something like:

 ReaderDimension trainDim = new ReaderDimension(DIM_READER_TRAIN, CollectionReaderFactory.createReaderDescription(BrownCorpusReader.class, BrownCorpusReader.PARAM_LANGUAGE, "en",
                BrownCorpusReader.PARAM_SOURCE_LOCATION, corpusFilePathTrain,
                BrownCorpusReader.PARAM_PATTERNS,
                asList(INCLUDE_PREFIX + "*.xml", INCLUDE_PREFIX + "*.xml.gz")));

But this seems a bit redundantly because I now still would have to explicitly put it into the dimension map with the DIM_READER_TRAIN key

dimReaders.put(DIM_READER_TRAIN, trainDim);

For this to make really sense and be different from what I have now I would want the InitTask to fail if you put in an ordinary reader.

i.e. that what I have right now is essential the same, no?:

   Object newInstance = DiscriminableReaderCollectionFactory.createReaderDescription(BrownCorpusReader.class, BrownCorpusReader.PARAM_LANGUAGE, "en",
                BrownCorpusReader.PARAM_SOURCE_LOCATION, corpusFilePathTrain,
                BrownCorpusReader.PARAM_PATTERNS,
                asList(INCLUDE_PREFIX + "*.xml", INCLUDE_PREFIX + "*.xml.gz"));
        dimReaders.put(DIM_READER_TRAIN, newInstance);
reckart commented 8 years ago

I may be thinking about this from a slightly wrong angle in terms of how it fits into DKPro TC because I don't see this atm in a concrete example/experiment.

Horsmann commented 8 years ago

oh ok.

In a scenario where you have to provide train/test as separate instances: Is there a way to guarantee that a collectionReaderDesc is discriminable without enforcing to use a new data type? The way how it is now makes it optional, but causes no real problems either unless you are concerned about the DISCRIMINATOR.txt

reckart commented 8 years ago

Without looking at how TC presently works, I would be assuming/thinking/imagining the following:

zesch commented 8 years ago

We definitely need two dimensions as train and test might be totally different corpora. Richard Eckart de Castilho notifications@github.com schrieb am Sa., 14. Mai 2016 um 10:19:

Without looking at how TC presently works, I would be assuming/thinking/imagining the following:

  • either I supply one reader dimension (e.g. DIM_READER) and TC internally splits the data into train and test without me intervening at all
  • or there would be two different dimensions that I need to supply, e.g. DIM_READER_TRAIN and DIM_READER_TEST and I would set them independently. In this case, if I wanted to run my experiment on multiple data sets and always would have to set fixed pairs of train/test readers, then I would use a DimensionBundle to group them together.

— You are receiving this because you were mentioned.

Reply to this email directly or view it on GitHub https://github.com/dkpro/dkpro-tc/issues/353#issuecomment-219207831

Horsmann commented 8 years ago

What we could do to enforce a Discriminable CRD is to let the factory DiscriminableReaderCollectionFactory.createReaderDescription return a Discriminable. The InitTask would then expect a Discriminable too but would cast it to a CRD which of course only works if the DiscriminableReaderCollectionFactory is used.

        Discriminable readerTrain = DiscriminableReaderCollectionFactory.createReaderDescription(
                ReutersCorpusReader.class, ReutersCorpusReader.PARAM_SOURCE_LOCATION,
                FILEPATH_TRAIN, ReutersCorpusReader.PARAM_GOLD_LABEL_FILE, FILEPATH_GOLD_LABELS,
                ReutersCorpusReader.PARAM_LANGUAGE, LANGUAGE_CODE,
                ReutersCorpusReader.PARAM_PATTERNS, ReutersCorpusReader.INCLUDE_PREFIX + "*.txt");
        dimReaders.put(DIM_READER_TRAIN, readerTrain);
////////////////////////
public class InitTask
    extends UimaTaskBase
{

    @Discriminator(name=DIM_READER_TRAIN)
    protected Discriminable readerTrain;
    @Discriminator(name=DIM_READER_TEST)
    protected Discriminable readerTest;
...
    readerDesc = (CollectionReaderDescription) readerTrain;
}

This would ensure that you have a discriminable reader and it is not working to throw in a normal CollectionReaderDescription.

I think this is a bit more elegant than what I had before.

@zesch @daxenberger opinions on that?

reckart commented 8 years ago

I would personally prefer not bothering the user with having to use a special factory that creates discriminables, rather ensure that DKPro TC internally wraps reader descriptions accordingly whenever the user supplies non-discriminable descriptions. I feel this should be an internal implementation detail.

reckart commented 8 years ago

Btw. it may even be possible to not fiddle around with the dimensions at all and let the task do the wrapping.

If I remember correctly, then DKPro Lab does

  1. configure the tasks with the current parameter space values
  2. then ask the task to persist the values in the DISCRIMINATOR.txt - in this step, the task inspects itself to obtain the values using getters

I think it should be possible to do a transparent wrapping by implementing/overriding a getReader() method in the respective task and return a discriminable reader at that point.

Didn't really test it though.

Cf.

Horsmann commented 8 years ago

Okay I thought too complicated it is quite easy to do it hidden in the initTask as you said - thx.

Horsmann commented 8 years ago

@reckart The DISCRIMINATOR.txt still has an extremely long string i.e. not the value my Discriminable interface returns. I currently wrap the reader descriptions in the InitTask this might be a bit too late? When the DISCRIMINATOR.txt written? Also, the Excel files create contain the Evaluationresults also get this endless long text.

It looks like the right place would be somewhere in the Lab code. I think all I can do is going back to defining a new type and force the user to use this type right from the start?

reckart commented 8 years ago

Is there some unit test where this can easily be observed/debugged?

Horsmann commented 8 years ago

Any demo should show that problem

org.dkpro.tc.examples.single.sequence.CRFSuiteBrownPosDemo

reckart commented 8 years ago

I tried an override of the initialize(TaskContext) method in InitTask and did the wrapping there. This works in the sense that the discriminable value is stored properly in the DISCRIMINATOR.txt file, but then the imports break because in the next task that imports from the InitTask, the descriptor is again not wrapped.

This quite effectively shows that values need already be discriminable within the parameter space, e.g. through a dimension that wraps the values.

I think the next thing I would go for is to extend DKPro Lab to support registering global discriminables, e.g. something like

pSpace.registerDiscriminable(
    CollectionReaderDescription.class, desc -> makeDiscriminable(desc));

or creating a new "conversionService" and using something like

taskContext.getConversionService().registerDiscriminable(
    CollectionReaderDescription.class, desc -> makeDiscriminable(desc));

... not sure what would be the best location to put this.

reckart commented 8 years ago

Allowing to register special discriminator handlers for specific classes in a suitable place in DKPro Lab could even entirely remove the need to have them implement the Discriminable interface. The Lab could know if they are discriminable by looking in its internal registry of discriminables instead and looking up a proper naming function instead of using the Discriminable interface methods.

Horsmann commented 8 years ago

I think I like the registerDiscriminable() :-) Would it take much time to implement that?

reckart commented 8 years ago

It would probably take a few attempts or at least some thinking it through to consider all the requiremnts, find a suiteable place to implement this, and then ensure all the necessary information is available at that location.

I would personally tend towards registering the discriminables in the Lab instance through the TaskContext and then using that information again in Task.analyze(),