Closed Horsmann closed 8 years ago
Change is surprisingly easy just adopting the demos is tedious work ~
@reckart Do you have a code snippet how a collection reader description initialisation has to look like in Groovy?
Puh... the problem is how lab handles discriminator values.
If you do this "naively", then you'd probably in the best case end up storing the whole XML descriptor in a discriminator and thus in the properties file. In the worst case, you'd just store something like a object instance identifier like SomeReader@81def3ce
which is not stable across runs. That's both not what you want.
So I was always suggesting to use Discriminable
s. Cf.
@Horsmann rel. to reader description in Groovy: the same as in Java, no?
Is it? I don't know I am not sure if I ever saw a Groovy program that uses DKPro components so I assumed there are some subtle differences? I have to adapt the Groovy test cases to using a the reader description - so any solution would work for me that make the Groovy tests pass again.
@Horsmann for curiosity: what is actually stored in the DISCRIMINATORS.txt
files for the readers now?
Hmm and endlessly long text string (72k characters) that makes scrolling in the document really slow. Is there a way to get a prettier human-redable text?
Well, for instance by naming the reader configurations using a Discriminable... cf. the classes I mentioned above.
If you are sure that the endlessly long string is at least reproducible and consistent (and not subject to potentially random variations, e.g. because iteration order in HashMaps is not defined), you could also consider extending Lab to allow storing a hash of the string instead of the string itself (maybe storing the string outside the DISCRIMINATORS.txt
in a separate file for debugging/recording purposes.
That is a proper hash (SHA1 or better), not like Java hashCode() ;) I only have a very rough idea that this should work, it may take a bit working out the details.
So the easiest solution would be to make a class that implements Discriminable
which gets in the Constructor a CollectionReaderDescription
?
Is there an easy way to implement the CollectionReaderDescription
it has like thousands methods?
Ideally I wont want something like this:
Implement CRD
and Discriminable
and do:
CollectionReaderDescriptionExtendedByDiscriminableInterface readerTrain = CollectionReaderFactory.createReaderDescription(
NERDemoReader.class,
If I understand that correctly, you could subclass CRD and implement Discriminable there. You then only need to return a DiscriminatorValue that is unique and readable.
I think it should be possible using a dynamic proxy in a decorator pattern-like style. I used something similar for the DiscriminableClosure.
I guess it should be possible to implement a generic DiscriminableObject class. It would implement the Discriminable interface and would be used to proxy an instance of any other object that implements interfaces (e.g. CollectionReaderDescription). The result should be a dynamic proxy instance that is both: Discriminable and a CollectionReaderDescription.
@zesch I don't think you'd want to try subclassing CRD - UIMA/uimaFIT creates these internally, and you'd want to leave it that way I am pretty sure.
@zesch CRD is just an interface you would have to implement it entirely.
@reckart ok thanks I take a look into it. As is I don't think it is much of an actual problem its more a matter of style ~ its pretty ugly.
ok, I probably should not comment on those issues anymore ;)
@Horsmann working with dynamic proxies is something I picked up by working with Spring / JPA / etc. It is not infrequent there.
I think you can make this somewhat less ugly by introducing a HashingDimension
which applies this proxy internally to the values it produces (cf. ClosureDimension).
@reckart I added a dynamic proxy which seems to work.
Following issues which are no real problems either: I still want some human-readable infos to be written to the Discriminator.txt I tried to iterate the meta of a collection reader to get the names of the parameters to put them also into the string that is written to the text-file but I do not get the values I actually use:
When I create a reader like this:
DiscriminableReaderCollectionFactory.newInstance(BrownCorpusReader.class, BrownCorpusReader.PARAM_LANGUAGE, "en",
BrownCorpusReader.PARAM_SOURCE_LOCATION, corpusFilePathTrain,
BrownCorpusReader.PARAM_PATTERNS,
asList(INCLUDE_PREFIX + "*.xml", INCLUDE_PREFIX + "*.xml.gz")
and print the reader meta data like this:
ConfigurationParameterDeclarations configurationParameterDeclarations = crd.getCollectionReaderMetaData().getConfigurationParameterDeclarations();
ConfigurationParameter[] configurationParameters = configurationParameterDeclarations.getConfigurationParameters();
for(ConfigurationParameter p : configurationParameters){
System.out.println(p.getName() + "/" + p.getAttributeValue(p.getName()));
}
I get this output:
useCoarseGrained/null
readToken/null
readPOS/null
readLemma/null
readSentence/null
useXmlId/null
useFilenameId/null
omitIgnorableWhitespace/null
POSMappingLocation/null
POSTagSet/null
sourceLocation/null
patterns/null
useDefaultExcludes/null
includeHidden/null
language/null
any ideas why?
Try using org.apache.uima.fit.factory.ConfigurationParameterFactory.getParameterSettings(ResourceSpecifier)
.
I think the proxy approach should allow not having to change the type of the variables from CollectionReaderDescription to DiscriminableReaderDescription. The dynamic proxy should continue to exhibit the interfaces implement by the wrapped object - the idea was just to add another interface via the dynamic proxy, namely Discriminable
.
@reckart That does work but I changed it in a way that you have to use the CollectionReaderDescription that has Discriminable
i.e.. calling
Object readerTrain = DiscriminableReaderCollectionFactory.createReaderDescription(
TextReader.class, TextReader.PARAM_SOURCE_LOCATION, corpusFilePathTrain,
TextReader.PARAM_LANGUAGE, LANGUAGE_CODE, TextReader.PARAM_PATTERNS,
TextReader.INCLUDE_PREFIX + "*/*.txt");
dimReaders.put(DIM_READER_TRAIN, readerTrain);
I didn't see much sense in making it optionally?
I was imagining something like this there the dimension handles all the wrapping and making things discriminable:
Dimension<CollectionReaderDescription> dimReader = new ReaderDimension(
createReaderDescription(
TextReader.class, TextReader.PARAM_SOURCE_LOCATION, corpusFilePathTrain,
TextReader.PARAM_LANGUAGE, LANGUAGE_CODE, TextReader.PARAM_PATTERNS,
TextReader.INCLUDE_PREFIX + "*/*.txt"),
createReaderDescription(
TextReader.class, TextReader.PARAM_SOURCE_LOCATION, corpusFilePathTrain,
TextReader.PARAM_LANGUAGE, LANGUAGE_CODE, TextReader.PARAM_PATTERNS,
TextReader.INCLUDE_PREFIX + "*/*.txt"),
...);
The way you presently do it is basically using encapsulation - for that case, you wouldn't even need a dynamic proxy I believe.
What I mean is that your present approach requires introducing and calling the getReaderDescription() method whereas I was suggesting to use a dynamic proxy which directly exhibits the CollectionReaderDescription interface such that you can call all the normal methods on it and in addition it would be an instance of Discriminable such DKPro Lab recognizes it.
I had that - but this makes using the "Discriminable" Collection Reader Description optional - you could still throw in a normal CollectionReaderDescription and it would work. I thought this shouldn't be possible. The whole effort was made to get a readable string in the Discriminator.txt?
If you want the compiler to ensure that the reader is discriminable, then your approach is certainly better. I approached this thinking from another direction where I have some object and want to give it a name without changing all the code that actually uses this object.
And I am not sure if I understand how to implement the ReaderDimension
you suggested ... I inherit a class from Dimension and my proxy catches this dimension then i.e. every time Dimension is used?
The way I did it in ClosureDimension is to wrap the objects being passed to the dimension internally in the proxy, such that the objects returned from the dimension would all be both, Closures and Discriminable. I did it in the constructor, but it would also be possible to do it e.g. in next().
/**
* Create a new dimension of closures. These closures directly injected into the parameter
* space - they are not called (cf. {@link ClosureDimension#ClosureDimension(String, Closure...)}).
* The key associated with each closure is the value used when the closure used as a
* discriminator.
*/
public ClosureDimension(String aName, Map<String, Closure> aClosures)
{
super(aName);
closures = new Closure[aClosures.size()];
int i = 0;
for (Entry<String, Closure> e : aClosures.entrySet()) {
closures[i] = new DiscriminableClosure(e.getValue(),e.getKey());
i++;
}
}
In your case, you'd not use a map as parameter, but instead generate the key from the reader description (in the way that you have already implemented).
I still don't get how to implement this ReaderDimensions(). A dimension has a name that is passed to its super class where are the parameters coming from?
Consider something like
public ReaderDimension(String aName, CollectionReaderDescription... aReaders)
{
super(aName);
readers = new CollectionReaderDescription[aReaders()];
int i = 0;
for (CollectionReaderDescription r : aReaders) {
readers[i] = Proxy.newProxyInstance(classloader, new Class[] { CollectionReaderDescription, Discriminable }, new ReaderInvokationHandler(aReader) );
i++;
}
}
Updated last comment to use an actual proper proxy...
Thanks. I still don't understand why you would treat all ReaderDescription together?
I still have to assign the readers to a dimension i.e. TRAIN_READER or TEST_READER
So I would want to do something like:
ReaderDimension trainDim = new ReaderDimension(DIM_READER_TRAIN, CollectionReaderFactory.createReaderDescription(BrownCorpusReader.class, BrownCorpusReader.PARAM_LANGUAGE, "en",
BrownCorpusReader.PARAM_SOURCE_LOCATION, corpusFilePathTrain,
BrownCorpusReader.PARAM_PATTERNS,
asList(INCLUDE_PREFIX + "*.xml", INCLUDE_PREFIX + "*.xml.gz")));
But this seems a bit redundantly because I now still would have to explicitly put it into the dimension map with the DIM_READER_TRAIN
key
dimReaders.put(DIM_READER_TRAIN, trainDim);
For this to make really sense and be different from what I have now I would want the InitTask
to fail if you put in an ordinary reader.
i.e. that what I have right now is essential the same, no?:
Object newInstance = DiscriminableReaderCollectionFactory.createReaderDescription(BrownCorpusReader.class, BrownCorpusReader.PARAM_LANGUAGE, "en",
BrownCorpusReader.PARAM_SOURCE_LOCATION, corpusFilePathTrain,
BrownCorpusReader.PARAM_PATTERNS,
asList(INCLUDE_PREFIX + "*.xml", INCLUDE_PREFIX + "*.xml.gz"));
dimReaders.put(DIM_READER_TRAIN, newInstance);
I may be thinking about this from a slightly wrong angle in terms of how it fits into DKPro TC because I don't see this atm in a concrete example/experiment.
oh ok.
In a scenario where you have to provide train/test as separate instances: Is there a way to guarantee that a collectionReaderDesc
is discriminable without enforcing to use a new data type? The way how it is now makes it optional, but causes no real problems either unless you are concerned about the DISCRIMINATOR.txt
Without looking at how TC presently works, I would be assuming/thinking/imagining the following:
We definitely need two dimensions as train and test might be totally different corpora. Richard Eckart de Castilho notifications@github.com schrieb am Sa., 14. Mai 2016 um 10:19:
Without looking at how TC presently works, I would be assuming/thinking/imagining the following:
- either I supply one reader dimension (e.g. DIM_READER) and TC internally splits the data into train and test without me intervening at all
- or there would be two different dimensions that I need to supply, e.g. DIM_READER_TRAIN and DIM_READER_TEST and I would set them independently. In this case, if I wanted to run my experiment on multiple data sets and always would have to set fixed pairs of train/test readers, then I would use a DimensionBundle to group them together.
— You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub https://github.com/dkpro/dkpro-tc/issues/353#issuecomment-219207831
What we could do to enforce a Discriminable CRD
is to let the factory DiscriminableReaderCollectionFactory.createReaderDescription
return a Discriminable
.
The InitTask
would then expect a Discriminable
too but would cast it to a CRD
which of course only works if the DiscriminableReaderCollectionFactory
is used.
Discriminable readerTrain = DiscriminableReaderCollectionFactory.createReaderDescription(
ReutersCorpusReader.class, ReutersCorpusReader.PARAM_SOURCE_LOCATION,
FILEPATH_TRAIN, ReutersCorpusReader.PARAM_GOLD_LABEL_FILE, FILEPATH_GOLD_LABELS,
ReutersCorpusReader.PARAM_LANGUAGE, LANGUAGE_CODE,
ReutersCorpusReader.PARAM_PATTERNS, ReutersCorpusReader.INCLUDE_PREFIX + "*.txt");
dimReaders.put(DIM_READER_TRAIN, readerTrain);
////////////////////////
public class InitTask
extends UimaTaskBase
{
@Discriminator(name=DIM_READER_TRAIN)
protected Discriminable readerTrain;
@Discriminator(name=DIM_READER_TEST)
protected Discriminable readerTest;
...
readerDesc = (CollectionReaderDescription) readerTrain;
}
This would ensure that you have a discriminable reader and it is not working to throw in a normal CollectionReaderDescription.
I think this is a bit more elegant than what I had before.
@zesch @daxenberger opinions on that?
I would personally prefer not bothering the user with having to use a special factory that creates discriminables, rather ensure that DKPro TC internally wraps reader descriptions accordingly whenever the user supplies non-discriminable descriptions. I feel this should be an internal implementation detail.
Btw. it may even be possible to not fiddle around with the dimensions at all and let the task do the wrapping.
If I remember correctly, then DKPro Lab does
I think it should be possible to do a transparent wrapping by implementing/overriding a getReader()
method in the respective task and return a discriminable reader at that point.
Didn't really test it though.
Cf.
org.dkpro.lab.engine.impl.DefaultLifeCycleManager.initialize(TaskContext, Task)
org.dkpro.lab.task.impl.TaskBase.analyze(Class<?>, Class<? extends Annotation>, Map<String, String>)
Okay I thought too complicated it is quite easy to do it hidden in the initTask as you said - thx.
@reckart The DISCRIMINATOR.txt
still has an extremely long string i.e. not the value my Discriminable
interface returns. I currently wrap the reader descriptions in the InitTask
this might be a bit too late? When the DISCRIMINATOR.txt
written? Also, the Excel files create contain the Evaluationresults also get this endless long text.
It looks like the right place would be somewhere in the Lab code. I think all I can do is going back to defining a new type and force the user to use this type right from the start?
Is there some unit test where this can easily be observed/debugged?
Any demo should show that problem
org.dkpro.tc.examples.single.sequence.CRFSuiteBrownPosDemo
I tried an override of the initialize(TaskContext) method in InitTask and did the wrapping there. This works in the sense that the discriminable value is stored properly in the DISCRIMINATOR.txt file, but then the imports break because in the next task that imports from the InitTask, the descriptor is again not wrapped.
This quite effectively shows that values need already be discriminable within the parameter space, e.g. through a dimension that wraps the values.
I think the next thing I would go for is to extend DKPro Lab to support registering global discriminables, e.g. something like
pSpace.registerDiscriminable(
CollectionReaderDescription.class, desc -> makeDiscriminable(desc));
or creating a new "conversionService" and using something like
taskContext.getConversionService().registerDiscriminable(
CollectionReaderDescription.class, desc -> makeDiscriminable(desc));
... not sure what would be the best location to put this.
Allowing to register special discriminator handlers for specific classes in a suitable place in DKPro Lab could even entirely remove the need to have them implement the Discriminable interface. The Lab could know if they are discriminable by looking in its internal registry of discriminables instead and looking up a proper naming function instead of using the Discriminable interface methods.
I think I like the registerDiscriminable()
:-)
Would it take much time to implement that?
It would probably take a few attempts or at least some thinking it through to consider all the requiremnts, find a suiteable place to implement this, and then ensure all the necessary information is available at that location.
I would personally tend towards registering the discriminables in the Lab instance through the TaskContext and then using that information again in Task.analyze(),
At the moment a user have to define 2 dimensions for a reader. Tc builds in the backend a collection reader description from those information.
i.e.
it would be good if a user could just initialise the collection reader and pass it as parameter to TC/Lab. This would make using TC more similar to using Core at least in the sense of specifying the readers.