dkpro / dkpro-tc

UIMA-based text classification framework built on top of DKPro Core and DKPro Lab.
https://dkpro.github.io/dkpro-tc/
Other
34 stars 19 forks source link

Unit classification mode: Integrate TextClassificationUnit.setSuffix() with externalD #251

Open daxenberger opened 9 years ago

daxenberger commented 9 years ago

In unit mode, the id2outcome report contains random IDs for units in documents with more than one unit. It would be much easier to interpret and analyze classification errors, if the IDs were interpretable. To make this happen, we should introduce a new feature to the TextClassificationUnit type which holds (optionally) stores a user-specified string. This field can be set together with the label for this unit during reading. In ClassificationUnitCasMultiplier, where the units are "converted" into instances/CASes, this fields should be used to set the document ID of the CAS (which is later used by the id2outcome report). If no user-specified IDs are set, the current approach (increasing index for multiple units in a document) can be applied instead.

zesch commented 9 years ago

We already have something similar in the form of the "suffix" field in the TextClassificationUnit type. This does not replace numerical ids, but as the name says is added as a suffix.

daxenberger commented 9 years ago

True. However, AFAIK this suffix is only explicitely used in a feature (ID_FEATURE_NAME) and is not applied within id2outcome report. I guess, the best option is to use this (string) value to append it to the document ID in ClassificationUnitCasMultiplier, rather than adding a new one. The IDs do not need to be numeric.

zesch commented 9 years ago

They need to be unique. When only using a user-defined name, we cannot rely on that.

daxenberger commented 9 years ago

We need to make sure that document IDs are unique as well. I think, there, we also rely on the user to set unique values, don't we? I guess uima(FIT) will complain if the IDs are not unique.

zesch commented 9 years ago

Yes, that is right. As document IDs are usually file names, the problem is less critical there as they need to be kind of unique already. As we have multiple units per document, there is some danger that we will get CASes with the same name here.

However, just pointing out the problem. Whether we want to take action or leave it to the user is another question.

reckart commented 9 years ago

We need to make sure that document IDs are unique as well. I think, there, we also rely on the user to set unique values, don't we? I guess uima(FIT) will complain if the IDs are not unique.

I don't see why uima(FIT) should complain about any IDs not being unique. From the perspective of these frameworks, I'd say the IDs you are speaking about are "user-defined" and oblique to the framework.

daxenberger commented 9 years ago

Within a single document, it is easy to ensure that user-specified IDs are unique by verifying in the ClassificationUnitCasMultiplier. If user-specified IDs are not unique within the same document, we can append numeric indices and output a warning.

reckart commented 9 years ago

Btw. external IDs are a very common request... a related issue is e.g. https://github.com/dkpro/dkpro-core/issues/609

daxenberger commented 9 years ago

The feature requested in https://github.com/dkpro/dkpro-core/issues/609 will be bound to sentences or tokens, right? In DKPro TC we need to bind it to units (which can be sentences, but also many other things). If there is a way to combine these requests and to make it a common feature, that would be good; but at the moment I have no idea how this could be done.

reckart commented 9 years ago

Well... we don't have a common base-class for all DKPro annotations and I personally would prefer not to introduce one (gut feeling). If we had such a base-class, we could introduce all kinds of common features there and they would be automatically inherited. But we'd risk inheriting such common features where they are not actually needed and there is no way of "removing" features in subclasses.

Unfortunately, UIMA currently doesn't allow adding "interfaces" to JCas classes. One could imagine to add an interface "HasExternalId" to certain types which would then have to define a "externalId" field along with getters and setters. Cf. https://issues.apache.org/jira/browse/UIMA-3354

So currently, I tend towards introducing a set of "helper" classes, e.g.

ExternalIdHelper.getExternalId(annotation); 
ExternalIdHelper.setExternalId(annotation, id);

If the annotation type defines a field "externalId", the methods would work, otherwise they would throw an exception.

Btw. the same "helper" class solution would allow removing customized code from the generated JCas classes, e.g. from DocumentMetaData. That in turn would allow us to remove the JCas classes from the repository and automatically generate all of them during build. Currently, most JCas classes in DKPro Core are generated during build, but some are not due to customizations.

reckart commented 9 years ago

Btw. there are some considerations towards making JCas more dynamic in future versions of UIMA:

https://cwiki.apache.org/confluence/display/UIMA/Ideas+for+UIMAJ+v3#IdeasforUIMAJv3-AutomaticJCasGenofmergedtypesystem

logological commented 9 years ago

Not only the instance IDs but the label IDs should be consistent from report to report. Is there already an issue for that?

daxenberger commented 9 years ago

So currently, I tend towards introducing a set of "helper" classes, e.g. ExternalIdHelper.getExternalId(annotation); ExternalIdHelper.setExternalId(annotation, id); If the annotation type defines a field "externalId", the methods would work, otherwise they would throw an exception.

Sounds good to me. That would require us to agree upon a common name for that field.

reckart commented 9 years ago

It would merely require that you don't object to using "externalId" ;)

reckart commented 9 years ago

Mind that "externalId" is the name I have in mind for DKPro Core. It implies that a single annotation type cannot have more than one external id. In case you store labelId and instanceId in a single type, a different DKPro TC solution would be required.

... or we could maintain multiple external IDs, e.g.

ExternalIdHelper.getExternalId(annotation); 
ExternalIdHelper.setExternalId(annotation, idNamespace, id);

But I'm not sure... it seems a bit of an overkill. I believe in the majority of cases, only one external ID would be required.

daxenberger commented 9 years ago

Not only the instance IDs but the label IDs should be consistent from report to report. Is there already an issue for that?

I'm not exactly sure whether I understand that request. units (TextClassificationUnit) and label (TextClassificationOutcome) annotations need to span across the same text. So it should be enough to have unique IDs with the units. Or do you have another application in mind?

logological commented 9 years ago

Currently id2outcome.txt uses numeric IDs for the classification outcomes, but (at least for cross-validation experiments) these IDs are not consistent from file to file. For example, BrownPosDemo does a two-fold cross-validation. One of the id2outcome.txt file uses the following mapping of numeric IDs to the original labels:

0=NPg 2=JJ 1=(null) 3=RB 5=TO 4=PPS 6=RP 7=NP 8=NN 10=VBN 9=VB 11=pct 12=PPO 13=BE 14=MD 15=DTS 16=VBZ 17=AT 18=IN 19=CS 20=VBG 21=VBD 22=BEDZ 23=NNS 24=CC 25=CD 26=AP 27=PPg

The other id2outcome.txt file uses a slightly different mapping:

0=NPg 2=(null) 1=JJ 3=RB 5=PPS 4=TO 6=RP 7=NP 8=NN 10=VB 9=VBN 11=pct 12=PPO 13=BE 14=MD 15=DTS 16=VBZ 17=AT 18=IN 19=CS 20=VBG 21=VBD 22=BEDZ 23=NNS 24=CC 25=CD 26=AP 27=PPg

In order to get the raw classifications, not only do I need to combine the two files, but I first have to manually un-map all the numeric IDs to their original labels.

It would be better if id2outcome.txt didn't use numeric IDs at all, but rather used the original label IDs. If for some reason the mapping to numeric IDs is necessary, it would be helpful if the mapping were consistent across files.

EmilyKJamison commented 9 years ago

@Tristan: Whoops, we have two different discussions here. User-set unit ids is a separate issue from maintaining continuity in numeric ids of classifications in a CV task.

@Johannes, CC:Tristan: Yes, TextClassificationUnit.setSuffix(String v) does show up in id2outcome, along with the auto-generated numeric unit id. AFAIK, the proposed issue is that it would still be desirable to remove the auto-generated unit id.

On Fri, Jul 10, 2015 at 12:21 PM, Tristan Miller notifications@github.com wrote:

Currently id2outcome.txt uses numeric IDs for the classification outcomes, but (at least for cross-validation experiments) these IDs are not consistent from file to file. For example, BrownPosDemo does a two-fold cross-validation. One of the id2outcome.txt file uses the following mapping of numeric IDs to the original labels:

0=NPg 2=JJ 1=(null) 3=RB 5=TO 4=PPS 6=RP 7=NP 8=NN 10=VBN 9=VB 11=pct 12=PPO 13=BE 14=MD 15=DTS 16=VBZ 17=AT 18=IN 19=CS 20=VBG 21=VBD 22=BEDZ 23=NNS 24=CC 25=CD 26=AP 27=PPg

The other id2outcome.txt file uses a slightly different mapping:

0=NPg 2=(null) 1=JJ 3=RB 5=PPS 4=TO 6=RP 7=NP 8=NN 10=VB 9=VBN 11=pct 12=PPO 13=BE 14=MD 15=DTS 16=VBZ 17=AT 18=IN 19=CS 20=VBG 21=VBD 22=BEDZ 23=NNS 24=CC 25=CD 26=AP 27=PPg

In order to get the raw classifications, not only do I need to combine the two files, but I first have to manually un-map all the numeric IDs to their original labels.

It would be better if id2outcome.txt didn't use numeric IDs at all, but rather used the original label IDs. If for some reason the mapping to numeric IDs is necessary, it would be helpful if the mapping were consistent across files.

— Reply to this email directly or view it on GitHub https://github.com/dkpro/dkpro-tc/issues/251#issuecomment-120355670.

logological commented 9 years ago

OK, I've opened a separate issue for the outcome classification labels (Issue 252).

daxenberger commented 9 years ago

Emily is right. These are two issues (the second to be tracked in Issue 252). TextClassificationUnit.setSuffix(String v) does indeed show up in the id2outcome report. I'll add a line to the BrownCorpusReader to show its usage and to make it effective in the BrownUnitPOSDemo. This should solve the original request.

I'm leaving this issue open to integrate it with DKPro's externalId.

Horsmann commented 8 years ago

integrate it with DKPro's externalId. What is left to do for that?

reckart commented 8 years ago

Core didn't implement an "externalId" mechanism yet.

Horsmann commented 6 years ago

Any updates on this one?

reckart commented 6 years ago

Nope.

Horsmann commented 6 years ago

Ok, then I close this for now. Once the demand for this kind of change becomes bigger a new issue can be opened.

reckart commented 6 years ago

Why not leave it open and push it to the next release? If it get's closed, its easier to miss past discussions on the topic.