Preserve information about tagset

reckart commented 9 years ago

There is no way to store in the DKPro type system which tagset is used in a particular
layer, e.g. that the part-of-speech layer uses the "STTS" tagset. 

If we have a writer for a data format which can make use of this information, we currently
have to introduce special writer parameters or find some other way of passing this
information to the writer. It would be better if the writer could just read this information
from the CAS.

One option would be to introduce a new type "TagSet" which has a "name" (e.g. STTS)
and a "layer" (e.g. POS). We could also consider introducing a type "Tag" which
contains the individual tags within the tagset.

Any component which creates annotations with tag values should then create such
an annotation in the CAS and populate it.

Original issue reported on code.google.com by richard.eckart on 2013-06-25 10:45:51

reckart commented 9 years ago

Hi,
this sounds very useful and important. Could such a type be used for tagging text with
UBY "tags"?
E.g., with the "TagSet" type version, that would be something like "name" = ubySemanticTag
and "layer" = semantics.

Best
Judith

Original issue reported on code.google.com by eckle.kohler on 2013-06-26 10:20:14

reckart commented 9 years ago

What I suggest is to preserve information about the tagset, its name, layer, and its
tags. This is meta information which doesn't actually tag anything. 

For example, consider running the OpenNLP tagger with a model for German. It creates
annotations of the type POS (or subtypes) which carry a feature "posValue". What actual
values can "posValue" assume and from which inventory do they come? To record this
information, one single "TagSet" feature structure (not annotation) could be added
to the CAS such as:

TagSet {
  name: "STTS"
  layer: "de.tudarmstadt.ukp.dkpro.core.api.lexmorph.POS"
  tags: { "NNP", "NN", "ADJ", ... }
}

None of this information refers directly to the text. The text, however, is annotated
with the POS annotations, e.g.

de.tudarmstadt.ukp.dkpro.core.api.lexmorph.POS {
  begin: 10
  end: 15
  posValue: "NNP"
}

What does "NNP" mean? I could look up the TagSet feature structures, search for the
one that applies to the POS layer (de.tudarmstadt.ukp.dkpro.core.api.lexmorph.POS or
any subtypes) and see that "NNP" belongs to the "STTS" tagset. One could also imagine
adding a link to some external normative resource, e.g. ISOCat. For the moment, that's
beyond my use case, though.
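
The lookup described above can be sketched in plain Java. This is only an illustration under assumptions: `TagSetInfo` and `TagSetRegistry` are hypothetical stand-ins for the proposed "TagSet" feature structure and for the part of the CAS that would hold it, not existing DKPro Core or UIMA classes, and subtype resolution is left out (the layer is matched by exact name).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical stand-in for the proposed "TagSet" feature structure.
final class TagSetInfo {
    final String name;      // tagset name, e.g. "STTS"
    final String layer;     // fully qualified type name of the layer it applies to
    final Set<String> tags; // the closed inventory of tag values

    TagSetInfo(String name, String layer, String... tags) {
        this.name = name;
        this.layer = layer;
        this.tags = new LinkedHashSet<>(Arrays.asList(tags));
    }
}

// Hypothetical stand-in for the place in the CAS where TagSet records live.
final class TagSetRegistry {
    private final List<TagSetInfo> tagSets = new ArrayList<>();

    void add(TagSetInfo tagSet) {
        tagSets.add(tagSet);
    }

    // Returns the first tagset recorded for exactly this layer name.
    // A real implementation would also have to consider subtypes of the layer.
    Optional<TagSetInfo> forLayer(String layer) {
        return tagSets.stream().filter(t -> t.layer.equals(layer)).findFirst();
    }
}

class TagSetLookupDemo {
    public static void main(String[] args) {
        String posLayer = "de.tudarmstadt.ukp.dkpro.core.api.lexmorph.POS";
        TagSetRegistry registry = new TagSetRegistry();
        // Values taken from the example in this comment.
        registry.add(new TagSetInfo("STTS", posLayer, "NNP", "NN", "ADJ"));

        String posValue = "NNP";
        registry.forLayer(posLayer)
                .filter(t -> t.tags.contains(posValue))
                .ifPresent(t -> System.out.println(posValue + " belongs to " + t.name));
    }
}
```

The tagset record is kept separate from the POS annotations on purpose: it carries no begin/end offsets, which matches the point that it is meta information and does not tag any text.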

I imagine this can be used for components such as POS taggers or parsers, but e.g.
not for lemmatizers or stemmers, because these do not have the notion of a closed controlled
vocabulary. Or if they have, the vocabulary may be very large and it would be inconvenient
to fully record it in the CAS.

If there were an annotator which used Uby to create annotations, I could imagine
that this annotator could also add TagSet information to the CAS, informing the user/downstream
components which controlled vocabulary the "tags" come from. I fear, though, that a
user of Uby may be looking for either something way more sophisticated than what I
suggest here, e.g. recording full lexical entries in the CAS, or something simpler, e.g. recording
a link to a Uby lexical entry in the CAS.

Original issue reported on code.google.com by richard.eckart on 2013-06-26 10:37:02

reckart commented 9 years ago

>> I fear, though, that a user of Uby may be looking for either something way more sophisticated
than what I suggest here, e.g. recording full lexical entries in the CAS, or something simpler,
e.g. recording a link to a Uby lexical entry in the CAS.

In many applications, a user might not be interested in such complex information from
Uby. So your new type might actually be useful for semantic tagging with Uby.
We should discuss it F2F, because I have collected some more ideas on that.

Original issue reported on code.google.com by eckle.kohler on 2013-06-26 10:44:22

reckart commented 9 years ago

My primary use case right now would be to write this tagset information in a writer.

In the TcfWriter from WebAnno, the tagset names are currently hard-coded, which is
bad. I would like to avoid having to add parameters for the tagset names and instead
read them from the CAS.

The tagset information could also be used by other writers. E.g., the Negra export format
supports tagset definitions. We do not have a NegraExportWriter yet. We do have a NegraExportReader,
though, which could actually read tagset information from Negra files and record it
in the CAS.

Another conceivable use case would be to have components validate their compatibility
at runtime. We noted that the Penn Tagset used by the TreeTagger model for English
is not the same as the one expected by the Stanford parser. This caused problems when
we used the TreeTagger to create POS tags and then used the StanfordParser only to
create the constituency structure, based on the TreeTagger POS tags. If the TreeTagger
component recorded the tagset in the CAS, the StanfordParser could look at this information
and issue a warning or error if the tagset does not correspond to the ones expected
by the parser.
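
A runtime check of this kind could look roughly like the sketch below. It is only an assumption of how such a check might be structured: `TagsetCompatibility` is a hypothetical helper, not part of DKPro Core, and the tagset identifiers "ptb" and "ptb-treetagger" are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; none of these names exist in DKPro Core.
final class TagsetCompatibility {
    enum Result { COMPATIBLE, UNKNOWN, MISMATCH }

    // recordedTagset is the name an upstream component stored in the CAS,
    // or null if no tagset information was recorded at all.
    static Result check(String recordedTagset, Set<String> expectedTagsets) {
        if (recordedTagset == null) {
            return Result.UNKNOWN; // nothing recorded, so nothing can be verified
        }
        return expectedTagsets.contains(recordedTagset)
                ? Result.COMPATIBLE
                : Result.MISMATCH;
    }
}

class TagsetCompatibilityDemo {
    public static void main(String[] args) {
        // Made-up identifiers: what the parser expects vs. what the tagger recorded.
        Set<String> expectedByParser = new HashSet<>(Arrays.asList("ptb"));
        String recordedByTagger = "ptb-treetagger";

        TagsetCompatibility.Result result =
                TagsetCompatibility.check(recordedByTagger, expectedByParser);
        if (result == TagsetCompatibility.Result.MISMATCH) {
            System.err.println("Warning: POS tags use tagset '" + recordedByTagger
                    + "' but the parser expects one of " + expectedByParser);
        }
    }
}
```

Returning a separate UNKNOWN result keeps pipelines working when upstream components do not record tagset information yet, while still letting the parser decide whether a known mismatch should be a warning or an error.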

Original issue reported on code.google.com by richard.eckart on 2013-06-26 10:53:41

reckart commented 9 years ago

(No text was entered with this change)

Original issue reported on code.google.com by richard.eckart on 2013-08-04 09:14:36

Labels added: Milestone-1.5.0, Module-api.resources

dkpro / dkpro-core

Preserve information about tagset #168