Closed by reckart 9 years ago
The simplest case of annotating text with Uby information is to annotate tokens (based
on their lemmas).
There is a lot that can be annotated at the token level. If you just consider semantic
tags, a wide variety of different "semantic tagsets" can be derived from Uby and used
for tagging.
Therefore, my impression was that it might be useful to keep information of the specific
"semantic Uby tagset" used for tagging.
>> However, Uby will probably be only one possible data source for such information.
Sure, the information that is annotated is not Uby-specific at all. I just mention
Uby here, because it is the only lexical resource I am working with (quite ok, since
it contains 10 lexical resources ...)
So there is no need to mention Uby anywhere in the type names.
>> You want Uby specific stuff.
Actually, I cannot think of any Uby-specific stuff to annotate. All that Uby provides
is ordinary lexical information, but at a scale that is typically not reachable by
single lexical resources.
Original issue reported on code.google.com by eckle.kohler
on 2013-06-26 11:33:14
Would this require disambiguating first?
I guess that semantic tags are quite specific to senses.
Original issue reported on code.google.com by torsten.zesch
on 2013-06-26 11:39:31
That depends on the specific semantic tagset used for annotating.
There are cases where disambiguation is not necessary or very simple.
For other semantic tags, the annotator might have to perform some kind of WSD.
Original issue reported on code.google.com by eckle.kohler
on 2013-06-26 11:51:39
Great. I am looking forward to the prototype.
Original issue reported on code.google.com by torsten.zesch
on 2013-06-26 11:52:54
I wonder how we'll do the interfacing between DKPro Core and Uby:
a) have a "uby" module in DKPro Core with a couple of annotators
b) have a "uima" module in Uby with a couple of annotators
c) define resource APIs (e.g. "Dictionary") and generic annotators (e.g. "DictionaryAnnotator")
in DKPro Core and provide implementations of them in Uby.
I think "c" would definitely be the coolest one.
Original issue reported on code.google.com by richard.eckart
on 2013-06-30 17:12:41
I also like c) as it aligns best with the "Uby is an excellent source for information
xyz, but certainly not the only one" paradigm discussed above.
Original issue reported on code.google.com by torsten.zesch
on 2013-06-30 17:57:29
c) +1
BTW: does this still fit with a UbyResourceLocator in uby? (which is living there already
in a uima module created today)
Original issue reported on code.google.com by eckle.kohler
on 2013-06-30 18:00:07
Sure, why not. I imagine for somebody wanting to code a custom component (not resource)
using Uby, the locator should be convenient.
At this point, I couldn't say whether it would be more convenient for a hypothetical "UbyDictionary"
to use it or to have its own internal Uby instance.
Original issue reported on code.google.com by richard.eckart
on 2013-06-30 18:07:14
I have a couple of questions and remarks regarding the DKPro-Core part of the UBY-Core
Interface:
- as a name for the generic interface I would prefer SemanticLabelProvider instead
of Dictionary. I see many similarities to the FrequencyCountProvider in DKPro-Core,
whereas Dictionary seems to be too focussed on the use of dictionaries in my opinion.
This interface would define a method
String getSemanticLabel(String lemma, String POS, String semanticLabelType)
These parameters are actually necessary to implement a generic interface which can
also be implemented by a UbySemanticLabelProvider.
Regarding the Dictionary interface in decompounding, I have a number of questions and
comments that might be discussed elsewhere.
- Is it necessary to implement the UbySemanticLabelProvider as a UIMA resource, i.e.
subclassing Resource_ImplBase in uimaFIT? The FrequencyCountProvider seems not to be
implemented this way.
- I definitely need an annotation type such as SemanticLabel or SemanticCategory with
two features, namely
type (type of the semantic label/category) and
value (value of the semantic label/category).
SemanticLabel might sound too UBY specific. However, the type would be very general:
Examples:
type=semanticField, value=location, person, ...
type=domain, value=Computer, Education, Chemistry, ...
I tried to motivate that already in this discussion:
https://groups.google.com/forum/#!searchin/dkpro-core-developers/uby/dkpro-core-developers/_eCGNb8bUdE/gvV3loucYpAJ
but within this discussion, a kind of misunderstanding occurred.
The new annotation type I need would be quite general and not UBY-specific and not
at all related to the Types which are already available for Named Entities.
A UbySemanticLabelAnnotator will annotate the following word classes with a semantic
category or label: common nouns, main verbs, adjectives.
It will not annotate any proper nouns.
I could also introduce such an annotation type in Uby. But that might be a first step
to a parallel type system.
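A minimal sketch of how the proposed interface and a type/value label could fit together. Only the method signature getSemanticLabel(String lemma, String POS, String semanticLabelType) comes from the proposal above. The SemanticLabel holder class, the MapSemanticLabelProvider, the "return null if unknown" convention and the POS value "noun" are assumptions made for illustration; none of them is existing DKPro Core or Uby API.

```java
import java.util.HashMap;
import java.util.Map;

/** Demo entry point; every name except getSemanticLabel is hypothetical. */
class SemanticLabelDemo {
    public static void main(String[] args) {
        MapSemanticLabelProvider provider = new MapSemanticLabelProvider();
        provider.put("city", "noun", "semanticField", "location");
        provider.put("compiler", "noun", "domain", "Computer");

        SemanticLabel label = SemanticLabel.lookup(provider, "city", "noun", "semanticField");
        System.out.println(label);
    }
}

/** Generic interface with the method proposed in the discussion. */
interface SemanticLabelProvider {
    /** Returns the label value, or null if nothing is known (assumed convention). */
    String getSemanticLabel(String lemma, String POS, String semanticLabelType);
}

/** Hypothetical in-memory provider keyed by lemma, POS and label type. */
class MapSemanticLabelProvider implements SemanticLabelProvider {
    private final Map<String, String> labels = new HashMap<>();

    void put(String lemma, String pos, String type, String value) {
        labels.put(key(lemma, pos, type), value);
    }

    @Override
    public String getSemanticLabel(String lemma, String POS, String semanticLabelType) {
        return labels.get(key(lemma, POS, semanticLabelType));
    }

    // Tab-separated composite key; assumes the parts themselves contain no tabs.
    private static String key(String lemma, String pos, String type) {
        return lemma + "\t" + pos + "\t" + type;
    }
}

/** Hypothetical stand-in for an annotation type with the two features type and value. */
final class SemanticLabel {
    final String type;
    final String value;

    SemanticLabel(String type, String value) {
        this.type = type;
        this.value = value;
    }

    /** Builds a label from a provider lookup, or returns null if the provider has no entry. */
    static SemanticLabel lookup(SemanticLabelProvider provider, String lemma, String pos,
            String type) {
        String value = provider.getSemanticLabel(lemma, pos, type);
        return value == null ? null : new SemanticLabel(type, value);
    }

    @Override
    public String toString() {
        return "type=" + type + ", value=" + value;
    }
}
```

Because the label type is passed as a parameter, a single provider can serve several tagsets (semanticField, domain, ...), which is what keeps the interface independent of Uby.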
Best
Judith
Original issue reported on code.google.com by eckle.kohler
on 2013-07-28 20:00:25
Regarding a new annotation type for semantic field information from WordNet:
This kind of lexical information is actually well established in papers that use lexical
resources for IE or Text Classification.
However, it goes by different names in the literature:
- WordNet lexicographer file names (the very literal name of these tags)
- supersenses, supersense tagging
- semantic fields
I searched on the ACL anthology workbench to get some evidence:
http://aclasb.dfki.de/#txt~p|WordNet%20supersense* (17 hits)
http://aclasb.dfki.de/#txt~p|WordNet%20semantic%20field*doc~W04-0813*
They use semantic field features as well:
Dirk Hovy, Shashank Shrivastava, Sujay Kumar Jauhar, Mrinmaya Sachan, Kartik Goyal,
Huying Li, Whitney Sanders and Eduard Hovy: Identifying Metaphorical Word Use with
Tree Kernels. NAACL HLT Meta4NLP Workshop, 2013.
I used this annotation too (extensively) in recent research (with good results).
So a type SemanticField with a "value" feature might be something worth considering.
Judith
Original issue reported on code.google.com by eckle.kohler
on 2013-07-31 19:34:37
Here is my plan:
- create a new package dictionaryannotator.semantictagging in the module dictionaryannotator-asl
- add to this new package: an Interface SemanticTagProvider, a UIMA resource SimpleSemanticTagProvider
and an annotator SimpleSemanticTagAnnotator that uses a key-value map as resource (retrieved
from a file). The annotator will use the Named entity type for now or another generic
one.
- add test cases for the SimpleSemanticTagAnnotator
The other side of the interface will go to UBY:
- create a new module uby.core-asl
- add resources that inherit from Resource_ImplBase and implement the SemanticTagProvider:
a UbySemanticFieldProvider, UbySemanticFrameProvider, UbyDomainProvider
- add the corresponding annotators that annotate tokens (phrases will be considered
later) with these tags
(I will use existing annotation types for now)
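The DKPro Core side of this plan could look roughly like the sketch below. It is plain Java without the UIMA resource wiring. Only the names SemanticTagProvider and SimpleSemanticTagProvider come from the plan; the single-argument getSemanticTag signature, the tab-separated file format, the "#" comment lines and the "UNKNOWN" fallback are assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Demo entry point using an in-memory "file" instead of a real resource. */
class SimpleSemanticTagDemo {
    public static void main(String[] args) {
        String file = "# lemma<TAB>tag\nhouse\tartifact\ndog\tanimal\n";
        SemanticTagProvider provider = SimpleSemanticTagProvider.load(new StringReader(file));
        System.out.println(provider.getSemanticTag("dog"));
        System.out.println(provider.getSemanticTag("xyz"));
    }
}

/** Interface name from the plan; the method signature is an assumption. */
interface SemanticTagProvider {
    String getSemanticTag(String lemma);
}

/** Key-value provider backed by a map that is read from a tab-separated source. */
class SimpleSemanticTagProvider implements SemanticTagProvider {
    /** Hypothetical fallback for lemmas that are not in the map. */
    static final String UNKNOWN = "UNKNOWN";

    private final Map<String, String> tags;

    private SimpleSemanticTagProvider(Map<String, String> tags) {
        this.tags = tags;
    }

    /** Reads "lemma<TAB>tag" lines; blank lines and lines starting with '#' are skipped. */
    static SimpleSemanticTagProvider load(Reader source) {
        Map<String, String> tags = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) {
                    continue;
                }
                String[] parts = line.split("\t", 2);
                if (parts.length != 2) {
                    throw new IllegalArgumentException("Malformed line: " + line);
                }
                tags.put(parts[0], parts[1]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new SimpleSemanticTagProvider(tags);
    }

    @Override
    public String getSemanticTag(String lemma) {
        return tags.getOrDefault(lemma, UNKNOWN);
    }
}
```

The Uby-backed providers would implement the same interface, so an annotator written against SemanticTagProvider would not need to know whether the tags come from a file or from the database.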
any objections?
Original issue reported on code.google.com by eckle.kohler
on 2013-08-02 13:31:00
For the first shot, I'd suggest keeping all of the stuff in one module, either on the
Uby or on the DKPro Core side. I'd suggest dumping it into the dictionaryannotator
module right now. Moving code around to better places and/or renaming can be done when
it works.
Original issue reported on code.google.com by richard.eckart
on 2013-08-02 13:35:20
I finished the first round and implemented
- SemanticTagProvider (Interface)
- NounSemanticFieldResource
- NounSemanticFieldAnnotator
and a test class for the annotator:
- NounSemanticFieldAnnotatorTest along with a tiny test resource nounSemanticFieldMapTest.txt
In the test class I use the AssertAnnotations.assertNamedEntity convenience method
from testing-asl. However, my test only turned green when I added a modified version
of assertNamedEntity without the parameter aExpectedMapped.
In my case, there is no mapping between original and DKPro-Core NE values/types.
The method I added looks like this:
public static void assertNamedEntity(String[] aExpectedOriginal,
Collection<NamedEntity> aActual)
Isn't there a way to use the original method
assertNamedEntity(String[] aExpectedMapped, String[] aExpectedOriginal,
Collection<NamedEntity> aActual)
in a way that does not assume a mapping? I tried several versions with aExpectedMapped
and aExpectedOriginal set to the same String[], but it did not work.
Otherwise, can I add the
public static void assertNamedEntity(String[] aExpectedOriginal,
Collection<NamedEntity> aActual)
to AssertAnnotations?
Judith
Original issue reported on code.google.com by eckle.kohler
on 2013-08-04 11:48:46
Did you try passing "null" as aExpectedMapped? Looking at the method, it should
ignore that argument if it is null.
Original issue reported on code.google.com by richard.eckart
on 2013-08-04 11:52:23
Yes, I did, and it does not work:
AssertAnnotations.assertNamedEntity(null,documentNounSemanticFields,
select(aJCas, NamedEntity.class));
yields
java.lang.NullPointerException
at java.util.Arrays$ArrayList.<init>(Arrays.java:2842)
at java.util.Arrays.asList(Arrays.java:2828)
at de.tudarmstadt.ukp.dkpro.core.testing.AssertAnnotations.assertNamedEntity(AssertAnnotations.java:199)
at de.tudarmstadt.ukp.dkpro.core.dictionaryannotator.semantictagging.NounSemanticFieldAnnotatorTest.runAnnotatorTest(NounSemanticFieldAnnotatorTest.java:109)
at de.tudarmstadt.ukp.dkpro.core.dictionaryannotator.semantictagging.NounSemanticFieldAnnotatorTest.testGermanSeparatedParticles(NounSemanticFieldAnnotatorTest.java:37)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Original issue reported on code.google.com by eckle.kohler
on 2013-08-04 21:30:36
I've fixed the NPE in assertNamedEntity for your case.
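For context: the stack trace above points at Arrays.asList, which throws a NullPointerException when it is handed a null array. A null-tolerant comparison can be sketched as follows. This is not the actual change made to AssertAnnotations; the class and method names are hypothetical, and plain string lists stand in for the NamedEntity annotations.

```java
import java.util.Arrays;
import java.util.List;

/** Demo entry point; passes null for the mapped expectation. */
class NullTolerantAssertDemo {
    public static void main(String[] args) {
        List<String> actual = Arrays.asList("person", "location");
        // The mapped expectation is null, so only the original values are checked.
        NullTolerantAssert.assertValues(null, new String[] { "person", "location" },
                actual, actual);
        System.out.println("ok");
    }
}

/** Hypothetical helper: an expectation that is null is skipped instead of compared. */
class NullTolerantAssert {
    static void assertValues(String[] expectedMapped, String[] expectedOriginal,
            List<String> actualMapped, List<String> actualOriginal) {
        if (expectedMapped != null) {
            compare("mapped", expectedMapped, actualMapped);
        }
        if (expectedOriginal != null) {
            compare("original", expectedOriginal, actualOriginal);
        }
    }

    private static void compare(String what, String[] expected, List<String> actual) {
        // Arrays.asList is only reached with a non-null array here.
        List<String> expectedList = Arrays.asList(expected);
        if (!expectedList.equals(actual)) {
            throw new AssertionError(
                    "Unexpected " + what + " values: expected " + expectedList + " but was " + actual);
        }
    }
}
```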
Original issue reported on code.google.com by richard.eckart
on 2013-08-05 08:47:10
Thanks for fixing the assertNamedEntity, Richard.
I have a question regarding the key/value resource file that contains the noun lemmas
and their WordNet semantic field. Where should this resource go? Are there any naming
conventions for such files?
The size of the file is 2.3 MB.
Original issue reported on code.google.com by eckle.kohler
on 2013-08-05 20:05:53
I thought the idea was to access the Uby database directly?
Otherwise, I suppose this would be a resource to be packaged as a JAR file and to
go into the Maven repository.
Original issue reported on code.google.com by richard.eckart
on 2013-08-05 20:15:37
>> I thought the idea was to access the Uby database directly?
right, this is the idea.
The file resource with the WordNet semantic fields just turned out to be very useful
and broadly applicable, so I extracted this information into a file for efficiency
reasons.
I also thought other people might be interested in using it, because it does not
require installing a database.
Now I will implement 2 UBY specific pairs of resources and annotators:
- UbySemanticPredicateResource and UbySemanticPredicateAnnotator (will use the type
SemanticPredicate)
- UbyDomainLabelResource and UbyDomainLabelAnnotator (will use the type field from
api.structure)
These will access the UBY DB directly and also exploit the sense links in particular
ways.
Original issue reported on code.google.com by eckle.kohler
on 2013-08-06 03:36:23
So currently, we have these build.xml files which download resources from their original
websites, package them, and upload them to our Maven repository. If there is no "original
website" for a resource, e.g. for your list, we so far host them in the downloads section
of the DKPro Core ASL google project (which will go away soon, so some different hosting
location will be required).
Original issue reported on code.google.com by richard.eckart
on 2013-08-06 08:44:20
For the UBY specific resources I need to create a mapping between
- Core POS tags and UBY POS tags
- Core language information (ISO 2-letter code) and UBY language information (ISO 3-letter
code)
Is it sensible to assume that for all the POS taggers integrated in DKPro-Core (English
and German), a mapping exists that maps the original POS tags to Core POS types?
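For the language part of the mapping, the JDK can already convert two-letter ISO 639 codes to three-letter ones. A minimal sketch, assuming the three-letter codes returned by java.util.Locale are the ones the UBY side expects (this would need to be verified); the helper class and method name are hypothetical.

```java
import java.util.Locale;
import java.util.MissingResourceException;

/** Demo entry point for the two languages mentioned in the discussion. */
class LanguageCodeDemo {
    public static void main(String[] args) {
        System.out.println(LanguageCodes.toIso3("en"));
        System.out.println(LanguageCodes.toIso3("de"));
    }
}

/** Hypothetical helper that maps a two-letter language code to a three-letter one. */
class LanguageCodes {
    static String toIso3(String iso2) {
        try {
            String iso3 = Locale.forLanguageTag(iso2).getISO3Language();
            if (iso3.isEmpty()) {
                // forLanguageTag yields an empty locale for ill-formed input.
                throw new IllegalArgumentException("Not a language code: " + iso2);
            }
            return iso3;
        } catch (MissingResourceException e) {
            // Thrown by getISO3Language when no three-letter code is known.
            throw new IllegalArgumentException("No three-letter code for: " + iso2, e);
        }
    }
}
```

The POS side has no such ready-made conversion and would still need an explicit mapping table.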
Original issue reported on code.google.com by eckle.kohler
on 2013-08-10 17:12:39
German POS models usually use STTS and English POS models usually use PTB. Both are
mapped.
Are the UBY POS tags language specific?
Original issue reported on code.google.com by richard.eckart
on 2013-08-10 17:15:54
>> German POS models usually use STTS and English POS models usually use PTB. Both are
mapped.
fine.
>> Are the UBY POS tags language specific?
No, they are designed to be language-independent.
But a Uby-specific resource that implements the getSemanticTag method needs POS and
lemma information to access the lexical entry.
And the language information to pre-select the Uby lexicon to use.
This is important in order to throw appropriate exceptions that inform the user if
e.g. the German lexicon GermaNet is missing in UBY.
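The pre-selection step could be sketched like this. Everything here is hypothetical: the LexiconSelector class, the language-to-lexicon table and the way "installed" lexicons are represented are illustrative assumptions and not Uby API. The point is only that a missing lexicon is reported with a specific message before any lemma lookup happens.

```java
import java.util.Map;
import java.util.Set;

/** Demo entry point: GermaNet is configured for German but not installed. */
class LexiconSelectorDemo {
    public static void main(String[] args) {
        LexiconSelector selector = new LexiconSelector(
                Map.of("eng", "WordNet", "deu", "GermaNet"), // assumed configuration
                Set.of("WordNet"));                          // lexicons present in the DB
        System.out.println(selector.select("eng"));
        try {
            selector.select("deu");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}

/** Hypothetical helper that picks the lexicon for a language and fails early if it is absent. */
class LexiconSelector {
    private final Map<String, String> lexiconByLanguage;
    private final Set<String> installedLexicons;

    LexiconSelector(Map<String, String> lexiconByLanguage, Set<String> installedLexicons) {
        this.lexiconByLanguage = lexiconByLanguage;
        this.installedLexicons = installedLexicons;
    }

    String select(String language) {
        String lexicon = lexiconByLanguage.get(language);
        if (lexicon == null) {
            throw new IllegalArgumentException(
                    "No lexicon configured for language '" + language + "'");
        }
        if (!installedLexicons.contains(lexicon)) {
            throw new IllegalStateException("Lexicon '" + lexicon + "' for language '"
                    + language + "' is missing in the database");
        }
        return lexicon;
    }
}
```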
Original issue reported on code.google.com by eckle.kohler
on 2013-08-10 17:28:09
Issue 169. Committed UbySemanticFieldResource, UbySemanticFieldAnnotator and UbyResourceUtils.
The test class UbySemanticFieldAnnotatorTest runs successfully, but only against a real
(MySQL) DB; therefore the test method is ignored.
A suitable test case for an in-memory UBY DB should be added.
Original issue reported on code.google.com by eckle.kohler
on 2013-08-12 08:40:08
A test case for an in-memory UBY DB was added.
See http://code.google.com/p/dkpro-core-asl/source/detail?r=1791
Original issue reported on code.google.com by eckle.kohler
on 2013-08-18 13:42:26
(No text was entered with this change)
Original issue reported on code.google.com by richard.eckart
on 2013-09-12 19:59:57
I think the NounSemanticFieldAnnotator and the NounSemanticFieldAnnotatorTest can be
removed.
Additional parameters that could be added to the SemanticFieldAnnotator:
- maybe language (?)
- token vs. phrase annotation
Original issue reported on code.google.com by eckle.kohler
on 2013-09-14 20:40:22
(No text was entered with this change)
Original issue reported on code.google.com by richard.eckart
on 2013-09-17 14:42:35
(No text was entered with this change)
Original issue reported on code.google.com by richard.eckart
on 2014-03-26 10:51:39
I believe we now have implementations of the ideas presented here, on the side of
DKPro Core in the dictionaryannotator module and on the side of Uby in the form of
resources that can be used with the dictionaryannotator code, right? If so, this could
be resolved.
Original issue reported on code.google.com by richard.eckart
on 2014-05-26 22:17:49
Separate issues could be opened for specific extensions, e.g. for passing the language
through.
Original issue reported on code.google.com by richard.eckart
on 2014-05-26 22:18:28
>>I believe we do now have implementations of the ideas presented here on the sides
of DKPro Core in the dictionaryannotator module and on the side of Uby in the form
of resources that can be used with the dictionaryannotator code, right?
Actually, this issue should be closed as won't fix.
Another issue could be opened titled "Tag text with information from wordlists". And
this issue can be marked as resolved.
The resource AND annotators that tag text with information from Uby have been moved
to Uby. The reason for this was the fact that Uby is not yet on Maven Central.
>> Separate issues could be opened for specific extensions, e.g. for passing the language
through.
Right.
Another extension would be to tag not only tokens, but also noun chunks.
I have already implemented that, but I would need help in setting up the test case, because
last time I could not find out how chunks are composed/built in a test case.
Original issue reported on code.google.com by eckle.kohler
on 2014-05-27 06:50:27
Renaming and closing as fixed.
Original issue reported on code.google.com by richard.eckart
on 2014-05-27 08:08:40
Original issue reported on code.google.com by torsten.zesch
on 2013-06-26 10:46:47