dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core

Tag text with information from wordlists #169

Closed reckart closed 9 years ago

reckart commented 9 years ago
I moved that discussion to its own issue.

Original question was:

---
this sounds very useful and important. Could such a type be used for tagging text with
UBY-"tags"?
E.g., with the "TagSet" type version, that would be something like "name"= ubySemanticTag
and "layer" = semantics

---

a) You want to annotate specific things with Uby as a data source. Here it depends
on what you want to do. If you want to annotate lemmas, use the Lemma annotation. If you
want to annotate things from Uby that we do not support yet, this needs to
be discussed. However, Uby will probably be only one possible data source for such
information.

b) You want Uby specific stuff. This could rather reside in the Uby repository with
a special Uby annotation type that e.g. holds an id which can be used to access all
the wealth of Uby information if you need it.

Original issue reported on code.google.com by torsten.zesch on 2013-06-26 10:46:47

reckart commented 9 years ago
The simplest case of annotating text with Uby information is to annotate tokens (based
on their lemmas).

There is a lot that can be annotated at the token level. If you just consider semantic
tags, a wide variety of different "semantic tagsets" can be derived from Uby and used
for tagging.

Therefore, my impression was that it might be useful to keep information of the specific
"semantic Uby tagset" used for tagging.

>> However, Uby will probably be only one possible data source for such information.

Sure, the information that is annotated is not Uby-specific at all. I just mention
Uby here, because it is the only lexical resource I am working with (quite ok, since
it contains 10 lexical resources ...)
So there is no need to mention Uby anywhere in the type names.

>> You want Uby specific stuff. 
Actually, I cannot think of any Uby-specific stuff to annotate. All that Uby provides
is ordinary lexical information, but at a scale that is typically not reachable by
single lexical resources.

Original issue reported on code.google.com by eckle.kohler on 2013-06-26 11:33:14

reckart commented 9 years ago
Would this require disambiguating first?
I guess that semantic tags are quite specific to senses.

Original issue reported on code.google.com by torsten.zesch on 2013-06-26 11:39:31

reckart commented 9 years ago
That depends on the specific semantic tagset used for annotating. 

There are cases where disambiguation is not necessary or very simple.
For other semantic tags, the annotator might have to perform some kind of WSD.
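
To illustrate the distinction, here is a minimal sketch of the "disambiguation is not necessary" case: if every sense of a lemma carries the same semantic tag, the tag can be assigned without WSD. The class and method names are hypothetical and are not part of Uby or DKPro Core; the input list stands in for whatever a lexicon lookup would return.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper: decides whether a lemma can be tagged without WSD. */
final class SenseTagResolver {
    private SenseTagResolver() {}

    /**
     * Returns the shared tag if every sense of the lemma carries the same
     * semantic tag. Returns empty if the senses disagree (some form of WSD
     * would be needed) or if there are no senses at all.
     */
    static Optional<String> unambiguousTag(List<String> tagsOfAllSenses) {
        Set<String> distinct = new HashSet<>(tagsOfAllSenses);
        return distinct.size() == 1
                ? Optional.of(distinct.iterator().next())
                : Optional.empty();
    }
}
```

An annotator could call such a check first and fall back to a WSD component only when the result is empty.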

Original issue reported on code.google.com by eckle.kohler on 2013-06-26 11:51:39

reckart commented 9 years ago
Great. I am looking forward to the prototype.

Original issue reported on code.google.com by torsten.zesch on 2013-06-26 11:52:54

reckart commented 9 years ago
I wonder how we'll do the interfacing between DKPro Core and Uby:

a) have a "uby" module in DKPro Core with a couple of annotators
b) have a "uima" module in Uby with a couple of annotators
c) define resource APIs (e.g. "Dictionary") and generic annotators (e.g. "DictionaryAnnotator")
in DKPro Core and provide implementations of that in Uby.

I think "c" would definitely be the coolest one.
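
As a rough sketch of what option c) could look like: the resource API and the generic annotator know nothing about Uby, and the Uby side only supplies an implementation. This is plain Java without UIMA, and the method names (lookup, annotate, put) as well as the map that stands in for the Uby database are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Resource API that would live on the DKPro Core side (hypothetical). */
interface Dictionary {
    /** Returns the entry for a lemma, or null if the lemma is unknown. */
    String lookup(String lemma);
}

/** Generic annotator: depends only on the Dictionary API, not on Uby. */
final class DictionaryAnnotator {
    private final Dictionary dictionary;

    DictionaryAnnotator(Dictionary dictionary) {
        this.dictionary = dictionary;
    }

    /** Returns one "lemma/entry" string per lemma found in the dictionary. */
    List<String> annotate(List<String> lemmas) {
        List<String> annotations = new ArrayList<>();
        for (String lemma : lemmas) {
            String entry = dictionary.lookup(lemma);
            if (entry != null) {
                annotations.add(lemma + "/" + entry);
            }
        }
        return annotations;
    }
}

/** Implementation that would live on the Uby side; a map stands in for the database. */
final class UbyDictionary implements Dictionary {
    private final Map<String, String> entries = new HashMap<>();

    UbyDictionary put(String lemma, String entry) {
        entries.put(lemma, entry);
        return this;
    }

    @Override
    public String lookup(String lemma) {
        return entries.get(lemma);
    }
}
```

With this split, another data source only needs its own Dictionary implementation; the annotator stays unchanged.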

Original issue reported on code.google.com by richard.eckart on 2013-06-30 17:12:41

reckart commented 9 years ago
I also like c) as it aligns best with the "Uby is an excellent source for information
xyz, but certainly not the only one" paradigm discussed above.

Original issue reported on code.google.com by torsten.zesch on 2013-06-30 17:57:29

reckart commented 9 years ago
c) +1

BTW: does this still fit with a UbyResourceLocator in uby? (which is living there already
in a uima module created today)

Original issue reported on code.google.com by eckle.kohler on 2013-06-30 18:00:07

reckart commented 9 years ago
Sure, why not. I imagine for somebody wanting to code a custom component (not resource)
using Uby, the locator should be convenient.

At this point, I couldn't say whether it would be more convenient for a hypothetical
"UbyDictionary" to use it or to have its own internal Uby instance.

Original issue reported on code.google.com by richard.eckart on 2013-06-30 18:07:14

reckart commented 9 years ago
I have a couple of questions and remarks regarding the DKPro-Core part of the UBY-Core
Interface:

- as a name for the generic interface I would prefer SemanticLabelProvider instead
of Dictionary. I see many similarities to the FrequencyCountProvider in DKPro-Core,
whereas Dictionary seems to be too focussed on the use of dictionaries in my opinion.
This interface would define a method 
String getSemanticLabel(String lemma, String POS, String semanticLabelType)

These parameters are actually necessary to implement a generic interface which can
also be implemented by a UbySemanticLabelProvider.

Regarding the Dictionary interface in decompounding, I have a number of questions and
comments that might be discussed elsewhere.

- Is it necessary to implement the UbySemanticLabelProvider as a UIMA resource, i.e.
subclassing Resource_ImplBase in uimaFIT? The FrequencyCountProvider seems not to be
implemented this way.

- I definitely need an annotation type such as SemanticLabel or SemanticCategory with
two features, namely 
type (type of the semantic label/category) and
value (value of the semantic label/category).

SemanticLabel might sound too UBY specific. However, the type would be very general:

Examples:
type=semanticField, value=location, person, ... 
type=domain, value=Computer, Education, Chemistry, ...
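
The proposal above could be sketched roughly as follows. This is plain Java, not a UIMA type: a real annotation type would be declared in a type system descriptor, and the SemanticLabel class and its from helper are hypothetical stand-ins. Only the getSemanticLabel signature is taken from the text.

```java
import java.util.Objects;

/** Hypothetical provider API following the signature proposed above. */
interface SemanticLabelProvider {
    /** Returns the label of the given type for a lemma/POS pair, or null if none is known. */
    String getSemanticLabel(String lemma, String pos, String semanticLabelType);
}

/** Plain-Java stand-in for the proposed annotation type with "type" and "value" features. */
final class SemanticLabel {
    private final String type;   // e.g. "semanticField" or "domain"
    private final String value;  // e.g. "location" or "Chemistry"

    SemanticLabel(String type, String value) {
        this.type = Objects.requireNonNull(type);
        this.value = Objects.requireNonNull(value);
    }

    String getType() { return type; }

    String getValue() { return value; }

    /** Builds a label by asking the provider; returns null if the provider has no answer. */
    static SemanticLabel from(SemanticLabelProvider provider, String lemma,
            String pos, String labelType) {
        String value = provider.getSemanticLabel(lemma, pos, labelType);
        return value == null ? null : new SemanticLabel(labelType, value);
    }
}
```

Keeping the label type as a feature value, rather than encoding it in the type name, is what allows one annotation type to serve semantic fields, domains and other tagsets alike.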

I tried to motivate that already in this discussion:
https://groups.google.com/forum/#!searchin/dkpro-core-developers/uby/dkpro-core-developers/_eCGNb8bUdE/gvV3loucYpAJ

but within this discussion, a kind of misunderstanding occurred.

The new annotation type I need would be quite general and not UBY-specific and not
at all related to the Types which are already available for Named Entities.

A UbySemanticLabelAnnotator will annotate the following word classes with a semantic
category or label: common nouns, main verbs, adjectives.
It will not annotate any proper nouns.

I could also introduce such an annotation type in Uby. But that might be a first step
to a parallel type system.

Best
Judith

Original issue reported on code.google.com by eckle.kohler on 2013-07-28 20:00:25

reckart commented 9 years ago
Regarding a new annotation type for semantic field information from WordNet:
This kind of lexical information is actually well established in papers that use lexical
resources for IE or Text Classification.

However, they are called differently in the literature:
- WordNet lexicographer file names (the very literal name of these tags)
- supersenses, supersense tagging
- semantic fields

I searched on the ACL anthology workbench to get some evidence:

http://aclasb.dfki.de/#txt~p|WordNet%20supersense* (17 hits)

http://aclasb.dfki.de/#txt~p|WordNet%20semantic%20field*doc~W04-0813*

They use semantic field features as well:
Dirk Hovy, Shashank Shrivastava, Sujay Kumar Jauhar, Mrinmaya Sachan, Kartik Goyal,
Huying Li, Whitney Sanders and Eduard Hovy: Identifying Metaphorical Word Use with
Tree Kernels. NAACL HLT Meta4NLP Workshop, 2013.

I used this annotation too (extensively) in recent research (with good results).

So a type SemanticField with a "value" feature might be something worth considering.

Judith

Original issue reported on code.google.com by eckle.kohler on 2013-07-31 19:34:37

reckart commented 9 years ago
Here is my plan:

- create a new package dictionaryannotator.semantictagging in the module dictionaryannotator-asl

- add to this new package: an Interface SemanticTagProvider, a UIMA resource SimpleSemanticTagProvider
and an annotator SimpleSemanticTagAnnotator that uses a key-value map as resource (retrieved
from a file). The annotator will use the Named entity type for now or another generic
one.

- add test cases for the SimpleSemanticTagAnnotator

The other side of the interface will go to UBY:

- create a new module uby.core-asl

- add resources that inherit from Resource_ImplBase and implement the SemanticTagProvider:
a UbySemanticFieldProvider, UbySemanticFrameProvider, UbyDomainProvider

- add the corresponding annotators that annotate tokens (phrases will be considered
later) with these tags
(I will use existing annotation types for now)

any objections?
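
A minimal sketch of the key-value part of this plan, assuming a tab-separated "lemma<TAB>tag" file format and a single-argument getSemanticTag method. Both are assumptions; the real interface in DKPro Core may look different, and a real resource would additionally extend Resource_ImplBase.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical interface; the actual signature may differ. */
interface SemanticTagProvider {
    String getSemanticTag(String lemma);
}

/** Looks up tags in a key-value map read from tab-separated "lemma<TAB>tag" lines. */
final class SimpleSemanticTagProvider implements SemanticTagProvider {
    static final String UNKNOWN = "UNKNOWN"; // assumed fallback value

    private final Map<String, String> tags = new HashMap<>();

    SimpleSemanticTagProvider(Reader source) {
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");
                if (fields.length == 2) { // skip blank or malformed lines
                    tags.put(fields[0], fields[1]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public String getSemanticTag(String lemma) {
        return tags.getOrDefault(lemma, UNKNOWN);
    }
}
```

Taking a Reader rather than a file path keeps the sketch easy to test with an in-memory string.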

Original issue reported on code.google.com by eckle.kohler on 2013-08-02 13:31:00

reckart commented 9 years ago
For the first shot, I'd suggest keeping all of the stuff in one module, either on the
Uby or on the DKPro Core side. I'd suggest dumping it into the dictionaryannotator
module right now. Moving code around to better places and/or renaming can be done when
it works.

Original issue reported on code.google.com by richard.eckart on 2013-08-02 13:35:20

reckart commented 9 years ago
I finished the first round and implemented 

- SemanticTagProvider (Interface)
- NounSemanticFieldResource
- NounSemanticFieldAnnotator

and a test class for the annotator:
- NounSemanticFieldAnnotatorTest along with a tiny test resource nounSemanticFieldMapTest.txt

In the test class I use the AssertAnnotations.assertNamedEntity convenience method
from testing-asl. However, my test only turned green when I added a modified version
of assertNamedEntity without the parameter aExpectedMapped.
In my case, there is no mapping between original and DKPro-Core NE values/types.

The method I added looks like this:
public static void assertNamedEntity(String[] aExpectedOriginal,
            Collection<NamedEntity> aActual)

Isn't there a way to use the original method

assertNamedEntity(String[] aExpectedMapped, String[] aExpectedOriginal,
            Collection<NamedEntity> aActual)

in a way that does not assume a mapping? I tried several versions with aExpectedMapped
and aExpectedOriginal set to the same String[], but it did not work.

Otherwise, can I add the 

public static void assertNamedEntity(String[] aExpectedOriginal,
            Collection<NamedEntity> aActual)

to AssertAnnotations?

Judith

Original issue reported on code.google.com by eckle.kohler on 2013-08-04 11:48:46

reckart commented 9 years ago
Did you try passing "null" as aExpectedMapped? Looking at the method, it should
ignore that argument if it is null.

Original issue reported on code.google.com by richard.eckart on 2013-08-04 11:52:23

reckart commented 9 years ago
yes, I did and it does not work:

AssertAnnotations.assertNamedEntity(null,documentNounSemanticFields,
        select(aJCas, NamedEntity.class));

yields

java.lang.NullPointerException
    at java.util.Arrays$ArrayList.<init>(Arrays.java:2842)
    at java.util.Arrays.asList(Arrays.java:2828)
    at de.tudarmstadt.ukp.dkpro.core.testing.AssertAnnotations.assertNamedEntity(AssertAnnotations.java:199)
    at de.tudarmstadt.ukp.dkpro.core.dictionaryannotator.semantictagging.NounSemanticFieldAnnotatorTest.runAnnotatorTest(NounSemanticFieldAnnotatorTest.java:109)
    at de.tudarmstadt.ukp.dkpro.core.dictionaryannotator.semantictagging.NounSemanticFieldAnnotatorTest.testGermanSeparatedParticles(NounSemanticFieldAnnotatorTest.java:37)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)

Original issue reported on code.google.com by eckle.kohler on 2013-08-04 21:30:36

reckart commented 9 years ago
I've fixed the NPE in assertNamedEntity for your case.
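
For readers following the stack trace: Arrays.asList throws a NullPointerException when it is handed a null array, so an optional expectation has to be guarded before conversion. The sketch below shows that kind of guard in isolation. It is not the actual AssertAnnotations code or the actual fix, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; not the actual AssertAnnotations code. */
final class NullSafeExpectations {
    private NullSafeExpectations() {}

    /** Converts an expected-values array to a list, treating null as "do not check". */
    static List<String> toListOrNull(String[] expected) {
        // Arrays.asList((String[]) null) throws NullPointerException, so guard first.
        return expected == null ? null : Arrays.asList(expected);
    }

    /** Compares only when an expectation was supplied. */
    static boolean matches(String[] expected, List<String> actual) {
        List<String> expectedList = toListOrNull(expected);
        return expectedList == null || expectedList.equals(actual);
    }
}
```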

Original issue reported on code.google.com by richard.eckart on 2013-08-05 08:47:10

reckart commented 9 years ago
Thanks for fixing the assertNamedEntity, Richard.

I have a question regarding the key/value resource file that contains the noun lemmas
and their WordNet semantic field. Where should this resource go? Are there any naming
conventions for such files?
The size of the file is 2.3 MB.

Original issue reported on code.google.com by eckle.kohler on 2013-08-05 20:05:53

reckart commented 9 years ago
I thought the idea was to access the Uby database directly?

Otherwise, I suppose this would be a resource to be packaged as a JAR file and to
go into the Maven repository.

Original issue reported on code.google.com by richard.eckart on 2013-08-05 20:15:37

reckart commented 9 years ago
>> I thought the idea was to access the Uby database directly?

right, this is the idea.

The file resource with the WordNet semantic fields just turned out to be very useful
and broadly applicable, so I extracted this information into a file for efficiency
reasons.
I also thought other people might be interested in using it, because it does not
require installing a database.

Now I will implement 2 UBY specific pairs of resources and annotators:
- UbySemanticPredicateResource and UbySemanticPredicateAnnotator (will use the type
SemanticPredicate)
- UbyDomainLabelResource and UbyDomainLabelAnnotator (will use the type field from
api.structure)

These will access the UBY DB directly and also exploit the sense links in particular
ways.

Original issue reported on code.google.com by eckle.kohler on 2013-08-06 03:36:23

reckart commented 9 years ago
So currently, we have these build.xml files which download resources from their original
websites, package them, and upload them to our Maven repository. If there is no "original
website" for a resource, e.g. for your list, we so far host them in the downloads section
of the DKPro Core ASL google project (which will go away soon, so some different hosting
location will be required).

Original issue reported on code.google.com by richard.eckart on 2013-08-06 08:44:20

reckart commented 9 years ago
For the UBY specific resources I need to create a mapping between

- Core POS tags and UBY POS tags
- Core language information (ISO 2-letter code) and UBY language information (ISO 3-letter
code)

Is it sensible to assume that for all the POS taggers integrated in DKPro-Core (English
and German), a mapping exists that maps the original POS tags to Core POS types?
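
A small sketch of the two mappings, under stated assumptions: the language conversion relies on the JDK's java.util.Locale, while the POS keys and values are placeholders and not actual DKPro Core type names or Uby constants.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical mapping helper; the POS names on both sides are assumptions. */
final class UbyMappingSketch {
    private static final Map<String, String> POS = new HashMap<>();

    static {
        // Assumed coarse-grained POS names; only tagged word classes are listed.
        POS.put("N", "noun");
        POS.put("V", "verb");
        POS.put("ADJ", "adjective");
    }

    private UbyMappingSketch() {}

    /** Converts an ISO 639-1 two-letter code to the three-letter code known to the JDK. */
    static String toIso3(String iso2) {
        return Locale.forLanguageTag(iso2).getISO3Language();
    }

    /** Returns the mapped POS name, or null for word classes that are not mapped. */
    static String toLexiconPos(String corePos) {
        return POS.get(corePos);
    }
}
```

For "de" the JDK returns "deu" and for "en" it returns "eng"; whether the lexicon expects exactly these variants would still need to be checked.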

Original issue reported on code.google.com by eckle.kohler on 2013-08-10 17:12:39

reckart commented 9 years ago
German POS models usually use STTS and English POS models usually use PTB. Both are
mapped. 

Are the UBY POS tags language specific?

Original issue reported on code.google.com by richard.eckart on 2013-08-10 17:15:54

reckart commented 9 years ago
>> German POS models usually use STTS and English POS models usually use PTB. Both are
mapped. 

fine.

>> Are the UBY POS tags language specific?
No, they are designed to be language-independent. 

But a Uby-specific resource that implements the getSemanticTag method needs POS and
lemma information to access the lexical entry.

And the language information to pre-select the Uby lexicon to use.

This is important in order to throw appropriate exceptions that inform the user if
e.g. the German lexicon GermaNet is missing in UBY.
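
The behaviour described here could be sketched as follows: the language selects the lexicon first, and a missing lexicon produces a descriptive exception instead of a silent miss. The class name, the three-argument getSemanticTag signature, and the nested map that stands in for lexicons in the UBY database are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical resource that pre-selects a lexicon by language before any lookup. */
final class LanguageAwareTagResource {
    // language code -> ("lemma/pos" -> tag); stands in for lexicons stored in a database
    private final Map<String, Map<String, String>> lexicons = new HashMap<>();

    void addEntry(String language, String lemma, String pos, String tag) {
        lexicons.computeIfAbsent(language, k -> new HashMap<>())
                .put(lemma + "/" + pos, tag);
    }

    /** Fails fast with a descriptive message if no lexicon exists for the language. */
    String getSemanticTag(String language, String lemma, String pos) {
        Map<String, String> lexicon = lexicons.get(language);
        if (lexicon == null) {
            throw new IllegalStateException(
                    "No lexicon available for language '" + language + "'");
        }
        return lexicon.get(lemma + "/" + pos); // null if the entry is unknown
    }
}
```

Distinguishing "no lexicon for this language" from "lemma not found" is the point of the design: the first is a setup problem the user can fix, the second is a normal lookup result.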

Original issue reported on code.google.com by eckle.kohler on 2013-08-10 17:28:09

reckart commented 9 years ago
Issue 169. Committed UbySemanticFieldResource, UbySemanticFieldAnnotator and UbyResourceUtils

The test class UbySemanticFieldAnnotatorTest successfully runs a test against a real (MySQL)
DB; because it needs that database, the test method is currently ignored.
A suitable test case for an in-memory UBY DB should be added.

Original issue reported on code.google.com by eckle.kohler on 2013-08-12 08:40:08

reckart commented 9 years ago
test case for an in-memory UBY DB was added.
see http://code.google.com/p/dkpro-core-asl/source/detail?r=1791

Original issue reported on code.google.com by eckle.kohler on 2013-08-18 13:42:26

reckart commented 9 years ago
(No text was entered with this change)

Original issue reported on code.google.com by richard.eckart on 2013-09-12 19:59:57

reckart commented 9 years ago

I think the NounSemanticFieldAnnotator and the NounSemanticFieldAnnotatorTest can be
removed.

Additional parameters that could be added to the SemanticFieldAnnotator:

- maybe language (?)
- token vs. phrase annotation

Original issue reported on code.google.com by eckle.kohler on 2013-09-14 20:40:22

reckart commented 9 years ago
(No text was entered with this change)

Original issue reported on code.google.com by richard.eckart on 2013-09-17 14:42:35

reckart commented 9 years ago
(No text was entered with this change)

Original issue reported on code.google.com by richard.eckart on 2014-03-26 10:51:39

reckart commented 9 years ago
I believe we now have implementations of the ideas presented here on the side of
DKPro Core in the dictionaryannotator module and on the side of Uby in the form of
resources that can be used with the dictionaryannotator code, right? If so, this could
be resolved.

Original issue reported on code.google.com by richard.eckart on 2014-05-26 22:17:49

reckart commented 9 years ago
Separate issues could be opened for specific extensions, e.g. for passing the language
through.

Original issue reported on code.google.com by richard.eckart on 2014-05-26 22:18:28

reckart commented 9 years ago
>>I believe we do now have implementations of the ideas presented here on the sides
of DKPro Core in the dictionaryannotator module and on the side of Uby in the form
of resources that can be used with the dictionaryannotator code, right?

Actually, this issue should be closed as won't fix.
Another issue could be opened titled "Tag text with information from wordlists". And
this issue can be marked as resolved.

The resource AND annotators that tag text with information from Uby have been moved
to Uby. The reason for this was the fact that Uby is not yet on Maven Central.

>> Separate issues could be opened for specific extensions, e.g. for passing the language
through.

Right.
Another extension would be to tag not only tokens, but also noun chunks.
I have already implemented that, but I would need help setting up the test case, because
last time I could not find out how chunks are composed/built in a test case.

Original issue reported on code.google.com by eckle.kohler on 2014-05-27 06:50:27

reckart commented 9 years ago
Renaming and closing as fixed.

Original issue reported on code.google.com by richard.eckart on 2014-05-27 08:08:40