codeaudit / dkpro-core-asl

Automatically exported from code.google.com/p/dkpro-core-asl

Tag text with information from wordlists #169

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I moved that discussion to its own issue.

Original question was:

---
this sounds very useful and important. Could such a type be used for tagging 
text with UBY-"tags"?
E.g., with the "TagSet" type version, that would be something like "name"= 
ubySemanticTag and "layer" = semantics

---

a) You want to annotate specific things with Uby as a data source. Here it 
depends what you want to do. If you want to annotate lemmas, use the Lemma 
annotation. If you want to annotate things from Uby that we do not yet 
support, this needs to be discussed. However, Uby will probably be only 
one possible data source for such information.

b) You want Uby specific stuff. This could rather reside in the Uby repository 
with a special Uby annotation type that e.g. holds an id which can be used to 
access all the wealth of Uby information if you need it.

Original issue reported on code.google.com by torsten....@gmail.com on 26 Jun 2013 at 10:46

GoogleCodeExporter commented 9 years ago
The simplest case of annotating text with Uby information is to annotate tokens 
(based on their lemmas).

There is a lot that can be annotated at the token level. If you just consider 
semantic tags, a wide variety of different "semantic tagsets" can be derived 
from Uby and used for tagging.

Therefore, my impression was that it might be useful to record which specific 
"semantic Uby tagset" was used for tagging.

>> However, Uby will probably be only one possible data source for such 
information.

Sure, the information that is annotated is not Uby-specific at all. I just 
mention Uby here, because it is the only lexical resource I am working with 
(quite ok, since it contains 10 lexical resources ...)
So there is no need to mention Uby anywhere in the type names.

>> You want Uby specific stuff. 
Actually, I can not think of any Uby-specific stuff to annotate. All that Uby 
provides is ordinary lexical information, but at a scale that is typically not 
reachable by single lexical resources.

Original comment by eckle.kohler on 26 Jun 2013 at 11:33

GoogleCodeExporter commented 9 years ago
Would this require disambiguation first?
I guess that semantic tags are quite specific to senses.

Original comment by torsten....@gmail.com on 26 Jun 2013 at 11:39

GoogleCodeExporter commented 9 years ago
That depends on the specific semantic tagset used for annotating. 

There are cases where disambiguation is not necessary or very simple.
For other semantic tags, the annotator might have to perform some kind of WSD.

Original comment by eckle.kohler on 26 Jun 2013 at 11:51

GoogleCodeExporter commented 9 years ago
Great. I am looking forward to the prototype.

Original comment by torsten....@gmail.com on 26 Jun 2013 at 11:52

GoogleCodeExporter commented 9 years ago
I wonder how we'll do the interfacing between DKPro Core and Uby:

a) have a "uby" module in DKPro Core with a couple of annotators
b) have a "uima" module in Uby with a couple of annotators
c) define resource APIs (e.g. "Dictionary") and generic annotators (e.g. 
"DictionaryAnnotator") in DKPro Core and provide implementations of that in Uby.

I think "c" would definitely be the coolest one.

Original comment by richard.eckart on 30 Jun 2013 at 5:12
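
Option c) can be illustrated with a small sketch in plain Java. This is not code from DKPro Core or Uby: the names Dictionary, MapDictionary and DictionaryAnnotatorSketch, and the lookup method, are hypothetical, and a map stands in for the lexical resource. It only shows the shape of the idea, namely that the generic annotator depends on an interface while the resource project supplies the implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical resource API that would live on the DKPro Core side.
interface Dictionary {
    Optional<String> lookup(String lemma);
}

// Hypothetical implementation that a resource project such as Uby could supply.
// A plain map stands in for the real lexical database.
class MapDictionary implements Dictionary {
    private final Map<String, String> entries = new HashMap<>();

    MapDictionary put(String lemma, String tag) {
        entries.put(lemma, tag);
        return this;
    }

    @Override
    public Optional<String> lookup(String lemma) {
        return Optional.ofNullable(entries.get(lemma));
    }
}

// Generic annotator: it only knows the Dictionary interface, not the data source.
class DictionaryAnnotatorSketch {
    private final Dictionary dictionary;

    DictionaryAnnotatorSketch(Dictionary dictionary) {
        this.dictionary = dictionary;
    }

    // Returns "lemma/tag" for every lemma the dictionary knows, in input order.
    List<String> annotate(List<String> lemmas) {
        List<String> result = new ArrayList<>();
        for (String lemma : lemmas) {
            dictionary.lookup(lemma).ifPresent(tag -> result.add(lemma + "/" + tag));
        }
        return result;
    }
}
```

Swapping the data source then means passing a different Dictionary implementation to the annotator; the annotator code itself stays untouched.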

GoogleCodeExporter commented 9 years ago
I also like c) as it aligns best with the "Uby is an excellent source for 
information xyz, but certainly not the only one" paradigm discussed above.

Original comment by torsten....@gmail.com on 30 Jun 2013 at 5:57

GoogleCodeExporter commented 9 years ago
c) +1

BTW: does this still fit with a UbyResourceLocator in uby? (which is living 
there already in a uima module created today)

Original comment by eckle.kohler on 30 Jun 2013 at 6:00

GoogleCodeExporter commented 9 years ago
Sure, why not. I imagine for somebody wanting to code a custom component (not 
resource) using Uby, the locator should be convenient.

At this point, I couldn't say whether it would be more convenient for a 
hypothetical "UbyDictionary" to use it or to have its own internal Uby instance. 

Original comment by richard.eckart on 30 Jun 2013 at 6:07

GoogleCodeExporter commented 9 years ago
I have a couple of questions and remarks regarding the DKPro-Core part of the 
UBY-Core Interface:

- as a name for the generic interface I would prefer SemanticLabelProvider 
instead of Dictionary. I see many similarities to the FrequencyCountProvider in 
DKPro-Core, whereas Dictionary seems to be too focussed on the use of 
dictionaries in my opinion.
This interface would define a method 
String getSemanticLabel(String lemma, String POS, String semanticLabelType)

These parameters are actually necessary to implement a generic interface which 
can also be implemented by a UbySemanticLabelProvider.

Regarding the Dictionary interface in decompounding, I have a number of 
questions and comments that might be discussed elsewhere.

- Is it necessary to implement the UbySemanticLabelProvider as a UIMA resource, 
i.e. subclassing Resource_ImplBase in uimaFIT? The FrequencyCountProvider seems 
not to be implemented this way.

- I definitely need an annotation type such as SemanticLabel or 
SemanticCategory with two features, namely 
type (type of the semantic label/category) and
value (value of the semantic label/category).

SemanticLabel might sound too UBY specific. However, the type would be very 
general:

Examples:
type=semanticField, value=location, person, ... 
type=domain, value=Computer, Education, Chemistry, ...

I tried to motivate that already in this discussion:
https://groups.google.com/forum/#!searchin/dkpro-core-developers/uby/dkpro-core-developers/_eCGNb8bUdE/gvV3loucYpAJ

but within this discussion, a kind of misunderstanding occurred.

The new annotation type I need would be quite general and not UBY-specific and 
not at all related to the Types which are already available for Named Entities.

A UbySemanticLabelAnnotator will annotate the following word classes with a 
semantic category or label: common nouns, main verbs, adjectives.
It will not annotate any proper nouns.

I could also introduce such an annotation type in Uby. But that might be a 
first step to a parallel type system.

Best
Judith

Original comment by eckle.kohler on 28 Jul 2013 at 8:00
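
For illustration, the proposed interface could look roughly like the sketch below. Only the method signature getSemanticLabel(lemma, POS, semanticLabelType) comes from the comment above. The in-memory implementation, the add helper, the "UNKNOWN" fallback and the POS value "NN" are assumptions made for this example; a UbySemanticLabelProvider would query Uby instead of a map.

```java
import java.util.HashMap;
import java.util.Map;

// Interface as proposed in the comment above.
interface SemanticLabelProvider {
    String getSemanticLabel(String lemma, String pos, String semanticLabelType);
}

// Hypothetical in-memory implementation, used here only to show the contract.
class InMemorySemanticLabelProvider implements SemanticLabelProvider {
    // Assumption: unknown entries yield a marker value instead of null.
    static final String UNKNOWN = "UNKNOWN";

    private final Map<String, String> labels = new HashMap<>();

    // Builds a composite key from the three lookup parameters.
    private static String key(String lemma, String pos, String type) {
        return type + '\t' + pos + '\t' + lemma;
    }

    void add(String lemma, String pos, String type, String label) {
        labels.put(key(lemma, pos, type), label);
    }

    @Override
    public String getSemanticLabel(String lemma, String pos, String semanticLabelType) {
        return labels.getOrDefault(key(lemma, pos, semanticLabelType), UNKNOWN);
    }
}
```

The same lemma can carry different labels per label type, e.g. one entry for type "semanticField" and another for type "domain", which is why the type is part of the lookup key.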

GoogleCodeExporter commented 9 years ago
Regarding a new annotation type for semantic field information from WordNet:
This kind of lexical information is actually well established in papers that 
use lexical resources for IE or Text Classification.

However, this information goes by different names in the literature:
- WordNet lexicographer file names (the very literal name of these tags)
- supersenses, supersense tagging
- semantic fields

I searched on the ACL anthology workbench to get some evidence:

http://aclasb.dfki.de/#txt~p|WordNet%20supersense* (17 hits)

http://aclasb.dfki.de/#txt~p|WordNet%20semantic%20field*doc~W04-0813*

They use semantic field features as well:
Dirk Hovy, Shashank Shrivastava, Sujay Kumar Jauhar, Mrinmaya Sachan, Kartik 
Goyal, Huying Li, Whitney Sanders and Eduard Hovy: Identifying Metaphorical 
Word Use with Tree Kernels. NAACL HLT Meta4NLP Workshop, 2013.

I used this annotation too (extensively) in recent research (with good results).

So a type SemanticField with a "value" feature might be something worth 
considering.

Judith

Original comment by eckle.kohler on 31 Jul 2013 at 7:34

GoogleCodeExporter commented 9 years ago
Here is my plan:

- create a new package dictionaryannotator.semantictagging in the module 
dictionaryannotator-asl

- add to this new package: an Interface SemanticTagProvider, a UIMA resource 
SimpleSemanticTagProvider and an annotator SimpleSemanticTagAnnotator that uses 
a key-value map as resource (retrieved from a file). The annotator will use the 
Named entity type for now or another generic one.

- add test cases for the SimpleSemanticTagAnnotator

The other side of the interface will go to UBY:

- create a new module uby.core-asl

- add resources that inherit from Resource_ImplBase and implement the 
SemanticTagProvider: a UbySemanticFieldProvider, UbySemanticFrameProvider, 
UbyDomainProvider

- add the corresponding annotators that annotate tokens (phrases will be 
considered later) with these tags
(I will use existing annotation types for now)

any objections?

Original comment by eckle.kohler on 2 Aug 2013 at 1:31
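
The "key-value map retrieved from a file" in the plan above could be loaded along the following lines. This is a sketch only: the thread does not describe the file format, so the tab-separated "lemma<TAB>tag" layout, the "#" comment lines and the class name KeyValueTagMap are assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical loader for a lemma-to-tag map.
class KeyValueTagMap {
    // Assumed format: one "lemma<TAB>tag" pair per line.
    // Blank lines, lines starting with '#', and lines without a tab are skipped.
    static Map<String, String> load(Reader source) {
        Map<String, String> map = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) {
                    continue;
                }
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    map.put(parts[0], parts[1]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return map;
    }
}
```

Taking a Reader rather than a file path keeps the loader easy to test with a StringReader and leaves the decision about where the resource lives to the caller.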

GoogleCodeExporter commented 9 years ago
For the first shot, I'd suggest keeping all of the stuff in one module, either 
on the Uby or on the DKPro Core side. I'd suggest dumping it into the 
dictionaryannotator module right now. Moving code around to better places 
and/or renaming can be done when it works.

Original comment by richard.eckart on 2 Aug 2013 at 1:35

GoogleCodeExporter commented 9 years ago
I finished the first round and implemented 

- SemanticTagProvider (Interface)
- NounSemanticFieldResource
- NounSemanticFieldAnnotator

and a test class for the annotator:
- NounSemanticFieldAnnotatorTest along with a tiny test resource 
nounSemanticFieldMapTest.txt

In the test class I use the AssertAnnotations.assertNamedEntity convenience 
method from testing-asl. However, my test only turned green when I added a 
modified version of assertNamedEntity without the parameter aExpectedMapped.
In my case, there is no mapping between original and DKPro-Core NE values/types.

The method I added looks like this:
public static void assertNamedEntity(String[] aExpectedOriginal,
            Collection<NamedEntity> aActual)

Isn't there a way to use the original method

assertNamedEntity(String[] aExpectedMapped, String[] aExpectedOriginal,
            Collection<NamedEntity> aActual)

in a way that does not assume a mapping? I tried several versions with 
aExpectedMapped and aExpectedOriginal set to the same String[], but it did not 
work.

Otherwise, can I add the 

public static void assertNamedEntity(String[] aExpectedOriginal,
            Collection<NamedEntity> aActual)

to AssertAnnotations?

Judith

Original comment by eckle.kohler on 4 Aug 2013 at 11:48

GoogleCodeExporter commented 9 years ago
Did you try passing "null" as aExpectedMapped? Looking at the method, it 
should ignore that argument if it is null.

Original comment by richard.eckart on 4 Aug 2013 at 11:52

GoogleCodeExporter commented 9 years ago
yes, I did and it does not work:

AssertAnnotations.assertNamedEntity(null,documentNounSemanticFields,
        select(aJCas, NamedEntity.class));

yields

java.lang.NullPointerException
    at java.util.Arrays$ArrayList.<init>(Arrays.java:2842)
    at java.util.Arrays.asList(Arrays.java:2828)
    at de.tudarmstadt.ukp.dkpro.core.testing.AssertAnnotations.assertNamedEntity(AssertAnnotations.java:199)
    at de.tudarmstadt.ukp.dkpro.core.dictionaryannotator.semantictagging.NounSemanticFieldAnnotatorTest.runAnnotatorTest(NounSemanticFieldAnnotatorTest.java:109)
    at de.tudarmstadt.ukp.dkpro.core.dictionaryannotator.semantictagging.NounSemanticFieldAnnotatorTest.testGermanSeparatedParticles(NounSemanticFieldAnnotatorTest.java:37)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)

Original comment by eckle.kohler on 4 Aug 2013 at 9:30

GoogleCodeExporter commented 9 years ago
I've fixed the NPE in assertNamedEntity for your case.

Original comment by richard.eckart on 5 Aug 2013 at 8:47
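
The stack trace above shows the NullPointerException being raised inside Arrays.asList, which is what happens when a null array reaches it. The actual change to AssertAnnotations is not shown in this thread, so the following is only a generic sketch of the kind of guard that makes a "null means do not check" argument work. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not the actual AssertAnnotations code.
class NullTolerantAssert {
    // Compares expected and actual values; a null expected array skips the check.
    static void assertValues(String[] expected, List<String> actual) {
        if (expected == null) {
            return; // caller opted out of this comparison
        }
        // Only safe after the null check: Arrays.asList throws on a null array.
        List<String> expectedList = Arrays.asList(expected);
        if (!expectedList.equals(actual)) {
            throw new AssertionError("expected " + expectedList + " but was " + actual);
        }
    }
}
```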

GoogleCodeExporter commented 9 years ago
Thanks for fixing the assertNamedEntity, Richard.

I have a question regarding the key/value resource file that contains the noun 
lemmas and their WordNet semantic field. Where should this resource go? Are 
there any naming conventions for such files?
The size of the file is 2.3 MB.

Original comment by eckle.kohler on 5 Aug 2013 at 8:05

GoogleCodeExporter commented 9 years ago
I thought the idea was to access the Uby database directly?

Otherwise, I suppose this would be a resource to be packaged as a JAR file and 
to go into the Maven repository.

Original comment by richard.eckart on 5 Aug 2013 at 8:15

GoogleCodeExporter commented 9 years ago
>> I thought the idea was to access the Uby database directly?

right, this is the idea.

The file resource with the WordNet semantic fields just turned out to be very 
useful and broadly applicable, so I extracted this information into a file for 
efficiency reasons.
I also thought other people might be interested in using it, because it 
does not require installing a database.

Now I will implement 2 UBY specific pairs of resources and annotators:
- UbySemanticPredicateResource and UbySemanticPredicateAnnotator (will use the 
type SemanticPredicate)
- UbyDomainLabelResource and UbyDomainLabelAnnotator (will use the type field 
from api.structure)

These will access the UBY DB directly and also exploit the sense links in 
particular ways.

Original comment by eckle.kohler on 6 Aug 2013 at 3:36

GoogleCodeExporter commented 9 years ago
So currently, we have these build.xml files which download resources from their 
original websites, package them, and upload them to our Maven repository. If 
there is no "original website" for a resource, e.g. for your list, we so far 
host them in the downloads section of the DKPro Core ASL google project (which 
will go away soon, so some different hosting location will be required).

Original comment by richard.eckart on 6 Aug 2013 at 8:44

GoogleCodeExporter commented 9 years ago
For the UBY specific resources I need to create a mapping between

- Core POS tags and UBY POS tags
- Core language information (ISO 2-letter code) and UBY language information 
(ISO 3-letter code)

Is it sensible to assume that for all the POS taggers integrated in DKPro-Core 
(English and German), a mapping exists that maps the original POS tags to Core 
POS types?

Original comment by eckle.kohler on 10 Aug 2013 at 5:12
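
For the language part of this mapping, the JDK can already convert a 2-letter code into a 3-letter one via java.util.Locale. The sketch below assumes that this conversion is sufficient. Note that for German the JDK returns "deu", whereas some resources use "ger"; which variant Uby expects is not stated in this thread, so that would need checking. The class name LanguageCodes is hypothetical.

```java
import java.util.Locale;
import java.util.MissingResourceException;

// Hypothetical helper for converting 2-letter language codes to 3-letter codes.
class LanguageCodes {
    static String toIso3(String iso2) {
        String iso3;
        try {
            iso3 = Locale.forLanguageTag(iso2).getISO3Language();
        } catch (MissingResourceException e) {
            throw new IllegalArgumentException("No 3-letter code for: " + iso2, e);
        }
        if (iso3.isEmpty()) {
            // An ill-formed or empty tag yields a locale without a language.
            throw new IllegalArgumentException("Not a language code: " + iso2);
        }
        return iso3;
    }
}
```

With this helper, toIso3("en") gives "eng" and toIso3("de") gives "deu". The POS side has no such built-in and would need an explicit table from Core POS types to UBY POS tags.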

GoogleCodeExporter commented 9 years ago
German POS models usually use STTS and English POS models usually use PTB. Both 
are mapped. 

Are the UBY POS tags language specific?

Original comment by richard.eckart on 10 Aug 2013 at 5:15

GoogleCodeExporter commented 9 years ago
>> German POS models usually use STTS and English POS models usually use PTB. 
Both are mapped. 

fine.

>> Are the UBY POS tags language specific?
No, they are designed to be language-independent. 

But a Uby-specific resource that implements the getSemanticTag method needs POS 
and lemma information to access the lexical entry, and the language information 
to pre-select the Uby lexicon to use.

The language information is important in order to throw appropriate exceptions 
that inform the user if, e.g., the German lexicon GermaNet is missing in UBY.

Original comment by eckle.kohler on 10 Aug 2013 at 5:28
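
The pre-selection by language and the informative failure described above might be sketched as follows. Nothing here is Uby API: LexiconRegistry, register and select are hypothetical names, each lexicon is reduced to a lemma-to-tag map, and the language keys "eng" and "deu" are assumed 3-letter codes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry that picks a lexicon by language before any lookup happens.
class LexiconRegistry {
    private final Map<String, Map<String, String>> lexiconsByLanguage = new HashMap<>();

    void register(String language, Map<String, String> lexicon) {
        lexiconsByLanguage.put(language, lexicon);
    }

    // Fails early with a message naming the missing language,
    // instead of silently returning no tags.
    Map<String, String> select(String language) {
        Map<String, String> lexicon = lexiconsByLanguage.get(language);
        if (lexicon == null) {
            throw new IllegalStateException(
                    "No lexicon installed for language '" + language + "'");
        }
        return lexicon;
    }
}
```

Failing at selection time means a missing German lexicon is reported once, up front, rather than showing up as a document with no annotations.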

GoogleCodeExporter commented 9 years ago
Issue 169. Committed UbySemanticFieldResource, UbySemanticFieldAnnotator and 
UbyResourceUtils

The test class UbySemanticFieldAnnotatorTest successfully runs a test on a real 
(MySQL) DB, therefore the test method is ignored.
A suitable test case for an in-memory UBY DB should be added.

Original comment by eckle.kohler on 12 Aug 2013 at 8:40

GoogleCodeExporter commented 9 years ago
test case for an in-memory UBY DB was added.
see http://code.google.com/p/dkpro-core-asl/source/detail?r=1791

Original comment by eckle.kohler on 18 Aug 2013 at 1:42

GoogleCodeExporter commented 9 years ago

Original comment by richard.eckart on 12 Sep 2013 at 7:59

GoogleCodeExporter commented 9 years ago

I think the NounSemanticFieldAnnotator and the NounSemanticFieldAnnotatorTest 
can be removed.

Additional parameters that could be added to the SemanticFieldAnnotator:

- maybe language (?)
- token vs. phrase annotation

Original comment by eckle.kohler on 14 Sep 2013 at 8:40

GoogleCodeExporter commented 9 years ago

Original comment by richard.eckart on 17 Sep 2013 at 2:42

GoogleCodeExporter commented 9 years ago

Original comment by richard.eckart on 26 Mar 2014 at 10:51

GoogleCodeExporter commented 9 years ago
I believe we do now have implementations of the ideas presented here on the 
side of DKPro Core in the dictionaryannotator module and on the side of Uby in 
the form of resources that can be used with the dictionaryannotator code, 
right? If so, this could be resolved.

Original comment by richard.eckart on 26 May 2014 at 10:17

GoogleCodeExporter commented 9 years ago
Separate issues could be opened for specific extensions, e.g. for passing the 
language through.

Original comment by richard.eckart on 26 May 2014 at 10:18

GoogleCodeExporter commented 9 years ago
>> I believe we do now have implementations of the ideas presented here on the 
side of DKPro Core in the dictionaryannotator module and on the side of Uby in 
the form of resources that can be used with the dictionaryannotator code, right?

Actually, this issue should be closed as won't fix.
Another issue could be opened titled "Tag text with information from 
wordlists". And this issue can be marked as resolved.

The resource AND annotators that tag text with information from Uby have been 
moved to Uby. The reason for this was the fact that Uby is not yet on Maven 
Central.

>> Separate issues could be opened for specific extensions, e.g. for passing 
the language through.

Right.
Another extension would be to tag not only tokens, but also noun chunks.
I have already implemented that, but I would need help setting up the test 
case, because last time I could not find out how chunks are composed/built in a 
test case.

Original comment by eckle.kohler on 27 May 2014 at 6:50

GoogleCodeExporter commented 9 years ago
Renaming and closing as fixed.

Original comment by richard.eckart on 27 May 2014 at 8:08