dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core

self-built TreeTagger does not assign POS Subtypes #402

Closed reckart closed 9 years ago

reckart commented 9 years ago
Hi,

I'm using a self-built TreeTagger model as described on your homepage. However, it
only assigns the 'POS' tag as annotations, but not the subtypes (N, NN,...).

Yours,
Laura

Original issue reported on code.google.com by Steinert.Laura on 2014-06-04 09:38:00

reckart commented 9 years ago
If you explicitly specify a model using PARAM_MODEL_LOCATION or PARAM_MODEL_PATH (depending
on the DKPro Core version), you should specify a mapping file using PARAM_TAGGER_MAPPING_LOCATION.

Maybe there is already a suitable mapping file included with DKPro Core that you could
use. Are you using a standard tagset? If so, which one?

Original issue reported on code.google.com by richard.eckart on 2014-06-04 09:47:33

reckart commented 9 years ago
I was unable to pass the model explicitly via a variable (it's a project that is called
from another project...). Therefore, I copied all the built TreeTagger files into my
resources folder.

However, I don't think that I have a mapping file (neither included in DKPro Core nor
in the built TreeTagger files). Where could I get one and how would I have to specify
it in my source code/project properties? I want to get the POS tags.

Original issue reported on code.google.com by Steinert.Laura on 2014-06-05 13:13:19

reckart commented 9 years ago
I need to better understand what you did and what you are trying to do.

If you just want to access the pos tags that the treetagger produces, that is easy.
To get the POS tag of a token, you can do this:

token.getPos().getPosValue()

If this is all you want, you can stop here.

A mapping is only required if you want to use the coarse-grained POS types that you
could use in a statement like

JCasUtil.select(jcas, N.class)

Apparently you already managed to train a model and to instruct the DKPro Core treetagger
component to use it. To help you further, I would need to know how you configured the
TreeTagger component to use your model. E.g. what exactly are you referring to when
you say that you used a self-built model as described on our homepage, and I would
need to know how you configure and invoke the TreeTaggerPosLemmaTT4J component.

Original issue reported on code.google.com by richard.eckart on 2014-06-05 20:23:08

reckart commented 9 years ago
Okay, I want to use a keyphrase extractor from DKPro Keyphrases, e.g. the
PositionBaselineExtractor. These, however, need the POS tags to filter the tokens.
Here's an example code snippet:

Candidate nounTokens = new Candidate(CandidateType.Token, PosType.N);
KeyphraseExtractor_ImplBase positionBaselineExtractor = new PositionBaselineExtractor();
positionBaselineExtractor.setCandidate(nounTokens);
AnalysisEngine extractor = positionBaselineExtractor.getKeyphraseEngine();        
JCas jcas = extractor.newJCas();
jcas.setDocumentText(text);        
extractor.process(jcas);       
JCasUtil.select(jcas, Keyphrase.class);

This code uses the TreeTagger. In theory it should only return nouns as keyphrases,
however I receive all words of the input text regardless of POS tag.
Therefore, I checked what POS tags the jcas holds with:

for (POS pos : JCasUtil.select(jcas, POS.class)) {
    System.out.println(pos);
}

And this gives me only 'POS' as tags. Hence, it does not know any subtypes, such as
'NN' or 'N'. If I change the createTagger method of the DKPro Keyphrases class
'KeyphraseExtractor_ImplBase' to use an OpenNlpPosTagger instead, the
PositionBaselineExtractor works the way it should.

I built the TreeTagger as described here:
http://code.google.com/p/dkpro-core-asl/wiki/PackagingResources

Original issue reported on code.google.com by Steinert.Laura on 2014-06-06 08:13:52

reckart commented 9 years ago
Info about DKPro Keyphrases

In KeyphraseExtractor_ImplBase the TreeTagger is invoked like this:

return createEngineDescription(
        TreeTaggerChunkerTT4J.class,
        TreeTaggerChunkerTT4J.PARAM_LANGUAGE, getLanguage()
);

So the model is not explicitly added, but loaded via the CAS language.

Original issue reported on code.google.com by torsten.zesch on 2014-06-06 08:30:12

reckart commented 9 years ago
You said that you are using a self-trained treetagger model. Is it correct that you
extended the build.xml file to package your own model as a jar? (cf. [1])

[1] https://code.google.com/p/dkpro-core-asl/wiki/ResourceProviderAPI#Packaging_resources_as_JARs

Original issue reported on code.google.com by richard.eckart on 2014-06-06 22:09:14

reckart commented 9 years ago
No, that is a misunderstanding. I used the build.xml as given by the project. I do not
use any self-trained model. All I did was package the TreeTagger with the build.xml
and copy it into the resource folder of my own project.

Original issue reported on code.google.com by Steinert.Laura on 2014-06-10 08:24:15

reckart commented 9 years ago
The build.xml file creates various JARs in the folder called "target". Instead of copying
these JARs to your resources folder, add them to your classpath. In Eclipse you can
do this e.g. by right-clicking on them and selecting "Build path -> Add to build path".
Alternatively, they can be added via Maven.

If you are using DKPro Core 1.5.x, there should be one JAR per language. If you add
that, it should also give you the pos tag mapping.

If you are using DKPro Core 1.6.x, there should be two JARs per model, one "model"
JAR and one "upstream" JAR. Make sure to have added both to the classpath in order
to get the mapping.

If this does not help:
There might not be a mapping for all languages and all models. Which language are you
processing?

Original issue reported on code.google.com by richard.eckart on 2014-06-10 09:05:06

reckart commented 9 years ago
I now added the treetagger-bin jar as well as the treetagger-model-en jar to my build
path. However, the problem remains. I am processing English texts.

Original issue reported on code.google.com by Steinert.Laura on 2014-06-10 13:18:41

reckart commented 9 years ago
Same problem here. When I try to get the coarse-grained POS tags with TreeTagger, I
only get "POS" instead of "N" or "V". I am trying to print the tags as follows:

for (Token tokenAnno : JCasUtil.select(jcas, Token.class)) {
    System.out.println(tokenAnno.getPos().getClass().getSimpleName());
}

The whole thing works for German, but does not work for English.
It does not even load the tagsets. The only way to get coarse-grained tags for English
is to map TreeTagger to a tagset explicitly (e.g. "en-pos.map").

Maybe you, Steinert, can test it for German? If this works, then maybe there
are problems with English texts and TreeTagger.

Original issue reported on code.google.com by onurs3232 on 2014-06-10 18:42:28

reckart commented 9 years ago
I'll look into it. Can you please tell me which version of DKPro Core you are using
(mind that -all- DKPro Core JARs in your projects should have the same version - you
should not mix versions) and the full JAR names (including the version) of
the model files that you are using? If you know, it might also be helpful to know the
URL/svn revision of the build.xml files that you used.
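
One way to spot mixed versions is to list the DKPro Core JAR names on the classpath and
group them by version. The following is only a sketch under assumptions: the class
VersionCheckSketch and the method versionsInUse are hypothetical names, and the file name
pattern (artifact id starting with "de.tudarmstadt.ukp.dkpro.core.", followed by "-" and
the version) is an assumption based on the artifact names mentioned in this thread.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VersionCheckSketch {
    // Assumed naming scheme, e.g. "de.tudarmstadt.ukp.dkpro.core.treetagger-asl-1.5.0.jar".
    private static final Pattern DKPRO_JAR = Pattern.compile(
            "(de\\.tudarmstadt\\.ukp\\.dkpro\\.core\\.[\\w.-]+?)-(\\d[\\w.-]*)\\.jar");

    /** Groups DKPro Core artifact ids by the version found in their JAR file names. */
    public static Map<String, List<String>> versionsInUse(List<String> jarNames) {
        Map<String, List<String>> byVersion = new TreeMap<>();
        for (String name : jarNames) {
            Matcher m = DKPRO_JAR.matcher(name);
            if (m.matches()) {
                // group(1) is the artifact id, group(2) the version string.
                byVersion.computeIfAbsent(m.group(2), v -> new ArrayList<>()).add(m.group(1));
            }
            // JARs that do not follow the assumed scheme (e.g. model JARs) are ignored.
        }
        return byVersion;
    }

    public static void main(String[] args) {
        // Collect the file names of all classpath entries of the running JVM.
        List<String> names = new ArrayList<>();
        String classpath = System.getProperty("java.class.path");
        for (String entry : classpath.split(Pattern.quote(File.pathSeparator))) {
            names.add(new File(entry).getName());
        }
        Map<String, List<String>> versions = versionsInUse(names);
        System.out.println(versions);
        if (versions.size() > 1) {
            System.out.println("Warning: more than one DKPro Core version on the classpath");
        }
    }
}
```

Running main inside the affected project prints one entry per version; more than one entry
would indicate mixed versions. Model JARs such as treetagger-model-en-20111109.1.jar use a
different naming scheme and are not covered by this check.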

Original issue reported on code.google.com by richard.eckart on 2014-06-10 20:54:02

reckart commented 9 years ago
Okay, here come the SVN revisions I'm using:

build.xml: 25
de.tudarmstadt.ukp.dkpro.core.treetagger-asl: 2281

The versions I use in my POM: de.tudarmstadt.ukp.dkpro.core.treetagger-asl: 1.5.0 (although
I checked the project out locally to build it, I still have the POM set to use a version
via a repository).

The names of the jars:
treetagger-bin-20131118.0.jar
treetagger-model-en-20111109.1.jar

Original issue reported on code.google.com by Steinert.Laura on 2014-06-11 07:56:34

reckart commented 9 years ago
Update: It's also working for me when using German texts.

Original issue reported on code.google.com by Steinert.Laura on 2014-06-11 08:18:53

reckart commented 9 years ago
treetagger-model-en-20111109.1.jar is a model for DKPro Core 1.6.0 [1]. It declares
the tagset "ptb-tt" which is not known to DKPro Core 1.5.0. Hence, DKPro Core 1.5.0
falls back to mapping every tag to the POS annotation type (and storing the actual
pos-tag only in the posValue feature of the POS annotation).
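
The fallback described above can be pictured with a small self-contained sketch. This is
not DKPro Core's actual code: the class TagMappingSketch, the method coarseType, and the
tag entries in the map are hypothetical and chosen only for illustration. Only the tagset
names "ptb" and "ptb-tt" and the generic type name "POS" come from this thread.

```java
import java.util.HashMap;
import java.util.Map;

public class TagMappingSketch {
    // Hypothetical registry: tagset name -> (fine-grained tag -> coarse type name).
    private static final Map<String, Map<String, String>> KNOWN_TAGSETS = new HashMap<>();

    static {
        // Illustrative entries only, not copied from a real mapping file.
        Map<String, String> ptb = new HashMap<>();
        ptb.put("NN", "N");
        ptb.put("NNS", "N");
        ptb.put("VB", "V");
        ptb.put("JJ", "ADJ");
        KNOWN_TAGSETS.put("ptb", ptb);
    }

    /**
     * Returns the coarse type name for a tag. If the tagset declared by the
     * model is unknown, or the tag has no entry, the generic "POS" is returned.
     */
    public static String coarseType(String tagset, String tag) {
        Map<String, String> mapping = KNOWN_TAGSETS.get(tagset);
        if (mapping == null) {
            return "POS"; // unknown tagset: every tag falls back to the generic type
        }
        return mapping.getOrDefault(tag, "POS");
    }

    public static void main(String[] args) {
        // "ptb" is known in this sketch, "ptb-tt" is not.
        for (String tagset : new String[] {"ptb", "ptb-tt"}) {
            for (String tag : new String[] {"NN", "VB"}) {
                System.out.println(tagset + " " + tag + " -> " + coarseType(tagset, tag));
            }
        }
    }
}
```

In this sketch the fine-grained tag itself is never lost; only the coarse type degrades to
"POS", which matches the observation that the posValue feature still holds the actual tag.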

You should use the build.xml file for DKPro Core 1.5.0 [2] which declares the "ptb"
tagset. 

No matter what build.xml file you use, you might find that some models have meanwhile
been updated on the TreeTagger homepage and some md5 hashes may no longer match that
build.xml file. Since TreeTagger models and binaries cannot be redistributed due to
license restrictions, this should not cause problems. If you care about versioning
and want to stay clear of potential version conflicts with future build.xml files,
I would recommend you add some suffix to the version, e.g. "20111109.0-steinert".

TreeTagger model packaging will change in DKPro Core 1.7.0 and then follow the packaging
conventions also used for other models/resources.

I assume this should resolve your problem. I am marking this issue as "invalid" because
it does not require changes or further actions on our part. If your problem is not
resolved or if you feel that further action on our part is necessary, feel free to
comment and reopen the issue.

[1] https://dkpro-core-asl.googlecode.com/svn/de.tudarmstadt.ukp.dkpro.core-asl/tags/de.tudarmstadt.ukp.dkpro.core-asl-1.6.0/de.tudarmstadt.ukp.dkpro.core.treetagger-asl/src/scripts/build.xml
[2] https://dkpro-core-asl.googlecode.com/svn/de.tudarmstadt.ukp.dkpro.core-asl/tags/de.tudarmstadt.ukp.dkpro.core-asl-1.5.0/de.tudarmstadt.ukp.dkpro.core.treetagger-asl/src/scripts/build.xml

Original issue reported on code.google.com by richard.eckart on 2014-06-11 19:28:30