dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core

#1454 - German OpenNLP chunker model #1459

Closed aggarwalpiush closed 1 year ago

aggarwalpiush commented 4 years ago

PR is created

Kindly review.

ukp-svc-jenkins commented 4 years ago

Can one of the admins verify this patch?

reckart commented 4 years ago

@aggarwalpiush I'm looking into the PR - and also trying to add a few additional IXA models in the process.

In particular your chunker model is giving problems though - because of the SSL certificate used on your webserver. I get this error when trying to download your model using the build script:

javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

See also: https://www.sslshopper.com/ssl-checker.html#hostname=https://www.ltl.uni-due.de/content/6-software/de-chunker-opennlp.bin

reckart commented 4 years ago

I had a look at the IXA models - these models that come from the morph-models-1.5.0 are not meant to be used with OpenNLP but rather with the IXA pipes from the IXA module we also have. So instead of updating the build scripts for the models from the morph-models-1.5.0 we should actually drop them from the OpenNLP module. They are already included in the IXA module.

reckart commented 4 years ago

Cf. https://github.com/dkpro/dkpro-core/issues/1465

aggarwalpiush commented 4 years ago

> @aggarwalpiush I'm looking into the PR - and also trying to add a few additional IXA models in the process.
>
> In particular your chunker model is giving problems though - because of the SSL certificate used on your webserver. I get this error when trying to download your model using the build script:
>
> javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
>
> See also: https://www.sslshopper.com/ssl-checker.html#hostname=https://www.ltl.uni-due.de/content/6-software/de-chunker-opennlp.bin

@reckart I have added the missing certificates to the LTL webserver. Could you please try downloading the model from the webserver again using the build script?

aggarwalpiush commented 4 years ago

> Cf. #1465

I can see that the IXA models have already been removed from the build script as part of that issue. If everything works now, can we merge the changes?

reckart commented 4 years ago

I have updated this PR with a couple of changes - please have a look.

I assume the POS tags used to train the model were of the STTS tagset?

Are the chunk tags a part of the TIGER corpus? Where are they documented? I didn't find a documentation of them in the syntax annotation guidelines.

reckart commented 4 years ago

I still cannot upload the model because the maven-ant-task tries to access Maven Central in the process and does so via http, but http is no longer supported, only https. We either need to figure out how to reconfigure it to use https or switch to a newer Ant Maven task, because the one we currently use is deprecated.

aggarwalpiush commented 4 years ago

> I have updated this PR with a couple of changes - please have a look.

@reckart The changes look good to me.

> I assume the POS tags used to train the model were of the STTS tagset?
>
> Are the chunk tags a part of the TIGER corpus? Where are they documented? I didn't find a documentation of them in the syntax annotation guidelines.

@mariebexte Could you please provide these details?

aggarwalpiush commented 4 years ago

> I cannot upload the model still because the maven-ant-task tries to access Maven Central in the process and does so via http - but http is no longer supported, only https. We either need to figure out how to reconfigure it to use https or maybe switch to another newer ant maven task because the one we currently use is deprecated.

I don't have read access to the issue tracker for the Apache Ant tasks, but I believe this problem was resolved in release 2.0.7 under MANTTASKS-11. Can we check whether that release actually fixes it?

reckart commented 4 years ago

I have seen MANTTASKS-11, but I haven't yet been able to look into how to actually configure an alternative URL for Maven Central. I also noticed that the tasks we use are outdated anyway and that https://maven.apache.org/resolver-ant-tasks/ is the new replacement, so I wasn't sure whether fixing the setup with the old tasks is worth investigating.

Do you want to investigate?

mariebexte commented 4 years ago

> I assume the POS tags used to train the model were of the STTS tagset? Are the chunk tags a part of the TIGER corpus? Where are they documented? I didn't find a documentation of them in the syntax annotation guidelines.

> @mariebexte could you please provide these details

The POS tags are STTS.

As for the chunk tags, these are part of the TIGER corpus. The pdf including their documentation comes with downloading the corpus, but can also be found online here.

reckart commented 4 years ago

@mariebexte I didn't read the full documentation - I only searched for "chunk" and that was the only sentence I found:

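Fremdsprachliche Zitate werden als Chunks (CH) flach annotiert; die einzelnen Komponenten erhalten das Label UC (“unit component”). [English: Foreign-language quotations are annotated flat as chunks (CH); the individual components receive the label UC (“unit component”).]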

It seems to me that the chunks are some kind of projection of the phrase categories to the word level that you did yourself?

mariebexte commented 4 years ago

> It seems to me that the chunks are some kind of projection of the phrase categories to the word level that you did yourself?

Yes, you're right. We used NPs, VPs and PPs for the respective chunks.

This meant giving each token a B-[NP|VP|PP] (beginning of chunk), I-[NP|VP|PP] (continuation of chunk) or O (not part of a chunk) annotation, derived from the NP, VP and PP annotations in TIGER. So, if TIGER annotated two tokens A and B as an NP, we would annotate A as the beginning of the chunk (B-NP) and B as I-NP.
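
A minimal sketch of the projection described above, for illustration only. It assumes the phrases are already available as flat, non-overlapping token spans; how nested TIGER phrases were flattened is not stated in this thread. The function name `phrases_to_bio` and the example spans are hypothetical and not taken from the actual training script or from TIGER.

```python
from typing import List, Tuple

# Hypothetical representation: (label, start_token, end_token_exclusive).
Phrase = Tuple[str, int, int]

CHUNK_LABELS = {"NP", "VP", "PP"}  # the phrase categories named in the thread


def phrases_to_bio(num_tokens: int, phrases: List[Phrase]) -> List[str]:
    """Project flat, non-overlapping phrase spans onto per-token BIO tags."""
    tags = ["O"] * num_tokens  # tokens outside any chunk stay "O"
    for label, start, end in phrases:
        if label not in CHUNK_LABELS:
            continue  # other phrase categories do not become chunks
        if not 0 <= start < end <= num_tokens:
            raise ValueError(f"span out of range: {(label, start, end)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(label, start, end)}")
        tags[start] = f"B-{label}"  # first token opens the chunk
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens continue it
    return tags


# Made-up spans for illustration, not real TIGER annotations.
tokens = ["Wir", "brauchen", "kein", "einfaches", "Beispiel", "."]
phrases = [("NP", 0, 1), ("VP", 1, 2), ("NP", 2, 5)]
tags = phrases_to_bio(len(tokens), phrases)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

With this scheme, a token only receives an I- tag if the source annotation groups it with the preceding token in the same NP, VP or PP span.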

reckart commented 4 years ago

@mariebexte I added a unit test with your model. Here are the results:

Text: Wir brauchen ein sehr kompliziertes Beispiel , welches möglichst viele Konstituenten und Dependenzen beinhaltet .

                "[  0,  3]NC(NP) (Wir)",
                "[  4, 12]VC(VP) (brauchen)",
                "[ 13, 16]NC(NP) (ein)",
                "[ 36, 44]NC(NP) (Beispiel)",
                "[ 47, 54]NC(NP) (welches)",
                "[ 55, 64]VC(VP) (möglichst)",
                "[ 65, 70]NC(NP) (viele)",
                "[ 71, 84]NC(NP) (Konstituenten)",
                "[ 89,100]NC(NP) (Dependenzen)",
                "[101,111]VC(VP) (beinhaltet)"

It would seem as if something with the BIO encoding went wrong during training because the chunks all appear to be single-word chunks. Also, not all words are included in a chunk? Normally, a chunk consisting of multiple words should be returned as a larger span from the OpenNLP chunker. For comparison, here is an example from another model (en, perceptron-ixa):

Text: We need a very complicated example sentence, which contains as many constituents and dependencies as possible.

                "[  0,  2]NC(NP) (We)",
                "[  3,  7]VC(VP) (need)",
                "[  8, 43]NC(NP) (a very complicated example sentence)",
                "[ 45, 50]NC(NP) (which)",
                "[ 51, 59]VC(VP) (contains)",
                "[ 60, 62]O(SBAR) (as)",
                "[ 63, 97]NC(NP) (many constituents and dependencies)",
                "[ 98,100]PC(PP) (as)",
                "[101,109]ADJC(ADJP) (possible)"

mariebexte commented 4 years ago

> It would seem as if something with the BIO encoding went wrong during training because the chunks all appear to be single-word chunks. Also, not all words are included in a chunk?

When I chunk the same sentence on the command line (using opennlp POSTagger with de-pos-maxent and then opennlp ChunkerME with the model), tokens that are not part of a chunk are also returned:

[NP Wir_PPER ]
[VP brauchen_VVFIN ]
[NP ein_ART ]
sehr_ADV
kompliziertes_ADJA
[NP Beispiel_NN ]
,_$,
[NP welches_PRELS ]
[VP möglichst_VVFIN ]
[NP viele_PIAT ]
[NP Konstituenten_NN ]
und_KON
[NP Dependenzen_NN ]
[VP beinhaltet_VVFIN ]
._$.

I agree that it is not desired to end up with this many single-word chunks, so I'll have to dig into TIGER to see whether that's an issue caused by how it was annotated or if something went wrong with our BIO-tags. In general, the model is capable of returning multi-word chunks:

Text: Wir brauchen kein einfaches Beispiel .
[NP Wir_PPER ] 
[VP brauchen_VVFIN ] 
[NP kein_PIAT einfaches_ADJA Beispiel_NN ] 
._$.
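
How per-token BIO tags turn into the bracketed chunks above can be sketched as follows. This is not OpenNLP's implementation, only an illustration of the decoding idea; `bio_to_spans` is a hypothetical helper, and the two tag sequences are assumed to be consistent with the bracketed output rather than read from the model's actual predictions.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int, str]  # (start_token, end_token_exclusive, label)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Merge per-token BIO tags into chunk spans."""
    spans: List[Span] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last chunk
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))  # close the open chunk
            start, label = None, None
        # "B-" always opens a chunk; a stray "I-" is leniently treated as an opener.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


# Assumed tags for "kein einfaches Beispiel": one three-token NP.
merged = bio_to_spans(["B-NP", "I-NP", "I-NP"])
# Assumed tags for "ein sehr kompliziertes Beispiel": two one-token NPs with a gap.
split = bio_to_spans(["B-NP", "O", "O", "B-NP"])
print(merged)
print(split)
```

A multi-word chunk only appears when the tagger predicts I- tags after a B- tag; O tags in between, or a fresh B- tag on every token, produce the single-word chunks seen in the longer example sentence.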

reckart commented 4 years ago

Do you want to provide an updated model or should it be merged as it is?

mariebexte commented 4 years ago

Sorry for not getting back to you earlier.

I am afraid the results we discussed are due to how phrases are annotated in TIGER, so I won't be able to provide an updated model.

reckart commented 4 years ago

Jenkins, can you test this please?