dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core

Integrate Bart Coreference version 2.0 #258

Closed reckart closed 1 year ago

reckart commented 9 years ago
I got this mail via the corpora list and it sounds interesting

---

From: Olga Uryupina <uryupina@gmail.com>
Subject: [Corpora-List] BART coreference resolver: v2.0 released
To: "corpora@uib.no" <corpora@uib.no>

Dear CorporaList members,

BART (Beautiful Anaphora Resolution Toolkit) version 2.0 has been
released. This new version of the BART toolkit for developing anaphora
resolution systems is much improved over the previous semi-official
version released in 2008. Please visit the BART website to get the
latest version:

www.bart-anaphora.org

Please note the new name of our website! Unfortunately, the old
website (bart-coref.org) has been taken over by a third party, so
please remove it from your bookmarks.

More on BART:

BART, the Beautiful Anaphora Resolution Toolkit, was initially
developed during the project Exploiting Lexical and Encyclopedic
Resources For Entity Disambiguation at the Johns Hopkins Summer
Workshop 2007, but has since been constantly revised. BART performs
automatic coreference resolution (Multilingual), including all
necessary preprocessing steps to operate on the raw text input (for
English only). BART incorporates a variety of machine learning
approaches and can use several machine learning toolkits, including
WEKA and an included MaxEnt implementation. BART has shown
state-of-the-art performance at recent evaluation campaigns
(SemEval-2010 Task 1, CoNLL-2011, CoNLL-2012).

BART is an open-source Java toolkit and runs on Linux, MacOS and Windows.

What's new in the second release:

- 2 models for the out-of-the-box coreference for English (trained on OntoNotes)
- Generic Language Plugin supports coreference resolution in your
favourite language
- Italian and German Language Plugins
- More features
- More Preprocessing Pipelines
- Tabular import/export format
- Presets for straightforward testing
- Bugs fixed

BART development team

Original issue reported on code.google.com by nico.erbs on 2013-10-01 12:36:28

reckart commented 9 years ago
I downloaded it.
For English, it does not seem to make sense to integrate it into DKPro Core, since breaking
up the BART end-to-end system for English would lead to a decrease in performance.

For other languages (e.g. German), quite some effort seems to be required to ensure
the minimum level of preprocessing needed to run BART.

Original issue reported on code.google.com by eckle.kohler on 2013-10-01 13:10:00

reckart commented 9 years ago
I don't understand why it does not make sense to integrate it.

Original issue reported on code.google.com by torsten.zesch on 2013-10-01 13:12:19

reckart commented 9 years ago
*SEEM*

Original issue reported on code.google.com by eckle.kohler on 2013-10-01 13:15:08

reckart commented 9 years ago
Ok, let me rephrase:
Since you downloaded the tool and had a look, why do you think it does not _seem_ to make
sense to integrate it?

Original issue reported on code.google.com by torsten.zesch on 2013-10-01 13:20:51

reckart commented 9 years ago
I can imagine that it would be quite some effort to integrate it. But I'd be curious
why a decrease in performance is to be expected. I assume you mean "performance"
in the sense of quality, not speed. Aside from the effort, I think it would be nice
in any case to integrate it, because we simply don't have much for coreference yet.
I think (hope) that most of the preprocessing required should already be present in
DKPro Core.

Original issue reported on code.google.com by richard.eckart on 2013-10-01 13:37:34

reckart commented 9 years ago
Torsten wrote:
>> why do you think it does not _seem_ to make sense to integrate it.
My comment relates to the effort associated with integrating it. For English, we also
have the Stanford Coreference Resolver, which is state-of-the-art.
So the question is: is it worth the effort for English?

Richard wrote:
>> I assume you mean "performance" in the sense of quality, not speed.
right

>> I think (hope) that most of the preprocessing required should already be present
in DKPro Core.

For German, the morphosyntactic preprocessing is not well covered right now.
BART makes heavy use of the morphosyntactic properties gender, number and case.

I think we would have to create an appropriate type for this kind of information first.

Original issue reported on code.google.com by eckle.kohler on 2013-10-03 20:02:11

reckart commented 9 years ago
Some background on the performance on other languages:

"a language-agnostic system (designed primarily for English) can achieve a
performance level in high forties (MUC F-score) when retrained and tested on
a new language, at least on gold mention boundaries. Though this number might
appear low, note that it is a baseline requiring no extra engineering."
see http://www.lrec-conf.org/proceedings/lrec2010/pdf/755_Paper.pdf

This performance is really low. I would not want to use such a component.
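For reference, the "MUC F-score" in the quote above refers to the link-based metric of Vilain et al. (1995). A minimal sketch of how it is computed, with entity chains represented as sets of mention ids (class and method names here are illustrative, not part of BART's API):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the MUC link-based coreference score (Vilain et al., 1995).
public class MucScore {

    // Number of partitions of chain s induced by the chains in "other":
    // each mention of s falls into the chain of "other" that contains it,
    // or into its own singleton part if "other" does not mention it.
    static int partitions(Set<String> s, List<Set<String>> other) {
        Set<Set<String>> parts = new HashSet<>();
        int singletons = 0;
        for (String m : s) {
            Set<String> home = null;
            for (Set<String> o : other) {
                if (o.contains(m)) { home = o; break; }
            }
            if (home == null) { singletons++; } else { parts.add(home); }
        }
        return parts.size() + singletons;
    }

    // MUC recall of "response" against "key"; precision is the same
    // computation with the roles of key and response swapped.
    static double recall(List<Set<String>> key, List<Set<String>> response) {
        int num = 0, den = 0;
        for (Set<String> s : key) {
            num += s.size() - partitions(s, response);
            den += s.size() - 1;
        }
        return den == 0 ? 0.0 : (double) num / den;
    }

    static double f1(List<Set<String>> key, List<Set<String>> response) {
        double r = recall(key, response);
        double p = recall(response, key);
        return p + r == 0 ? 0.0 : 2 * p * r / (p + r);
    }
}
```

For example, a key chain {a, b, c} answered by the two chains {a, b} and {c} misses one link, giving recall 0.5 at precision 1.0.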

Original issue reported on code.google.com by eckle.kohler on 2013-10-03 20:06:23

reckart commented 9 years ago
I believe the Stanford coreferencer also uses such information (gender, etc.), but it
brings its own resources for these things. I'd tend to try to feed BART with anything
that we can already produce (token, pos, lemma, named entity, etc.) and let BART handle
all the things we cannot yet produce (e.g. gender and so on). Getting anything to work
would already be quite nice; factoring out additional steps could happen afterwards.

Btw. I also downloaded BART 2.0 and had a brief look. I still have no idea where to
hook in :( It seems the code contains not just a single component or pipeline,
but rather a full coreference construction kit with many things that one actually
wouldn't need just to do the "default" coref resolution.

Original issue reported on code.google.com by richard.eckart on 2013-10-03 20:13:01

reckart commented 9 years ago
As far as I understand, regarding German, BART does not bring the preprocessing resources:

README:
"We do not support preprocessing for languages other than English. So, to run BART
on another language, you first have to preprocess your data yourself, generating all
the necessary markable levels, including the "markable" level that contains info on
the mentions. In sample/generic-min, we show the minimal amount of information to be
provided to BART to run any experiment. In sample/generic-max, we show the same documents,
but with much more information encoded both via MMAX levels and via attributes on the
"markable" level."

...
"Prepare your dataset in the MMAX format, making sure that you include at least all
the information shown in the sample/generic-min example (that is: tokens in Basedata/*words.xml,
coreference levels, pos levels, markable levels specifying markable_id and span for
each markable). "
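As a concrete illustration of the first requirement (tokens in Basedata/*words.xml), here is a minimal sketch that emits such a token file. The `<word>` element, id scheme, and DOCTYPE follow common MMAX2 conventions and are assumptions here, to be checked against the sample/generic-min files:

```java
import java.util.List;

// Sketch: emitting a minimal MMAX2 Basedata token file ("Basedata/*words.xml")
// as described in the BART README. Element names, the "word_<n>" id scheme,
// and the DOCTYPE are assumed from MMAX2 conventions.
public class MmaxBasedata {

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static String toWordsXml(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<!DOCTYPE words SYSTEM \"words.dtd\">\n");
        sb.append("<words>\n");
        int i = 1;
        for (String t : tokens) {
            sb.append("<word id=\"word_").append(i++).append("\">")
              .append(escape(t)).append("</word>\n");
        }
        sb.append("</words>\n");
        return sb.toString();
    }
}
```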

So if you have a look at the coreference levels, you find a very rich annotation (in
generic-min!) that would require morphosyntactic annotation as well as our new SemanticFieldAnnotator
(category="concrete")
e.g.

sample/generic-min/train/markables/wsjarrau_1128_coref_level.xml
<markable id="markable_77" span="word_1..word_4" generic="generic-no" person="per3"
related_object="no" gram_fnc="subj" number="sing" reference="new" category="concrete"
mmax_level="coref" gender="neut" min_words="picture" min_ids="word_4" coref_set="set_48"/>
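A minimal sketch of pulling the morphosyntactic attributes out of such a markable element with stdlib Java DOM; the attribute names come from the sample file above, while the selection of attributes and any later mapping to DKPro Core types is hypothetical:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

// Sketch: reading selected attributes from a single MMAX markable element.
// Attribute names are taken from sample/generic-min; the selection is ours.
public class MarkableReader {

    static Map<String, String> readMarkable(String xml) {
        try {
            Element e = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            Map<String, String> attrs = new LinkedHashMap<>();
            for (String name : List.of("id", "span", "gender", "number",
                    "person", "gram_fnc", "category", "coref_set")) {
                if (e.hasAttribute(name)) {
                    attrs.put(name, e.getAttribute(name));
                }
            }
            return attrs;
        } catch (Exception e) {
            throw new RuntimeException("could not parse markable", e);
        }
    }
}
```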

Original issue reported on code.google.com by eckle.kohler on 2013-10-03 20:27:27

reckart commented 9 years ago
Writing out data in any particular MMAX2 dialect to have it processed by Bart isn't
something I would consider particularly desirable. I mean, there must be some way to
construct a model in-memory and pass that to whatever parts of Bart perform the actual
processing.

I'm a bit confused about that file (sample/generic-min/train/markables/wsjarrau_1128_coref_level.xml).
I would expect that the tool comes with pretrained models and that the "coref" layer
would be the output. 

So I gather, there is not only no pre-processing included for German, but there are
also no models included for German. In which case, we would need to train our own models...
ok, even if we wanted to do that, on which data? Preprocessing for German producing
these features mentioned above is one thing, but in addition, we would need gold coreference
annotations, right?

One of the main points of integrating many of the tools we integrate is that we don't
have to train models, because these tools already come with models. If Bart does not
work out-of-the-box, then I actually do wonder if it's worth bothering with it. I guess
for English it works at least, doesn't it? 

Original issue reported on code.google.com by richard.eckart on 2013-10-03 20:42:19

reckart commented 9 years ago
>>So I gather, there is not only no pre-processing included for German, but there are
also no models included for German. In which case, we would need to train our own models...
ok, even if we wanted to do that, on which data? Preprocessing for German producing
these features mentioned above is one thing, but in addition, we would need gold coreference
annotations, right?

I had a look in the GermanLanguagePlugin. This is very basic and partly hacky, but
could be extended.

There is a dataset for German: http://stel.ub.edu/semeval2010-coref/node/7

- task: Detection of full coreference chains, composed of named entities, pronouns,
and full noun phrases.
- data: "The data set comes from the TüBa-D/Z Treebank (Hinrichs et al. 2005), a German
newspaper corpus based on data taken from the daily issues of "die tageszeitung" (taz).
Hand-annotated with inflectional morphology, constituent structure, grammatical functions,
and anaphoric and coreference relations.
Training: 415k words."

>> I guess for English it works at least, doesn't it? 

right, just tried it - out of the box as a web demo

Original issue reported on code.google.com by eckle.kohler on 2013-10-03 21:00:26

reckart commented 9 years ago
I'm not sure if I made my doubts regarding the mmax coref layer you previously referred
to sufficiently explicit. I'll retry (but I didn't do any further investigation yet).

The min example contains three layers: markable, pos, and coref.

The max example contains more layers: markable, chunk, enamex, lemma, morph, parse,
phrase, unit, and coref.

I expect that coref is the output of BART while the other layers are input. So I would
expect that minimally, BART can work with pos information. If there is morphological
information in the coref layer, I would expect that to be ignored or be generated by
BART as part of the processing - but not prior to the processing.

The "morph" layer in the max example appears to contain only lemma information - in
fact, it appears to be the same as the "lemma" layer.

There is a layer "markable" with additional semantic information in the max example,
but this information is not present in the min example (the layer is there however).

So, I suppose that at least for English, it should be possible to get quite far with
the pre-processing components that we have, possibly including the SemanticFieldAnnotator
or something equivalent which may be included directly with BART (based on WordNet).
Since English works out-of-the-box, there may be some kind of morphological analysis
included with BART as well (also based on WordNet?).

Original issue reported on code.google.com by richard.eckart on 2013-10-04 12:33:25

reckart commented 9 years ago
I did some code reading and think I have largely understood how BART works. This can
be discussed in one of the upcoming meetings.

Regarding http://code.google.com/p/dkpro-core-asl/issues/detail?id=258#c12

- you are right regarding the mmax layers
- for English, we will get worse results than BART end-to-end, if we use just our preprocessing
- for German, we should employ POS tagging and parsing; but will probably get much
worse results than for English, because of the German language plugin currently provided:

BART is kind of a knowledge-based system, and the German language plugin is still a bit
knowledge-poor compared to the English language plugin.

Original issue reported on code.google.com by eckle.kohler on 2013-10-04 20:04:25

reckart commented 9 years ago
(No text was entered with this change)

Original issue reported on code.google.com by richard.eckart on 2014-08-14 10:05:19

reckart commented 9 years ago
(No text was entered with this change)

Original issue reported on code.google.com by richard.eckart on 2015-01-22 22:42:54