I downloaded it.
For English, it does not seem to make sense to integrate it into DKPro Core, since breaking
up the BART end-to-end system for English will lead to a decrease in performance.
For other languages (e.g. German), quite some effort seems to be required to ensure
the minimum level of preprocessing needed to run BART.
Original issue reported on code.google.com by eckle.kohler
on 2013-10-01 13:10:00
I don't understand why it does not make sense to integrate it.
Original issue reported on code.google.com by torsten.zesch
on 2013-10-01 13:12:19
*SEEM*
Original issue reported on code.google.com by eckle.kohler
on 2013-10-01 13:15:08
Ok, let me rephrase:
As you downloaded the tool and had a look at it, why do you think it does not _seem_ to make
sense to integrate it?
Original issue reported on code.google.com by torsten.zesch
on 2013-10-01 13:20:51
I can imagine that it would be quite some effort to integrate it. But I am rather
curious why a decrease in performance is to be expected. I assume you mean "performance"
in the sense of quality, not speed. Effort aside, I think it would be nice
in any case to integrate it, because we simply don't have much for coreference yet.
I think (hope) that most of the preprocessing required should already be present in
DKPro Core.
Original issue reported on code.google.com by richard.eckart
on 2013-10-01 13:37:34
Torsten wrote:
>> why do you think it does not _seem_ to make sense to integrate it?
My comment relates to the effort associated with integrating it. For English, we also
have the Stanford Coreference Resolver, which is state-of-the-art.
So the question is: is it worth the effort for English?
Richard wrote:
>> I assume you mean "performance" in the sense of quality, not speed.
right
>> I think (hope) that most of the preprocessing required should already be present
in DKPro Core.
For German, the morphosyntactic preprocessing is not well covered right now.
BART makes heavy use of the morphosyntactic properties gender, number and case.
I think we would have to create an appropriate type for this kind of information first.
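Just to sketch what I mean (the type and feature names below are invented; nothing like this exists in DKPro Core yet), such a type could be defined roughly like this via the plain UIMA metadata API:

import org.apache.uima.UIMAFramework;
import org.apache.uima.resource.metadata.TypeDescription;
import org.apache.uima.resource.metadata.TypeSystemDescription;

public class MorphTypeSketch {

    /**
     * Builds a type system containing a hypothetical token-aligned annotation
     * type that carries the morphosyntactic properties BART relies on.
     */
    public static TypeSystemDescription createMorphTypeSystem() {
        TypeSystemDescription tsd = UIMAFramework.getResourceSpecifierFactory()
                .createTypeSystemDescription();

        // Hypothetical type name - DKPro Core has no such type at the moment.
        TypeDescription morph = tsd.addType(
                "de.tudarmstadt.ukp.dkpro.core.api.morph.type.MorphFeatures",
                "Token-aligned morphosyntactic features",
                "uima.tcas.Annotation");

        morph.addFeature("gender", "Grammatical gender, e.g. masc/fem/neut", "uima.cas.String");
        morph.addFeature("number", "Grammatical number, e.g. sing/plur", "uima.cas.String");
        morph.addFeature("case", "Grammatical case, e.g. nom/gen/dat/acc", "uima.cas.String");

        return tsd;
    }
}

In practice we would probably define it in an XML type system descriptor and generate JCas classes with JCasGen, but this shows which information would have to be carried.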
Original issue reported on code.google.com by eckle.kohler
on 2013-10-03 20:02:11
Some background on the performance on other languages:
"a language-agnostic
system (designed primarily for English) can achieve a per-
formance level in high forties (MUC F-score) when re-
trained and tested on a new language, at least on gold
mention boundaries. Though this number might appear
low, note that it is a baseline requiring no extra engineer-
ing." see http://www.lrec-conf.org/proceedings/lrec2010/pdf/755_Paper.pdf
This performance is really low. I would not want to use such a component.
Original issue reported on code.google.com by eckle.kohler
on 2013-10-03 20:06:23
I believe the Stanford coreferencer also uses such information (gender, etc.), but it
brings its own resources for these things. I'd tend to try to feed BART with anything
we can already produce (token, pos, lemma, named entity, etc.) and let BART handle
all the things we cannot yet produce (e.g. gender and so on). Getting anything to work
would already be quite nice; factoring out additional steps could happen afterwards.
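Roughly what I have in mind, as a uimaFIT sketch. The DKPro Core component and parameter names are from memory and may differ between versions, and BartCoreferenceResolver is a purely hypothetical wrapper that we would still have to write:

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;

import org.apache.uima.fit.pipeline.SimplePipeline;

import de.tudarmstadt.ukp.dkpro.core.io.text.TextReader;
import de.tudarmstadt.ukp.dkpro.core.matetools.MateLemmatizer;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpNameFinder;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger;
import de.tudarmstadt.ukp.dkpro.core.tokit.BreakIteratorSegmenter;

public class BartPipelineSketch {
    public static void main(String[] args) throws Exception {
        SimplePipeline.runPipeline(
            createReaderDescription(TextReader.class,
                TextReader.PARAM_SOURCE_LOCATION, "input/*.txt",
                TextReader.PARAM_LANGUAGE, "en"),
            // Preprocessing we can already produce with DKPro Core
            createEngineDescription(BreakIteratorSegmenter.class),   // tokens, sentences
            createEngineDescription(OpenNlpPosTagger.class),         // pos
            createEngineDescription(MateLemmatizer.class),           // lemma
            createEngineDescription(OpenNlpNameFinder.class),        // named entities
            // Hypothetical wrapper around BART that consumes the annotations
            // above and adds CoreferenceChain/CoreferenceLink annotations.
            createEngineDescription(BartCoreferenceResolver.class));
    }
}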
Btw. I also downloaded BART 2.0 and had a brief look. I still have no idea where to
hook in :( It seems the code does not contain just a single component or pipeline,
but is rather a full coreference construction kit with many things that one would not
actually need just to do the "default" coref resolution.
Original issue reported on code.google.com by richard.eckart
on 2013-10-03 20:13:01
As far as I understand, BART does not include the preprocessing resources for German. From the README:
"We do not support preprocessing for languages other than English. So, to run BART
on another language, you first have to preprocess your data yourself, generating all
the necessary markable levels, including the "markable" level that contains info on
the mentions. In sample/generic-min, we show the minimal amount of information to be
provided to BART to run any experiment. In sample/generic-max, we show the same documents,
but with much more information encoded both via MMAX levels and via attributes on the
"markable" level."
...
"Prepare your dataset in the MMAX format, making sure that you include at least all
the information shown in the sample/generic-min example (that is: tokens in Basedata/*words.xml,
coreference levels, pos levels, markable levels specifying markable_id and span for
each markable). "
So if you have a look at the coreference levels, you find a very rich annotation (in
generic-min!) that would require morphosyntactic annotation as well as our new SemanticFieldAnnotator
(category="concrete"), e.g. in sample/generic-min/train/markables/wsjarrau_1128_coref_level.xml:
<markable id="markable_77" span="word_1..word_4" generic="generic-no" person="per3"
related_object="no" gram_fnc="subj" number="sing" reference="new" category="concrete"
mmax_level="coref" gender="neut" min_words="picture" min_ids="word_4" coref_set="set_48"/>
Original issue reported on code.google.com by eckle.kohler
on 2013-10-03 20:27:27
Writing out data in some particular MMAX2 dialect just to have it processed by BART isn't
something I would consider particularly desirable. I mean, there must be some way to
construct a model in memory and pass that to whatever parts of BART perform the actual
processing.
I'm a bit confused about that file (sample/generic-min/train/markables/wsjarrau_1128_coref_level.xml).
I would expect that the tool comes with pretrained models and that the "coref" layer
would be the output.
So I gather, there is not only no pre-processing included for German, but there are
also no models included for German. In which case, we would need to train our own models...
ok, even if we wanted to do that, on which data? Preprocessing for German producing
these features mentioned above is one thing, but in addition, we would need gold coreference
annotations, right?
One of the main points of integrating many of the tools we integrate is that we don't
have to train models, because these tools already come with them. If BART does not
work out of the box, then I actually do wonder whether it's worth bothering with it. I guess
for English it works at least, doesn't it?
Original issue reported on code.google.com by richard.eckart
on 2013-10-03 20:42:19
>>So I gather, there is not only no pre-processing included for German, but there are
also no models included for German. In which case, we would need to train our own models...
ok, even if we wanted to do that, on which data? Preprocessing for German producing
these features mentioned above is one thing, but in addition, we would need gold coreference
annotations, right?
I had a look at the GermanLanguagePlugin. It is very basic and partly hacky, but
could be extended.
There is a dataset for German: http://stel.ub.edu/semeval2010-coref/node/7
- task: Detection of full coreference chains, composed of named entities, pronouns,
and full noun phrases.
- data: "The data set comes from the TüBa-D/Z Treebank (Hinrichs et al. 2005), a German
newspaper corpus based on data taken from the daily issues of "die tageszeitung" (taz).
Hand-annotated with inflectional morphology, constituent structure, grammatical functions,
and anaphoric and coreference relations.
Training: 415k words."
>> I guess for English it works at least, doesn't it?
Right, I just tried it out of the box, as a web demo.
Original issue reported on code.google.com by eckle.kohler
on 2013-10-03 21:00:26
I'm not sure I made my doubts about the MMAX coref layer you previously referred
to sufficiently explicit. Let me retry (but I haven't done any further investigation yet).
The min example contains three layers: markable, pos, and coref.
The max example contains more layers: markable, chunk, enamex, lemma, morph, parse,
phrase, unit, and coref.
I expect that coref is the output of BART while the other layers are input. So I would
expect that, minimally, BART can work with pos information. If there is morphological
information in the coref layer, I would expect it to be ignored or to be generated by
BART as part of the processing, but not prior to it.
The "morph" layer in the max example appears to contain only lemma information - in
fact, it appears to be the same as the "lemma" layer.
There is a layer "markable" with additional semantic information in the max example,
but this information is not present in the min example (the layer is there, however).
So I suppose that, at least for English, it should be possible to get quite far with
the preprocessing components that we have, possibly including the SemanticFieldAnnotator
or something equivalent that may be included directly with BART (based on WordNet).
Since English works out of the box, there may be some kind of morphological analysis
included with BART as well (also based on WordNet?).
Original issue reported on code.google.com by richard.eckart
on 2013-10-04 12:33:25
I did some code reading and think I have largely understood how BART works. This can
be discussed in one of the upcoming meetings.
Regarding http://code.google.com/p/dkpro-core-asl/issues/detail?id=258#c12
- you are right regarding the MMAX layers
- for English, we will get worse results than BART end-to-end if we use just our own preprocessing
- for German, we should employ POS tagging and parsing, but we will probably get much
worse results than for English because of the German language plugin currently provided:
BART is kind of a knowledge-based system, and the German language plugin is still a bit
knowledge-poor compared to the English one.
Original issue reported on code.google.com by eckle.kohler
on 2013-10-04 20:04:25
Original issue reported on code.google.com by nico.erbs
on 2013-10-01 12:36:28