UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Unified Semantic Role Labels for UD Datasets #344

Open alanakbik opened 8 years ago

alanakbik commented 8 years ago

Hello all,

at IBM Research, we have been working on a layer of unified semantic annotations for a range of languages. We use a data-driven approach in which we re-use existing English Proposition Bank frame and role labels for new target languages, followed by a process of manual curation (ACL 2015, ACL 2016, EMNLP 2016).

For instance, consider the German sentence "Seine Arbeit wird von ehrenamtlichen Helfern und Regionalgruppen des Vereins unterstützt" (His work is supported by volunteers and regional groupings of the association). In CoNLL format, it looks like this, with English PropBank labels in the last two columns:

Id Form POS HeadId Deprel Frame Role
1 Seine DET 2 det:poss _ _
2 Arbeit NOUN 11 nsubjpass _ A1
3 wird AUX 11 auxpass _ _
4 von ADP 6 case _ _
5 ehrenamtlichen ADJ 6 amod _ _
6 Helfern NOUN 11 nmod _ A0
7 und CONJ 6 cc _ _
8 Regionalgruppen NOUN 6 conj _ _
9 des DET 10 det _ _
10 Vereins NOUN 8 nmod _ _
11 unterstützt VERB 0 root support.01 _
12 . PUNCT 11 punct _ _

The German verb 'unterstützt' is labeled as evoking the 'support.01' frame with two roles: "Seine Arbeit" (his work) is labeled A1 (project being supported) and "ehrenamtlichen Helfern und Regionalgruppen des Vereins" (volunteers and regional groupings of the association) is labeled A0 (the helper).
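As a rough illustration of how the two extra columns can be read, here is a minimal Python sketch that parses the rows above, finds the frame-evoking token, and expands each role-bearing head into the span of its dependency subtree. The whitespace-separated layout, the function names (`parse_rows`, `subtree_ids`, `extract_frames`) and the one-predicate-per-sentence simplification are assumptions made for this example, not part of any released tooling.

```python
from collections import defaultdict

# The example sentence, one token per row:
# Id Form POS HeadId Deprel Frame Role
SENTENCE = """\
1 Seine DET 2 det:poss _ _
2 Arbeit NOUN 11 nsubjpass _ A1
3 wird AUX 11 auxpass _ _
4 von ADP 6 case _ _
5 ehrenamtlichen ADJ 6 amod _ _
6 Helfern NOUN 11 nmod _ A0
7 und CONJ 6 cc _ _
8 Regionalgruppen NOUN 6 conj _ _
9 des DET 10 det _ _
10 Vereins NOUN 8 nmod _ _
11 unterstützt VERB 0 root support.01 _
12 . PUNCT 11 punct _ _
"""


def parse_rows(text):
    """Parse whitespace-separated rows into dictionaries (assumed layout)."""
    rows = []
    for line in text.strip().splitlines():
        tok_id, form, pos, head, deprel, frame, role = line.split()
        rows.append({"id": int(tok_id), "form": form, "pos": pos,
                     "head": int(head), "deprel": deprel,
                     "frame": frame, "role": role})
    return rows


def subtree_ids(rows, root_id):
    """Return the sorted ids of root_id and all of its descendants."""
    children = defaultdict(list)
    for row in rows:
        children[row["head"]].append(row["id"])
    ids, stack = [], [root_id]
    while stack:
        current = stack.pop()
        ids.append(current)
        stack.extend(children[current])
    return sorted(ids)


def extract_frames(rows):
    """Return (frames, roles); assumes a single predicate per sentence."""
    forms = {row["id"]: row["form"] for row in rows}
    frames = [(row["form"], row["frame"]) for row in rows if row["frame"] != "_"]
    roles = {}
    for row in rows:
        if row["role"] != "_":
            span = subtree_ids(rows, row["id"])
            roles[row["role"]] = " ".join(forms[i] for i in span)
    return frames, roles


frames, roles = extract_frames(parse_rows(SENTENCE))
print(frames)
print(roles)
```

Note that taking the full subtree of "Helfern" also picks up the preposition "von", which the description above leaves out of the A0 span; whether case markers belong to the role span is a design choice this sketch does not settle.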

With such data, we can create SRL systems that predict English PropBank labels for many different languages. See a recent demo screencast of this SRL for English, French and German here.

Contribute to UD?

We are now looking into releasing parts of this data to the research community. In particular, we are thinking of contributing this layer of annotation to the universal dependencies data sets (the sentence above is from the German UD dataset).

For this, we would like to know 1) whether there is interest on your side in including such labels in the data sets and 2) if so, how such a contribution could be organized. Please let us know your thoughts on this!

Cheers, Alan

Alan Akbik, IBM Research Almaden, http://alanakbik.github.io/

nschneid commented 8 years ago

@alanakbik, thanks for posting this! I'm very much interested in the broad-coverage, multilingual semantic role labeling space, and I think it would be great to explore whether a UD-like project in this direction makes sense—perhaps using UD itself as a syntactic starting point.

That said, I'm under the impression that the UD project wants to stay focused on syntax, not semantics. (UD core members, please correct me if I'm mistaken!)

arademaker commented 8 years ago

@nschneid, it seems to me that UD already points in some directions toward semantics. As @dan-zeman said once: "UD's philosophy is to make relations between content words the backbone of the dependency structure" and it may be a better structure for capturing the meaning. Moreover, adding further columns to the conllu format would not pose a problem, and I found some mentions on the UD website of semantic frame information being present in a Japanese corpus. Sure, just some thoughts...

dan-zeman commented 8 years ago

The UD project stays focused on syntax because there is a limit to the amount of things one can focus on. Personally I think that this is one of the potentially useful extensions of Universal Dependencies. (Other possible extensions include named entities and MWEs.) I can imagine that it appears as an additional tarball within a UD release: it would contain only the languages for which SRL annotation is available, and the same languages without the additional columns would still be included in the main tarball, so that people (and tools) that do not care about SRL will not see it.

There is a logistical problem, though. Treebank maintainers are allowed to modify their data (meaning UD proper) until about 14 days before the release. Those 14 days are reserved for the release team to put everything together and solve any issues. There is no time for an SRL team (or any other team doing extended layers above UD) to synchronize their data with the new underlying data. So maybe it would be easier if the UD-SRL collection were a separate release in Lindat that follows a UD release and explicitly states which UD release it is based on.

alanakbik commented 8 years ago

@nschneid @arademaker @dan-zeman thanks for the feedback!

I like the idea of adding the SRL annotated data as separate files next to the unannotated files. This is also in line with what we discussed with @jnivre at ACL. This would then be an optional layer of annotation available only for supported languages.

Synchronizing data would not be difficult unless there are major changes that affect either

  1. the corpus itself (new sentences are added for which we do not yet have SRL), or
  2. the constituent structure of the sentences (since SRL sits on top of constituents identified by dependency trees)

From what I can see of the languages we have been looking at, this type of change does not occur frequently at scale, but please correct me if I'm wrong.
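To make the two kinds of change concrete, here is a minimal Python sketch of how they could be detected between two versions of a treebank. It assumes each sentence is a list of `(id, form, head)` tuples keyed by its surface text, and that `role_heads` lists the token ids that carry SRL roles; these data shapes and the function names are hypothetical, chosen only for illustration.

```python
def span(tokens, root_id):
    """Return the sorted token ids of the subtree rooted at root_id."""
    ids, frontier = {root_id}, [root_id]
    while frontier:
        current = frontier.pop()
        for tok_id, _form, head in tokens:
            if head == current and tok_id not in ids:
                ids.add(tok_id)
                frontier.append(tok_id)
    return sorted(ids)


def diff_versions(old, new, role_heads):
    """Compare two treebank versions.

    old, new:   dict mapping sentence text -> list of (id, form, head)
    role_heads: dict mapping sentence text -> ids of role-bearing tokens
    Returns (added, removed, changed), where `changed` lists the
    (sentence text, head id) pairs whose subtree span differs.
    """
    added = sorted(set(new) - set(old))      # case 1: no SRL yet
    removed = sorted(set(old) - set(new))
    changed = []                             # case 2: span moved under SRL
    for text in sorted(set(old) & set(new)):
        for head_id in role_heads.get(text, []):
            if span(old[text], head_id) != span(new[text], head_id):
                changed.append((text, head_id))
    return added, removed, changed


# Toy data: token "c" is re-attached from "b" to "a" in the new version,
# and a new sentence "d e" is added.
old = {"a b c": [(1, "a", 2), (2, "b", 0), (3, "c", 2)]}
new = {"a b c": [(1, "a", 2), (2, "b", 0), (3, "c", 1)],
       "d e": [(1, "d", 0), (2, "e", 1)]}
print(diff_versions(old, new, {"a b c": [1]}))
```

Keying sentences by their text means a retokenized sentence shows up as one removal plus one addition, which seems acceptable for a first pass but would need refinement for treebanks with duplicate sentences.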

Do you see this as a possible way forward?

dan-zeman commented 8 years ago

@alanakbik : You are right that these changes are not frequent but they do occur. We have seen sentences added and removed in multiple languages. I have no evidence of 2 but it cannot be excluded. I recommend that you at least monitor the dev branches of the corpora for which you have SRL. Maybe even get in touch with the maintainers, because sometimes the real development takes place elsewhere, and the first sign of a change is the change itself, uploaded a day before the deadline :-)

jnivre commented 8 years ago

Anything can happen in principle between versions, but these changes should be rare in general. The important thing is to be careful about versioning, but it would be nice if we could have such a mechanism so that treebank providers are automatically alerted when they make changes that could mess up additional annotations. Any ideas?

amir-zeldes commented 8 years ago

Thumbs up from me for separate files - then you can independently commit to a treebank and run a script to check for identical tokenization, with warnings, but not conflicts if it fails.
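Such a check could look roughly like the Python sketch below. It assumes both files are tab-separated with the word form in the second column, with comment lines starting with `#` and blank lines between sentences; the function names `read_forms` and `check_tokenization` are made up for this example. Mismatches are reported through `warnings.warn` and returned as a list, so nothing fails hard.

```python
import warnings


def read_forms(conllu_text):
    """Return a list of sentences, each a list of FORM values (column 2)."""
    sentences, current = [], []
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):
            current.append(line.split("\t")[1])
    if current:
        sentences.append(current)
    return sentences


def check_tokenization(base_text, layer_text):
    """Warn about (but do not fail on) tokenization differences."""
    base, layer = read_forms(base_text), read_forms(layer_text)
    problems = []
    if len(base) != len(layer):
        problems.append(f"sentence count differs: {len(base)} vs {len(layer)}")
    for index, (base_sent, layer_sent) in enumerate(zip(base, layer), start=1):
        if base_sent != layer_sent:
            problems.append(f"sentence {index}: tokens differ")
    for message in problems:
        warnings.warn(message)
    return problems
```

In practice one would read the two files from disk and call `check_tokenization(base_text, layer_text)` from a commit hook or CI job; an empty list means the additional layer still lines up with the treebank.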

alanakbik commented 8 years ago

That sounds like a good idea. Since most of the additional annotations suggested so far (MWEs, NEs, SRL) seem to be more loosely coupled to the underlying treebanks, this would place the burden of maintaining consistency on the committers of the additional layers - which seems appropriate. No need to burden the treebank maintainers with this.

Separate files could be placed either directly next to the original UD files, or within a subfolder. For instance, I could imagine a subfolder structure for additional annotations like this, for German UDs:

# original UD treebank files
UD_German/
UD_German/de-ud-dev.conllu
UD_German/de-ud-test.conllu
UD_German/de-ud-train.conllu

# additional annotations
UD_German/additional-annotations/
UD_German/additional-annotations/propbank/
UD_German/additional-annotations/propbank/de-ud+srl-dev.conllu
UD_German/additional-annotations/propbank/de-ud+srl-test.conllu
UD_German/additional-annotations/propbank/de-ud+srl-train.conllu
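With a fixed naming scheme like this, tools could derive the location of an annotation file from the treebank file it belongs to. A minimal Python sketch, assuming exactly the layout above (the function name `annotation_path` and its `layer` and `suffix` parameters are hypothetical):

```python
from pathlib import PurePosixPath


def annotation_path(ud_file, layer="propbank", suffix="srl"):
    """Map a UD treebank file to its additional-annotation counterpart."""
    path = PurePosixPath(ud_file)
    # de-ud-dev.conllu -> de-ud+srl-dev.conllu
    name = path.name.replace("-ud-", f"-ud+{suffix}-", 1)
    return path.parent / "additional-annotations" / layer / name


print(annotation_path("UD_German/de-ud-dev.conllu"))
```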

Is this something you would like to try? We could begin with SRL for one language first as a test balloon and see if it works.

alanakbik commented 8 years ago

Quick update: We are still preparing the release on our end. We will also be at EMNLP from tomorrow to Friday, so please contact me if you'd like to meet up and discuss release details in person!

jnivre commented 8 years ago

Unfortunately, I won’t be at EMNLP this year. Have fun!

Joakim

alanakbik commented 7 years ago

Hello all,

we have made the first "universal proposition banks" for multilingual SRL publicly available here: https://github.com/System-T/UniversalPropositions

The version currently online covers three languages: Chinese, French and German. It is built on top of release 1.4 of UD, adding additional columns for the SRL annotation. We use PropBank v3.0 frame and role labels.

For now, we have opted to release this in a separate GitHub project, since this is very much ongoing research. We'd like to stay in the discussion though on how to incorporate such additional layers more directly into UD.

Also, please do check out the project! The easiest way to explore the formalism would be to check out the frame overview files for all Chinese verbs, French verbs and German verbs that we currently cover! Do let us know if you have any questions or comments!

Cheers, Alan

(P.S.: This post was originally mistakenly opened as a separate issue. But it belongs to this thread, so pasting it as a comment here instead)

jnivre commented 7 years ago

Great news indeed! Thanks for sharing this. Giving better guidelines and support for adding additional annotation on top of UD treebanks is definitely on our todo-list but we have simply been too busy with v2 lately.

fginter commented 7 years ago

Excellent! Thank you. We should look into integrating our Finnish Propbank http://turkunlp.github.io/Finnish_PropBank/

alanakbik commented 7 years ago

@fginter Yes, absolutely! Are you at COLING? @kanayamah, with whom we're looking at creating a Japanese Propbank over the Japanese UD, will also be there. If so, we could all meet up this week?

jnivre commented 7 years ago

@fginter is not at COLING, but some of us are. We are having a get-together on Tuesday evening.

@spyysalo can give you the coordinates.