UniversalPropositions / UP-1.0

Universal Proposition Banks for Multilingual Semantic Role Labeling

adapt all datasets to the same conllus format #7

Closed arademaker closed 3 years ago

arademaker commented 4 years ago

This PR:

  1. Renames all .conllu files to .conllus, adjusting the columns and arguments so that all data follow the same format. That is, all sentences have at least 11 columns. Columns 1 to 10 follow the CoNLL-U format specification, and column 11 specifies the predicates in the sentence. For each predicate in column 11, the sentence has one more "predicate column".

  2. In the scripts directory, we have some AWK scripts. In particular, we have long-short.awk and short-long.awk that change arguments from A[0-9M] to ARG[0-9M] and vice-versa. Some scripts may not be useful once this PR is accepted (e.g. the conlluf-conllus.awk was used to convert the previous "fake" CoNLL-U format to the new .conllus format).

  3. I fixed a bug in the English dataset where some sentences ended up having tokens with a different number of columns. Underscore (_) is used to denote unspecified values in all fields, following the CoNLL-U format.
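To make the intended layout concrete, here is a minimal Python sketch (not one of the AWK scripts in this PR) of the label conversion and of the column-count check described above. It assumes tab-separated token lines, `#` comment lines, and that a sentence with N predicates in column 11 has exactly 11 + N columns on every token line. The names `short_to_long`, `long_to_short`, and `check_sentence` are hypothetical.

```python
import re

# Assumption: short labels look like A0, AM-TMP, C-A1 and long labels like
# ARG0, ARGM-TMP, C-ARG1. Apply these only to values of the predicate columns.
SHORT = re.compile(r"\bA(?=[0-9M])")
LONG = re.compile(r"\bARG(?=[0-9M])")


def short_to_long(label):
    """A0 -> ARG0, AM-TMP -> ARGM-TMP, C-A1 -> C-ARG1."""
    return SHORT.sub("ARG", label)


def long_to_short(label):
    """ARG0 -> A0, ARGM-TMP -> AM-TMP, C-ARG1 -> C-A1."""
    return LONG.sub("A", label)


def check_sentence(lines):
    """Return a list of column-count problems for one sentence.

    `lines` are the lines of a single sentence; comment lines starting
    with '#' and empty lines are skipped.
    """
    rows = [l.split("\t") for l in lines if l and not l.startswith("#")]
    # Column 11 (index 10) holds the predicate, or '_' when there is none.
    n_preds = sum(1 for r in rows if len(r) > 10 and r[10] != "_")
    expected = 11 + n_preds
    return [
        f"token {r[0]}: {len(r)} columns, expected {expected}"
        for r in rows
        if len(r) != expected
    ]
```

A check like this would flag the kind of bug mentioned in item 3, where tokens of the same sentence end up with different numbers of columns.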

Pending issue:

We still have one difference between the files in the English dataset and those of the other languages. Each predicate column in the English dataset has V and C-V tags (see here) marking the position of the predicate alongside its arguments. In the other languages, these tags are not present in the predicate columns; only column 11 identifies the tokens annotated as predicates.
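If the tags were to be dropped from the English files (only one of the possible ways to resolve this, and not something decided in this PR), the change could look roughly like the sketch below. It assumes the tags appear exactly as `V` and `C-V` in the columns after column 11; `strip_predicate_tags` is a hypothetical helper name.

```python
def strip_predicate_tags(row, tags=("V", "C-V")):
    """Replace predicate-position tags in the predicate columns with '_'.

    `row` is one token line already split on tabs. Columns 1-11 are left
    untouched; only the predicate columns (12 onwards) are rewritten.
    """
    head, pred_cols = row[:11], row[11:]
    return head + ["_" if value in tags else value for value in pred_cols]
```

Going the other way (adding V to the other languages) would require reading column 11 to find which token each predicate column belongs to.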

arademaker commented 4 years ago

@huaiyu-zhu any update?

arademaker commented 4 years ago

We may need to think again about this PR. First, because the golden EN data may be submitted to https://github.com/propbank/propbank-release instead of being kept here. The question is how we can clearly distinguish the data here that is golden (manually revised) from the data produced automatically.

Second, because the conllus format may not be the best approach compared to https://universaldependencies.org/ext-format.html.

alanakbik commented 4 years ago

@arademaker I'd be happy to join the discussion on this!

arademaker commented 3 years ago

Please ignore this PR. In a new PR, I am suggesting the adoption of the conllup format.