Can you release the code for pre-processing the dataset?

facebookresearch / mudoco

A synthetic dataset of dialogs we authored and annotated for references (pronouns, etc.). This dataset is discussed in the paper "MuDoCo: Corpus for Multidomain Coreference Resolution and Referring Expression Generation", which appeared at LREC 2020.

Can you release the code for pre-processing the dataset? #1

Open kimdev95 opened 4 years ago

kimdev95 commented 4 years ago

Hi. Could you release the code you used for pre-processing the dataset? I found the dataset to be a little noisy, and I want to evaluate our coreference resolution model in the same setting as reported in your paper.

Some issues in the dataset are:

- Incorrect mention annotations. For example, in the utterance "I'll call them later .", both "I" and "I'll" are sometimes annotated as mentions in your dataset. A similar example is the utterance "I've sent the PDF to both of them ."
- There are some links (A, B) where either mention A or mention B never appears in the dialog.

coffeeblack commented 4 years ago

Thanks for the attention to our dataset, and apologies for the late response! This notification got deeply lost in my inbox.

I'll look into releasing the pre-processing code. As for the annotation quality issues, I think it makes sense for each of those to be opened separately (in addition to any others you have found). That way, we can track which were fixed in any subsequent releases of MuDoCo.

coffeeblack commented 3 years ago

Alternatively, @kimdev95, if you have an automated way to identify issues of the types you mentioned, maybe you could send a full list of all the occurrences? That way we could at least amend and re-release the dataset.
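
For anyone who wants to build such a list, here is a minimal sketch of how the two issue types reported above might be detected automatically. It does not read the actual MuDoCo files: the representation of a mention as a `(turn, start, end)` character span and of a link as a pair of such spans is an assumption made for illustration, and the function names are hypothetical. Nested mentions can be legitimate in coreference data, so the overlap check only produces candidates for manual review.

```python
from typing import List, Tuple

# Hypothetical representation (NOT the MuDoCo file format):
# a mention is (turn index, start char offset, end char offset), end exclusive.
Span = Tuple[int, int, int]
Link = Tuple[Span, Span]


def find_overlapping_mentions(mentions: List[Span]) -> List[Tuple[Span, Span]]:
    """Return pairs of mentions in the same turn whose character spans overlap."""
    overlaps = []
    ordered = sorted(mentions)  # sorts by turn, then start, then end
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            # Once b is in a later turn or starts at/after a's end,
            # no later span in sorted order can overlap a.
            if b[0] != a[0] or b[1] >= a[2]:
                break
            overlaps.append((a, b))
    return overlaps


def find_dangling_links(mentions: List[Span], links: List[Link]) -> List[Link]:
    """Return links where at least one endpoint is not an annotated mention."""
    known = set(mentions)
    return [(a, b) for a, b in links if a not in known or b not in known]


# Toy example built from the utterance "I'll call them later ." (turn 0).
mentions = [
    (0, 0, 1),    # "I"
    (0, 0, 4),    # "I'll"
    (0, 10, 14),  # "them"
]
links = [
    ((0, 10, 14), (1, 0, 4)),  # second endpoint was never annotated
]

overlapping = find_overlapping_mentions(mentions)
dangling = find_dangling_links(mentions, links)
print(overlapping)
print(dangling)
```

Running both checks over every dialog and collecting the results per dialog ID would give the kind of occurrence list requested above, once the loader is adapted to the real annotation format.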