dair-iitd / openie6

OpenIE6 system
GNU General Public License v3.0

Can you provide the original PTB dataset used in your work? #7

Closed (gao-lex closed 2 years ago)

gao-lex commented 3 years ago

To compare with your work, I need the original PTB dataset [1] used in the OpenIE6 model, but this dataset can no longer be found on the Internet. Could you provide it?

[1] Jessica Ficler, Yoav Goldberg: Coordination Annotation Extension in the Penn Tree Bank. ACL (1) 2016

alexeyev commented 2 years ago

Hello!

> To compare with your work, I need the original PTB dataset [1] used in OpenIE6 model. But this data set can't be found on the Internet now. Can you provide one?

It seems that the authors' CA dataset can be found here: https://zenodo.org/record/4054476

However, I have failed to find any description of the labels (one may try to guess, but that's not the right way to do research :)) or the evaluation code to reproduce the reported result.

@SaiKeshav may I ask you to share any of that or to suggest where to look? Thank you.

SaiKeshav commented 2 years ago

Hi @alexeyev, the label set being used is 'CC', 'CP_START', 'CP', 'SEP', 'OTHERS' and 'NONE' (defined in line).

NONE stands for words that don't belong to any coordination structure. CC stands for coordinating conjunction (and, but). CP stands for coordination phrase (Jeff Bezos, Amazon Company). CP_START marks the start of the entire coordination structure, which is also the start of the first coordination phrase. SEP stands for separators between coordination phrases (e.g. a comma). OTHERS stands for tokens inside the coordination structure that don't belong to any of the above categories.

See https://aclanthology.org/I17-1027.pdf (Section 2.1, Task Description) for an explanation of each of the above terms, and this function (link) for how the labels are parsed into the respective coordination structures.
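
To make the label semantics concrete, here is a minimal Python sketch of how a flat label sequence could be grouped into coordination structures. It is not the function from the repository: `Coordination` and `parse_labels` are hypothetical names, and it assumes a single, non-nested layer of labels in which CP_START opens a structure and NONE closes it.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Label set as listed in this thread.
LABELS = ("CC", "CP_START", "CP", "SEP", "OTHERS", "NONE")


@dataclass
class Coordination:
    """Hypothetical container for one flat coordination structure."""
    conjuncts: List[List[str]] = field(default_factory=list)
    coordinators: List[str] = field(default_factory=list)
    separators: List[str] = field(default_factory=list)


def parse_labels(tokens: List[str], labels: List[str]) -> List[Coordination]:
    """Group tokens into coordination structures (assumes no nesting)."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    structures: List[Coordination] = []
    current: Optional[Coordination] = None
    phrase: Optional[List[str]] = None  # conjunct currently being built

    def close_phrase() -> None:
        nonlocal phrase
        if current is not None and phrase:
            current.conjuncts.append(phrase)
        phrase = None

    def close_structure() -> None:
        nonlocal current
        close_phrase()
        if current is not None:
            structures.append(current)
        current = None

    for token, label in zip(tokens, labels):
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        if label == "CP_START":
            # Start of the whole structure and of its first conjunct.
            close_structure()
            current = Coordination()
            phrase = [token]
        elif label == "NONE":
            # Token outside any coordination structure.
            close_structure()
        elif current is None:
            # Assumption: labels seen before any CP_START are ignored.
            continue
        elif label == "CP":
            if phrase is None:
                phrase = []
            phrase.append(token)
        else:
            # CC, SEP and OTHERS all end the conjunct being built.
            close_phrase()
            if label == "CC":
                current.coordinators.append(token)
            elif label == "SEP":
                current.separators.append(token)
            # OTHERS tokens are inside the structure but not recorded here.

    close_structure()
    return structures


tokens = "She likes apples , pears and plums".split()
labels = ["NONE", "NONE", "CP_START", "SEP", "CP", "CC", "CP"]
result = parse_labels(tokens, labels)
print(result[0].conjuncts)  # [['apples'], ['pears'], ['plums']]
```

Nested coordinations need more than one layer of labels, so this sketch only illustrates what each tag means for a single layer.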

alexeyev commented 2 years ago

Hi @SaiKeshav, thank you for the clarification!