facebookresearch / clutrr

Diagnostic benchmark suite to explicitly test logical relational reasoning on natural language

Issues in the AMT templates and how to mitigate them #15

Open zhunyoung opened 2 years ago

zhunyoung commented 2 years ago

Dear authors, @koustuvsinha @pminervini @shagunsodhani

Thanks for the great work!

I downloaded the dataset from the provided link https://drive.google.com/file/d/1SEq_e1IVCDDzsBIBhoUQ5pOVH5kxRoZF/view and found a few mistakes in the test dataset. Below are 4 mistakes that I found from the first 10 data instances in file data_06b8f2a1/1.3_test.csv in the dataset. It seems to me that a big portion of the data may not be correct.

Index: 2
Story: [Kathleen] was excited because she was meeting her father, [Henry], for lunch. [Howard] and his son [Wayne] went to look at cars. [Howard] ended up buying the Mustang. [Howard] likes to spend time with his aunt, [Kathleen], who was excellent at cooking chicken.
Query: ('Wayne', 'Henry')
Target: father
Comment: The target should be greatgrandfather.

Index: 5
Story: [Johanna] spent a great day shopping with her daughter, [Vickie]. [Vickie] wanted to visit her grandmother [Donna], but [Donna] was asleep. [Johanna] and [Philip] left that evening to go bowling.
Query: ("Philip","Donna")
Target: mother
Comment: We cannot tell any relationship for Philip.

Index: 6
Story: [Johanna] enjoyed a homemade dinner with her son [Cedric] [Wayne] and his son, [Cedric], went over to [Donna]'s house for the holidays. [Wayne] loved seeing his mother, but [Cedric] was less enthusiastic.
Query: ("Johanna","Donna")
Target: mother
Comment: The target should be mother_in_law.

Index: 9
Story: [Devin] and his Aunt [Kathleen] flew first class [Devin] has a few children, [Philip], Bradley and Claire [Kathleen] vowed to never trust her father, [Henry] with her debit card again.
Query: ("Philip","Henry")
Target: father
Comment: The target should be greatgrandfather.
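For readers who want to screen instances like these automatically, here is a rough sketch of a generation-level sanity check. It is not part of the CLUTRR code: the offset table and the function names `chain_offset` and `target_is_plausible` are assumptions made for illustration, and the relation chain for each story is written out by hand.

```python
# Generation offset of each relation, read as "B is A's <relation>":
# +1 means B is one generation above A, -1 one below, 0 the same.
# This table is an illustrative assumption, not CLUTRR's own rule set.
GENERATION_OFFSET = {
    "father": 1, "mother": 1, "aunt": 1, "uncle": 1, "mother_in_law": 1,
    "grandfather": 2, "grandmother": 2,
    "greatgrandfather": 3,
    "son": -1, "daughter": -1,
    "brother": 0, "sister": 0,
}


def chain_offset(relations):
    """Sum the generation offsets along a chain of relations (query head to tail)."""
    return sum(GENERATION_OFFSET[r] for r in relations)


def target_is_plausible(relations, target):
    """Return True if the target sits at the generation level the chain implies."""
    return chain_offset(relations) == GENERATION_OFFSET[target]


# Index 2: Wayne -> Howard (father), Howard -> Kathleen (aunt), Kathleen -> Henry (father)
chain_2 = ["father", "aunt", "father"]
print(target_is_plausible(chain_2, "father"))            # False
print(target_is_plausible(chain_2, "greatgrandfather"))  # True
```

This check is deliberately coarse. It would flag indices 2 and 9, where the labelled target is two generations off, but not index 6, since mother and mother_in_law sit at the same generation level, and not index 5, where the query entity has no stated relation at all.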

Since other users already submitted issues to report errors in the dataset a year ago, is there any update to the dataset (e.g., a cleaner version with fewer mistakes)? Thanks a lot!

koustuvsinha commented 2 years ago

Hi @zhunyoung ! Thanks for pointing this out! We already have a project underway cleaning some of these issues - most of these issues stemmed from the use of Amazon Mechanical Turk templates, some of which were not as clean as we hoped. Our next version will be released in the next couple of months, which will fix most of these issues. Thanks for your interest! :)

veronica320 commented 1 year ago

Hi authors, thanks for putting together this dataset! Just wanna follow up if there's any update on fixing the dataset errors. If not, would it be possible to at least have a subset of the dataset which is known to be error-free? This would be very useful for comparing different models.

Thanks in advance for any help!

zharry29 commented 1 year ago

Seconded - it would be quite unfortunate if such a great dataset contains so many errors that they render it unusable. Looking forward to the fix asap!

koustuvsinha commented 1 year ago

Hi @zhunyoung @veronica320 @zharry29, thanks for your interest in the dataset, and apologies for the delayed response. As mentioned earlier, these issues stem from the Amazon Mechanical Turk templates. Specifically, the issue stems from the problem of role swapping - where annotators swapped the roles of entities leading to hierarchically opposite kinship relations.

Since there are close to 5000 templates, manually re-annotating them is extremely time-consuming, and I don't have the bandwidth for it. However, I spent some time figuring out how to extract the relations automatically so that we can at least filter out the logically incorrect templates. It turns out this is a hard problem, but I have been able to set up a process to do so. I have released the new templates in the develop branch, where you can find templates annotated by two models: Flan T5 and GPT3, both of which are surprisingly good at extracting the relation from the templates! Using their annotations, you can now filter the templates during dataset generation using the code on the develop branch (CLUTRR v1.3).
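To make the filtering idea concrete, here is a minimal sketch of keeping only templates where both model annotations agree with the relation the template is supposed to express. The record layout and the field names `relation`, `flan_t5_relation` and `gpt3_relation` are hypothetical, as are the two toy records; the actual format in the develop branch may differ.

```python
def keep_template(template):
    """Keep a template only if both model annotations match its intended relation.

    The keys used here are hypothetical placeholders, not the develop branch's schema.
    """
    intended = template["relation"]
    return (
        template["flan_t5_relation"] == intended
        and template["gpt3_relation"] == intended
    )


def filter_templates(templates):
    """Return the subset of templates that pass the agreement check."""
    return [t for t in templates if keep_template(t)]


# Toy records for illustration only; they are not taken from the released templates.
toy_templates = [
    {
        "text": "[A] took his grandson [B] to the park.",
        "relation": "grandson",
        "flan_t5_relation": "grandson",
        "gpt3_relation": "grandson",
    },
    {
        # Role-swapped template: the text expresses the opposite direction.
        "text": "[B] went to the park with his grandfather [A].",
        "relation": "grandson",
        "flan_t5_relation": "grandfather",
        "gpt3_relation": "grandfather",
    },
]

kept = filter_templates(toy_templates)
print(len(kept))  # 1
```

Requiring both annotators to agree trades coverage for precision; a looser policy (either model agreeing) would keep more templates at the cost of more residual errors.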

I have documented the whole process at this blog post if you are curious to know more / explore alternative methods. Please feel free to provide feedback in this thread, and also let me know if you face any issues generating data using the code at develop branch!

Thanks for reading, and thanks for pointing out this issue in the first place. I'll pin this thread so that future users can be aware of this.

zhunyoung commented 1 year ago

Hi @koustuvsinha,

Thanks for your detailed explanations, your code, and your post!

I followed the steps to install the develop branch in a new conda environment. Then, after installing the sklearn package with conda install -c conda-forge scikit-learn, I could successfully run the data generation script ./generate.sh. I noticed that the generated story is super long. Below is a copy of the story in the first test data that is generated using ./generate.sh.

"[Shelton] and his daughter [Louie] took a day off school to go to the zoo. [Louie] and her uncle [Nathaniel] went to the pet shop. [Louie] saw a puppy that she loved, so [Nathaniel] bought it for her. [Malvina] took her grandson [Colin] to the park. [Colin]'s brother [Nathaniel] was already there. [Shelton] took his grandson [Artie] to the baseball game. [Shelton] took his sister [Blanche] out to lunch after learning that she got accepted into her first choice for university. [Shelton] took his grandson [Shelton] and [Shelton]'s brother [Nellie] to the amusement park Saturday and they had a good time. [Jeremiah] and his mother, [Louie], went to a pet store. [Jeremiah] wanted a parrot, but his mom got him a smaller bird instead. [Karl] enjoys picking flowers with his son's daughter. Her name is [Louie]. [Nathaniel] went to his brother [Artie]'s Birthday party [Karl] would n't let his son [Colin] go to the park by himself. [Colin]'s brother [Colin] offered to go with him. [Olin] took his grandson [Colin] to a movie at the local theater. [Serena] went to her son [Colin]'s House [Malvina] was excited because she got to go to the zoo with her grandson [Artie]. [Blanche]'s grandfather, [Olin], baked her a beautiful cake for her 9th birthday. [Serena] just had a baby and presented the baby proudly to the new maternal grandmother, [Allie]. [Nellie]'s grandmother, [Allie], was eager to spend a weekend with all of her grandchildren. [Linnie] asked her aunt [Serena] for 5 dollars for her field trip. [Linnie] made a cake for her grandfather, [Hollie]. [Serena] and her mother [Olin] made breakfast together. [Helen] had picked her daughter [Serena] out the cutest new dress to wear on her birthday. [Blanche] spent a great day shopping with her daughter, [Walter]. [Nellie] dropped his niece [Walter] off at school. [Elizabeth] had a daughter named [Blanche]. [Blanche] and her brother [Shelby] went to see a movie. [Helen] and her husband [Olin] went on a cruise. 
They had a wonderful time. [Karl] and [Serena] were married twenty years ago today, becoming husband and wife on a glorious spring day. [Helen] picked up her husband, [Olin] from the pool. "

I checked the hyper-parameters in generate.sh but the following settings

MAX_PATH_LEN=5
...
TEST_DESCRIPTOR_LENGTHS=\'3,4\'

seem not to restrict the number of sentences in the generated story. I'm not sure how to regenerate a dataset similar to the original CLUTRR dataset. If you have generated a dataset with the cleaned templates already, could you share that with us for evaluation purposes on our own models? If not, could you give some guidelines to generate such cleaned data?

Thanks a lot!

koustuvsinha commented 1 year ago

Hi @zhunyoung, apologies for the delayed response - the notification for this thread seems to have missed my inbox for some reason. I believe the reason is that the noise setting is set to true, which is one of the test conditions of the CLUTRR dataset (we added spurious noise, such as dangling, irrelevant or disconnected paths - please check the paper for more details). If you set it to false (NOISE=false), you should get a shorter story.
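One quick way to check whether a generated story still contains noise is to count the distinct bracketed entities it mentions. The sketch below assumes that a noise-free story built from a chain of k relations mentions about k + 1 entities, and that a descriptor length such as 3 corresponds to k = 3; both are assumptions of this example, and the helper names are hypothetical rather than part of the CLUTRR code.

```python
import re


def story_entities(story):
    """Return the set of distinct entity names written as [Name] in a story."""
    return set(re.findall(r"\[([^\[\]]+)\]", story))


def looks_noise_free(story, num_relations):
    """Heuristic: a clean chain of num_relations relations has at most num_relations + 1 entities."""
    return len(story_entities(story)) <= num_relations + 1


# Story from index 2 of 1.3_test.csv, quoted earlier in this thread.
short_story = (
    "[Kathleen] was excited because she was meeting her father, [Henry], for lunch. "
    "[Howard] and his son [Wayne] went to look at cars. "
    "[Howard] ended up buying the Mustang. "
    "[Howard] likes to spend time with his aunt, [Kathleen], who was excellent at cooking chicken."
)

print(sorted(story_entities(short_story)))  # ['Henry', 'Howard', 'Kathleen', 'Wayne']
print(looks_noise_free(short_story, 3))     # True
```

A story generated with NOISE=true would typically fail this check because the added paths introduce extra entities, so the entity count gives a cheap signal of which setting produced a given file.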

azreasoners commented 1 year ago

Thanks for the help! My goal is to use the new code to

To achieve the goal above, I still need to generate some data instances with NOISE=true. Can the above goal be achieved using the current code or is this part still under development? Thanks!

koustuvsinha commented 1 year ago

@azreasoners just set the NOISE_POLICY flag appropriately, along with NOISE=true.