facebookresearch / KILT

Library for Knowledge Intensive Language Tasks

Serious inconsistencies in Wizard of Wikipedia data #41

Closed - AshwinParanjape closed this 3 years ago

AshwinParanjape commented 3 years ago

If we look at the validation data: the original data contains 8840 wizard turns across both validation splits, all of them unique. Out of the 3058 KILT validation instances, only 1069 "answers" (i.e., wizard utterances) are unique. That is, the same wizard utterance is shared across multiple conversational histories.

I also checked the latest version, and the issue still seems to be there. If you search for "answer": "Brown hair is the second in the file http://dl.fbaipublicfiles.com/KILT/wow-dev-kilt.jsonl, you'll find that answer used in 38 different conversations. That's not the case in the original data.

The same holds for the train set: out of 94577 training instances, there are only 63733 unique inputs and 20427 unique answers. I didn't check the current online version of that file, though, since I'm assuming the same bug affects both the validation and training files.
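
For reference, here is a minimal sketch of the check I'm running, assuming the standard KILT record layout (one JSON object per line, with an output list whose entries may carry an answer field):

```python
import json
from collections import Counter

def answer_counts(path):
    """Tally how often each answer string appears in a KILT-format jsonl file."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            # each KILT record carries a list of outputs; not every entry has an "answer"
            for output in record.get("output", []):
                if "answer" in output:
                    counts[output["answer"]] += 1
    return counts

counts = answer_counts("wow-dev-kilt.jsonl")
print(len(counts), "unique answers")
print(counts.most_common(10))
```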

Here are the 10 most frequent validation answers (with frequencies):

[("I think I'm going to have mine done by a professional hairdresser,", 45),
 ('Jazz originated in the late 19th century', 44),
 ('Red is the colour at the end of the visible spectrum of light', 38),
 ('Brown hair is the second most common human hair color, after black hair, my hair is also brown but on the lighter side. some parts of my hair change to blonde in the summer.', 38),
 ('I am not 100% sure on that however, I do know that it was founded by Enzo Ferrari and the company built its first car in 1940.', 37),
 ("I probably wouldn't. I'm happy with black hair. Although, hair coloring is definitely on the rise, as 75% of women and 18% of men in Copenhagen, for example, have reported dying their hair, if that gives you any indication.", 37),
 ('Not 100% sure on that, either but Brand Finance rated the car the worlds most powerful car in 2014. That is awesome and I think I need a Ferrari. lol', 29),
 ('It has a wavelenght that starts at 625 nanometres.', 29),
 ('Hello, have you colored your hair before? It is practice of changing the hair color', 28),
 ("I've herd something crazy like 75% of women and 18% of men use hair dye.", 28)]

And the 10 most frequent training answers (with frequencies):

[('It originated from Italy.', 343),
 ('I have one dog! I love selectively bred dogs.', 259),
 ('The first mention of it was in the 10th century, but nobody knows for sure who invented it.', 237),
 ("It's different. Our pizza was invented in Naples, and that's been popular around the world.", 210),
 ('Not right now, but I wish I did they are great for companionship and they can hunt vermin...lol', 190),
 ('So do I! it is one of the three primary colours??', 182),
 ('Yep, blue mixed with green and violet to make turquoise is great as well', 180),
 ("Yes, I see where you're coming from, but theres also potential for dogs proficient in hunting and herding, pulling loads,", 173),
 ('I think veganism is a bit narcissistic. The philosphy behind it I think elevates animals status illogically.', 171),
 ('yea it was founded by richard and maurice mcdonald in san bernardino, california', 166)]

I haven't checked any of the other datasets, so I can't speak for them.

fabiopetroni commented 3 years ago

Thanks for reporting this issue @AshwinParanjape! We are investigating

fabiopetroni commented 3 years ago

There was a small bug in the mapping script for Wizard of Wikipedia that affected a portion of the answers. The good news is that the provenance section is not affected. The other datasets in KILT are also unaffected, since each uses its own dedicated mapping script. @AshwinParanjape thanks again for reporting this inconsistency - we really appreciate it! I fixed the bug and prepared a new version of the data at:

Could you please double check that inconsistencies are actually gone? If so, I'll update the official files as well as the results for this dataset.

AshwinParanjape commented 3 years ago

It does seem that the valid set is fixed, but I'm still finding repeating answers in the train set.

Here are the 10 most common answers:

[('Hi, are you a fan of the Baltimore Orioles? They are from Maryland, and are a professional baseball team!', 11),
 ('Hello!', 10),
 ('Hello, have you ever had to take your pets to an animal shelter?', 6),
 ('So you like the Baltimore Orioles baseball team?', 6),
 ('Ohhhh so do I! That lovely yeasted flatbread topped with tomato sauce and cheese is the bomb!', 6),
 ('Its a great color. It is one of the three primary colors.', 6),
 ('atural selection occurs in a population of organisms of the same species when the individuals:', 6),
 ('Oh really, Blue is one of the three primary colors.', 6),
 ('wow, i love swimming too, swimming is usually done for recreation, sport exercise or survival', 5),
 ('Hello, did you want to talk about tattoos?', 5)]

With the exception of "Hello!", the others are too specific to be repeated by chance. I checked the first and third: each occurs only once in the original train set. The couple I checked were both second turns in their conversations, a pattern that might be useful for locating the bug.

As far as I can tell, the original dataset has only 83160 wizard turns, but the KILT train set has 94577 instances.

Separately, valid_random_split has 4442 wizard turns and valid_topic_split has 4398. If both have been merged into a single validation set, why does the KILT validation set have only 3055 instances?
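
For what it's worth, here is the kind of count I'm doing on the original data - a rough sketch that assumes the ParlAI release format, where each file is a JSON list of dialogues and each turn's speaker string contains either "Wizard" or "Apprentice":

```python
import json

def count_wizard_turns(path):
    """Count wizard utterances in an original Wizard of Wikipedia json file."""
    with open(path) as f:
        dialogues = json.load(f)
    # speakers are strings like "0_Wizard" / "1_Apprentice" in the ParlAI release
    return sum(
        1
        for dialogue in dialogues
        for turn in dialogue["dialog"]
        if "Wizard" in turn["speaker"]
    )

for split in ("train.json", "valid_random_split.json", "valid_topic_split.json"):
    print(split, count_wizard_turns(split))
```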

fabiopetroni commented 3 years ago

Thanks a lot for your help with this @AshwinParanjape! Very happy to hear that the dev data is now ok. :) The train data indeed contained some duplicated <input, answer> pairs. Could you please try again now?

Note that we filtered out from the validation set all instances for which we were unable to map the knowledge evidence to the KILT Wikipedia dump.
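
If it helps, a quick way to double-check that the duplicated <input, answer> pairs are gone - a minimal sketch under the same KILT record-layout assumption as above (the train filename here is a guess):

```python
import json
from collections import Counter

def duplicated_pairs(path):
    """Return (input, answer) pairs that appear more than once in a KILT jsonl file."""
    pairs = Counter()
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            for output in record.get("output", []):
                if "answer" in output:
                    pairs[(record["input"], output["answer"])] += 1
    return [(pair, n) for pair, n in pairs.most_common() if n > 1]

# an empty list means no duplicated pairs remain
print(duplicated_pairs("wow-train-kilt.jsonl")[:10])
```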

AshwinParanjape commented 3 years ago

This is pretty good! One more question though: are you also filtering out some training instances, just like the validation instances?

The original dataset has 83160 wizard turns, but the current train-kilt.jsonl has only 63734.

fabiopetroni commented 3 years ago

Hey @AshwinParanjape, yes, we filtered out some training instances in wow as well, mainly those where the knowledge is unknown, plus some other corner cases. I hope this helps. :)

AshwinParanjape commented 3 years ago

Great! I don't see any more discrepancies, so I'll close the issue.

Thanks for all the prompt help, that was truly fabulous :)

Would it be possible to document the filtering criteria for both the train and valid sets? I think it's a bit confusing for anyone wanting to compare with results from previous papers that used the original dataset.

fabiopetroni commented 3 years ago

Thanks again @AshwinParanjape - I'm now going to update the official wow KILT files. The filtering criteria are already documented in the KILT paper (especially for the dev and test sets) - we also have a section there analysing the effect of our filtering with respect to using the original dataset. ;)