facebookresearch / ParlAI

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai
MIT License

Issues with Wizard of Wikipedia #2417

Closed g-karthik closed 4 years ago

g-karthik commented 4 years ago

Hello!

I'm trying to parse examples to train a knowledge selection model with the Wizard of Wikipedia dataset. I'm creating label candidates by following the code here.

I am finding that some candidates fetched using the code above have more than 4000 tokens. Here are a couple of sample candidates I printed from the label_cands list in the above linked code.

Search for Index of philatelic articles in train.json and you will find the below in the list of candidates.

index of philatelic articles acknowledgement of receipt adhesive ( stamp gum ) admirals aerogram aerophilately affixing machine airmail airmail etiquette airmail stamp alexandria " blue boy " postmaster's provisional arrow block asian philately astrophilately auction ( philatelic ) asian pacific postal union balloon mail bicycle mail bisect bogus postal markings bogus stamp issue booklet british guiana 1c magenta bulk mail cachet camel mail cancellation cancelled to order caribou mail carrier's stamp censored mail center line block certified mail charity stamp chinese new year stamps christmas seal christmas stamp cigarette tax stamp cinderella stamp circular delivery mail classic stamp coil stamp - coil waste color guide color trial combination cover commemorative issue commemorative stamp concentration camp mail control mark counterfeit stamps courier mail cover crash cover crown agents philatelic and security printing archive cut square damaged mail dead letter mail definitive issue definitive series delayed mail design error die proof diplomatic pouch mail dirigible mail disinfected mail dog mail dogsled mail dummy stamp earliest known use ( eku ) embossing engraver's mark engraving entire envelope errors and varieties errors, freaks, and oddities ( efo ) essay expert europa postage stamp expertization express company express mail famous stamps fancy cancel favor cancel favor sheet fee paid mail field post office first day ceremony first day cover first day of issue first flight cover first issue fiscal cancel fiscal issue flat plate press floor sweepings forerunner forged stamps forwarding agent fractional currency franchise stamp franking privilege free frank fumigated mail graphite lined stamp grill guide line guide line pair gum gutter block gutter pair handstruck stamp health stamp highway post office history of philately hotel post hovercraft mail illegal stamps illustration law imitation stamp imperforate imprint block imprinted stamp postage stamp ink 
inscription inscription block institutional collection insured mail international mail international reply coupon inverted jenny inverted swan philatelic investment irradiated mail interrupted mail - italian philatelic press association james chalmers joint issue joint line joint line pair killer kiloware label late fee stamp letter carrier letterpress letter sheet line pair linn's stamp news list of entities that have issued postage stamps list of philatelic topics ( deliberate self - link ) list of philatelists list of notable postage stamps list of stamp catalogues list of stamp collectors list of stamp dealers list of united states airmail stamps lithography local post luminescent issue mail delivery by animal mail fraud mail robbery mailman marcophily marginal marking marine insurance stamp maritime mail maximaphily maximum card metered mail michel catalog military mail millennium stamp miniature sheet minipack mixed franking mobile post office money order mr. zip nassau street ( manhattan ) naval cover naval mail navigation and commerce issue new issue newspaper stamp newspaper wrapper nicholas f. 
seebeck occupation stamp offices abroad official mail offset printing overprint packet letter packet mark postage stamp paper paquebot parcel post paste - up pair penalty mail pen cancel penny post perfin perforation - perforation gauge permit mail phantom issue philatelic agency philatelic cover philatelic literature philatelic museums philately phq cards picture post card pigeon mail pillar box plate block plate marking plate number coil printing plate plating plebiscite issue pneumatic mail postage due postage stamp booklet postage stamp color postage stamp reuse postage stamp separation postal card postal convention postal history postal laws and regulations postal marking postal route postal savings postal slogan postal stationery postal tax postal treaty postcard post office post office cards post office circulars post road precancel presentation album presentation book price list postage stamp printing postal relation prisoner - of - war mail private cancellation private carrier private overprint stamp proof provisional stamp railway post office ration stamp receiving mark red cross label registered mail regummed stamp reissue reprint remainder reperforation reply card reply coupon revenue cancellation revenue stamp rocket mail rotary press rouletting rowland hill savings issue scott catalog se - tenant semi - official semi - postal separation siege mail - siderography slogan cancellation socked on the nose ( son ) souvenir card souvenir sheet space cover space mail special delivery special handling specimen stamp stamp album stamp catalog stamp collecting stamp condition stamp design stamp exhibition stamp finder stamp gum stamp hinge stamp mounting stamp separation stanley gibbons steamship issue streetcar mail strike mail study circle surcharge tax stamp telegraph stamp test stamp tete - beche thematic collecting tin can mail topical collecting training stamp transatlantic mail transoceanic mail treaty port treskilling yellow triptych unaccepted design 
undeliverable mail undesirable issue uniform fourpenny post uniform penny post universal postal union untagged used abroad valentine cover varnish bars stamp vending machine view card want list war cover war issue war savings issue war tax stamp watermark wrapper ( philately ) wreck cover yvert catalog z grill zeppelin mail zip block

Another case:

Search for TRL's Number Ones in train.json and you will find the below in the list of candidates.

trl's number ones january 2 : " lose yourself " - eminem january 3 : " i'm with you " - avril lavigne january 6 : " cry me a river " - justin timberlake january 7 : " i'm with you " - avril lavigne january 8 : " i'm with you " - avril lavigne january 9 : " cry me a river " - justin timberlake january 10 : " cry me a river " - justin timberlake january 13 : " cry me a river " - justin timberlake january 14 : " all i have " - jennifer lopez featuring ll cool j january 15 : " cry me a river " - justin timberlake january 16 : " i'm with you " - avril lavigne january 17 : " i'm with you " - avril lavigne january 21 : " the anthem " - good charlotte january 22 : " the anthem " - good charlotte january 23 : " the anthem " - good charlotte january 24 : " cry me a river " - justin timberlake january 27 : " the anthem " - good charlotte january 28 : " the anthem " - good charlotte january 29 : " the anthem " - good charlotte january 30 : " bump, bump, bump " - b2k featuring p. diddy january 31 : " the anthem " - good charlotte february 3 : " the anthem " - good charlotte february 4 : " in da club " - 50 cent february 5 : " the anthem " - good charlotte february 6 : " in da club " - 50 cent february 7 : " in da club " - 50 cent february 10 : " the anthem " - good charlotte february 11 : " in da club " - 50 cent february 12 : " in da club " - 50 cent february 13 : " the anthem " - good charlotte february 14 : valentine's day special # 1 video : " why i love you " - b2k february 17 : trl awards # 1 video : " larger than life " - backstreet boys february 18 : " in da club " - 50 cent february 19 : " in da club " - 50 cent february 20 : " in da club " - 50 cent february 21 : " in da club " - 50 cent february 24 : " in da club " - 50 cent february 25 : " in da club " - 50 cent february 26 : " rock your body " - justin timberlake february 27 : " in da club " - 50 cent february 28 : " sing for the moment " - eminem march 3 : " in da club " - 50 cent march 4 : " sing for the moment " 
- eminem march 5 : " the anthem " - good charlotte march 6 : " in da club " - 50 cent march 7 : " the anthem " - good charlotte march 10 : " in da club " - 50 cent march 11 : " sing for the moment " - eminem march 12 : " the anthem " - good charlotte march 13 : " the anthem " - good charlotte march 14 : " sing for the moment " - eminem march 17 : " in da club " - 50 cent march 18 : " the anthem " - good charlotte march 19 : " the anthem " - good charlotte march 20 : coverage on the war in iraq march 21 : " the anthem " - good charlotte march 24 : " the hell song " - sum 41 march 25 : " the anthem " - good charlotte march 26 : " the anthem " - good charlotte march 27 : " the anthem " - good charlotte march 31 : " the anthem " - good charlotte april 1 : " girlfriend " - b2k april 2 : " the anthem " - good charlotte april 3 : " rock your body " - justin timberlake april 4 : " rock your body " - justin timberlake april 7 : " rock your body " - justin timberlake april 8 : " rock your body " - justin timberlake april 9 : " addicted " - simple plan april 10 : " rock your body " - justin timberlake april 11 : " fighter " - christina aguilera april 14 : " fighter " - christina aguilera april 15 : " fighter " - christina aguilera april 16 : " fighter " - christina aguilera april 17 : " 21 questions " - 50 cent featuring nate dogg april 18 : " 21 questions " - 50 cent featuring nate dogg april 21 : " fighter " - christina aguilera april 22 : " 21 questions " - 50 cent featuring nate dogg april 23 : " rock your body " - justin timberlake april 24 : " fighter " - christina aguilera april 25 : " 21 questions " - 50 cent featuring nate dogg april 28 : " fighter " - christina aguilera april 29 : " fighter " - christina aguilera april 30 : " fighter " - christina aguilera may 1 : " fighter " - christina aguilera may 2 : " fighter " - christina aguilera may 5 : " fighter " - christina aguilera may 6 : " fighter " - christina aguilera may 7 : " fighter " - christina aguilera may 8 : 
" fighter " - christina aguilera may 9 : " 21 questions " - 50 cent featuring nate dogg may 12 : " miss independent " - kelly clarkson may 13 : " miss independent " - kelly clarkson may 14 : " fighter " - christina aguilera may 15 : " miss independent " - kelly clarkson may 16 : " miss independent " - kelly clarkson may 19 : " 21 questions " - 50 cent featuring nate dogg may 20 : " miss independent " - kelly clarkson may 21 : " fighter " - christina aguilera may 22 : " fighter " - christina aguilera may 26 : " miss independent " - kelly clarkson may 27 : " miss independent " - kelly clarkson may 28 : " miss independent " - kelly clarkson may 29 : " miss independent " - kelly clarkson may 30 : " miss independent " - kelly clarkson june 2 : " girls & boys " - good charlotte june 3 : " girls & boys " - good charlotte june 4 : " rock wit u " - ashanti june 5 : " girls & boys " - good charlotte june 9 : " miss independent " - kelly clarkson june 10 : " girls & boys " - good charlotte june 11 : " girls & boys " - good charlotte june 12 : " girls & boys " - good charlotte june 16 : " girls & boys " - good charlotte june 17 : " crazy in love " - beyonce featuring jay - z june 18 : " crazy in love " - beyonce featuring jay - z june 19 : " miss independent " - kelly clarkson june 23 : " girls & boys " - good charlotte june 24 : " crazy in love " - beyonce featuring jay - z june 25 : summer anthems # 1 video : " hot in herre " - nelly june 26 : " crazy in love " - beyonce featuring jay - z june 30 : " crazy in love " - beyonce featuring jay - z july 1 : " miss independent " - kelly clarkson july 2 : " miss independent " - kelly clarkson july 3 : " crazy in love " - beyonce featuring jay - z july 7 : " crazy in love " - beyonce featuring jay - z july 8 : " crazy in love " - beyonce featuring jay - z july 9 : " crazy in love " - beyonce featuring jay - z july 10 : dynamic duos # 1 video : " girlfriend ( remix ) " -'n sync featuring nelly july 11 : " crazy in love " - beyonce 
featuring jay - z july 14 : " can't hold us down " - christina aguilera featuring lil'kim july 15 : " can't hold us down " - christina aguilera featuring lil'kim july 16 : " crazy in love " - beyonce featuring jay - z july 17 : " can't hold us down " - christina aguilera featuring lil'kim july 18 : " can't hold us down " - christina aguilera featuring lil'kim july 21 : " p. i. m. p. ( remix ) " - 50 cent featuring snoop dogg & g - unit july 22 : " can't hold us down " - christina aguilera featuring lil'kim july 23 : " p. i. m. p. ( remix ) " - 50 cent featuring snoop dogg & g - unit july 24 : funniest video # 1 video : " without me " - eminem july 25 : " p. i. m. p. ( remix ) " - 50 cent featuring snoop dogg & g - unit july 28 : " p. i. m. p. ( remix ) " - 50 cent featuring snoop dogg & g - unit july 29 : " senorita " - justin timberlake featuring pharrell july 30 : " senorita " - justin timberlake featuring pharrell july 31 : " senorita " - justin timberlake featuring pharrell august 1 : " senorita " - justin timberlake featuring pharrell august 4 : " senorita " - justin timberlake featuring pharrell august 5 : " senorita " - justin timberlake featuring pharrell august 6 : " can't hold us down " - christina aguilera featuring lil'kim august 7 : " p. i. m. p. ( remix ) " - 50 cent featuring snoop dogg & g - unit august 8 : " can't hold us down " - christina aguilera featuring lil'kim august 11 : " senorita " - justin timberlake featuring pharrell august 12 : " can't hold us down " - christina aguilera featuring lil'kim august 13 : " p. i. m. p. ( remix ) " - 50 cent featuring snoop dogg & g - unit august 14 : " senorita " - justin timberlake featuring pharrell august 15 : august 18 : " p. i. m. p. ( remix ) " - 50 cent featuring snoop dogg & g - unit august 19 : " p. i. m. p. ( remix ) " - 50 cent featuring snoop dogg & g - unit august 20 : " p. i. m. p. ( remix ) " - 50 cent featuring snoop dogg & g - unit august 21 : " low " - kelly clarkson august 26 : " p. i. 
m. p. ( remix ) " - 50 cent featuring snoop dogg & g - unit august 27 : " right thurr " - chingy august 28 : " low " - kelly clarkson august 29 : " right thurr " - chingy september 2 : " right thurr " - chingy september 3 : " baby boy " - beyonce featuring sean paul september 4 : " baby boy " - beyonce featuring sean paul september 5 : " low " - kelly clarkson september 8 : " low " - kelly clarkson september 9 : " baby boy " - beyonce featuring sean paul september 10 : " so yesterday " - hilary duff september 11 : " so yesterday " - hilary duff september 12 : " baby boy " - beyonce featuring sean paul september 15 : " baby boy " - beyonce featuring sean paul september 16 : " hey ya " - outkast september 17 : " hey ya " - outkast september 18 : " hey ya " - outkast september 19 : " so yesterday " - hilary duff september 22 : " hey ya " - outkast september 23 : " hey ya " - outkast september 24 : " hey ya " - outkast september 25 : " hey ya " - outkast september 26 : " hey ya " - outkast september 29 : " hey ya " - outkast september 30 : " hey ya " - outkast october 1 : " hey ya " - outkast october 2 : " hey ya " - outkast october 3 : " senorita " - justin timberlake featuring pharrell october 6 : " hey ya " - outkast october 7 : " baby boy " - beyonce featuring sean paul october 8 : " hey ya " - outkast october 9 : " perfect " - simple plan october 10 : " hey ya " - outkast october 14 : " hey ya " - outkast october 15 : " hey ya " - outkast october 16 : " hey ya " - outkast october 17 : " trouble " - p! nk october 18 : " trouble " - pink october 21 : " the voice within " - christina aguilera october 22 : " hey ya " - outkast october 23 : " the voice within " - christina aguilera october 27 : " me against the music " - britney spears featuring madonna october 28 : " me against the music " - britney spears featuring madonna october 29 : " me against the music " - britney spears featuring madonna october 30 : " will you " - p. o. d. 
october 31 : " me against the music " - britney spears featuring madonna november 3 : " me against the music " - britney spears featuring madonna november 4 : " me against the music " - britney spears featuring madonna november 5 : " me against the music " - britney spears featuring madonna november 6 : " me against the music " - britney spears featuring madonna november 7 : " me against the music " - britney spears featuring madonna november 10 : " me against the music " - britney spears featuring madonna november 11 : " me against the music " - britney spears featuring madonna november 12 : " me against the music " - britney spears featuring madonna november 13 : " me against the music " - britney spears featuring madonna november 14 : " me against the music " - britney spears featuring madonna november 17 : " hold on " - good charlotte november 18 : " hold on " - good charlotte november 19 : " me against the music " - britney spears featuring madonna november 20 : " hold on " - good charlotte november 24 : " hold on " - good charlotte november 25 : " me against the music " - britney spears featuring madonna december 1 : " me against the music " - britney spears featuring madonna december 2 : " me against the music " - britney spears featuring madonna december 3 : " invisible " - clay aiken december 4 : " feeling this " - blink - 182 december 5 : " invisible " - clay aiken december 8 : " invisible " - clay aiken december 9 : " invisible " - clay aiken december 10 : " me against the music " - britney spears featuring madonna december 11 : " feeling this " - blink - 182 december 12 : " invisible " - clay aiken december 15 : " invisible " - clay aiken december 16 : " me against the music " - britney spears featuring madonna december 17 : " me, myself, and i " - beyonce december 18 : " me, myself, and i " - beyonce december 19 : " me, myself, and i " - beyonce december 22 : " me, myself, and i " - beyonce december 23 : " me, myself, and i " - beyonce december 29 : " 
poppin them thangs " - g - unit december 30 : song of the year : " the anthem " - good charlotte january 5 : " feeling this " - blink - 182 january 6 : " poppin them thangs " - g - unit january 7 : " poppin them thangs " - g - unit january 8 : " feeling this " - blink - 182 january 9 : " invisible " - clay aiken january 12 : " invisible " - clay aiken january 13 : " invisible " - clay aiken january 14 : " feeling this " - blink - 182 january 15 : " invisible " - clay aiken january 16 : " invisible " - clay aiken january 19 : " toxic " - britney spears january 20 : " toxic " - britney spears january 21 : " invisible " - clay aiken january 22 : " invisible " - clay aiken january 26 : " toxic " - britney spears january 27 : " invisible " - clay aiken january 28 : " toxic " - britney spears january 29 : " toxic " - britney spears january 30 : " one call away " - chingy featuring jason weaver february 2 : " toxic " - britney spears february 9 : " toxic " - britney spears february 10 : " one call away " - chingy featuring j / weav february 11 : " toxic " - britney spears february 12 : " toxic " - britney spears february 13 : " hold on " - good charlotte february 16 : " toxic " - britney spears february 17 : " one call away " - chingy featuring j / weav february 18 : " toxic " - britney spears february 19 : " sorry 2004 " - ruben studdard february 20 : " toxic " - britney spears february 23 : " toxic " - britney spears february 24 : " toxic " - britney spears february 25 : " toxic " - britney spears february 26 : " yeah " - usher featuring ludacris & lil'jon february 27 : " yeah " - usher featuring ludacris & lil'jon march 1 : " yeah " - usher featuring ludacris & lil'jon march 2 : " yeah " - usher featuring ludacris & lil'jon march 3 : " yeah " - usher featuring ludacris & lil'jon march 4 : " yeah " - usher featuring ludacris & lil'jon march 5 : " yeah " - usher featuring ludacris & lil'jon march 8 : " toxic " - britney spears march 9 : " yeah " - usher featuring 
ludacris & lil'jon march 10 : " the way " - clay aiken march 11 : " yeah " - usher featuring ludacris & lil'jon march 12 : " yeah " - usher featuring ludacris & lil'jon march 15 : " yeah " - usher featuring ludacris & lil'jon march 16 : " yeah " - usher featuring ludacris & lil'jon march 17 : " the way " - clay aiken march 18 : " yeah " - usher featuring ludacris & lil'jon march 22 : " yeah " - usher featuring ludacris & lil'jon march 23 : " yeah " - usher featuring ludacris & lil'jon march 24 : " yeah " - usher featuring ludacris & lil'jon march 25 : " my band " - d12 march 26 : " yeah " - usher featuring ludacris & lil'jon march 29 : " yeah " - usher featuring ludacris & lil'jon march 30 : " yeah " - usher featuring ludacris & lil'jon march 31 : " my band " - d12 april 1 : " yeah " - usher featuring ludacris & lil'jon april 5 : " naughty girl " - beyonce april 6 : " yeah " - usher featuring ludacris & lil'jon april 7 : " yeah " - usher featuring ludacris & lil'jon april 8 : " the way " - clay aiken april 9 : " yeah " - usher featuring ludacris & lil'jon april 12 : " yeah " - usher featuring ludacris & lil'jon april 13 : " yeah " - usher featuring ludacris & lil'jon april 14 : " roses " - outkast april 15 : " the way " - clay aiken april 16 : " everytime " - britney spears april 19 : " everytime " - britney spears april 20 : " roses " - outkast april 21 : " everytime " - britney spears april 22 : " everytime " - britney spears april 23 : " my band " - d12 april 26 : " everytime " - britney spears april 27 : " my band " - d12 april 28 : " my band " - d12 april 29 : " everytime " - britney spears may 3 : " my band " - d12 may 4 : " my band " - d12 may 5 : " my band " - d12 may 6 : " my band " - d12 may 7 : " my band " - d12 may 10 : " my band " - d12 may 11 : " my band " - d12 may 17 : " my band " - d12 may 18 : " burn " - usher may 19 : " burn " - usher may 20 : " burn " - usher may 21 : " burn " - usher may 24 : " my band " - d12 may 25 : " burn " - usher may 26 : 
" burn " - usher may 27 : " my band " - d12 may 28 : " my band " - d12 may 31 : summer anthems # 1 video : " hot in herre " - nelly june 1 : " burn " - usher june 2 : " burn " - usher june 3 : " all downhill from here " - new found glory june 4 : " my band " - d12 june 7 : " my band " - d12 june 8 : " burn " - usher june 9 : " all downhill from here " - new found glory june 10 : cross country countdown # 1 video : " forgot about dre " - dr. dre featuring eminem june 11 : " burn " - usher june 14 : " burn " - usher june 15 : " how come " - d12 june 16 : " how come " - d12 june 17 : " how come " - d12 june 18 : " how come " - d12 june 21 : " how come " - d12 june 22 : " how come " - d12 june 23 : " leave ( get out ) " - jojo june 24 : " how come " - d12 june 25 : " how come " - d12 june 28 : " leave ( get out ) " - jojo june 29 : " how come " - d12 june 30 : " confessions pt.

Were such candidates rendered to the Wizard during the conversation? Seems like a lot of content to render on their screen.

Also, how do you handle such candidates when training the knowledge selection model? It seems some of these candidates have upwards of 4000 tokens! Would you pad all candidates to the maximum sequence length (which would be like 4000 based on this example)?

g-karthik commented 4 years ago

Also, I tried searching for the ground-truth label (which I presume is this) in the label_cands list, by doing label_cands.index(labels[0]) within a try-except block, and I found that there are cases where labels[0] does not actually exist in the list of label_cands.

How do you handle such cases where the ground-truth label does not exist in the set of candidates?

stephenroller commented 4 years ago

We truncate the knowledge, I think to 128 tokens.
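A minimal sketch of what per-candidate truncation could look like, assuming plain whitespace tokenization and a limit of 128 tokens (the figure recalled above; the tokenizer and default used by ParlAI's agents may differ). `truncate_candidates` is a hypothetical helper for illustration, not a ParlAI function.

```python
from typing import List


def truncate_candidates(cands: List[str], max_tokens: int = 128) -> List[str]:
    """Keep at most `max_tokens` whitespace-separated tokens per candidate."""
    return [" ".join(c.split()[:max_tokens]) for c in cands]


if __name__ == "__main__":
    # A synthetic 4000-token candidate, similar in size to the ones above.
    long_cand = " ".join(["tok"] * 4000)
    short_cand = "the rolling stones are an english rock band"
    out = truncate_candidates([long_cand, short_cand])
    print([len(c.split()) for c in out])
```

With truncation applied first, padding is only ever needed up to `max_tokens` per candidate rather than up to the longest raw candidate.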

stephenroller commented 4 years ago

There might be a case where the true knowledge isn’t in the candidates. We had that problem before and thought we got them all.

g-karthik commented 4 years ago

We truncate the knowledge, I think to 128 tokens.

You mean you truncate each candidate down to 128 tokens as a dataset-wide rule?

There might be a case where the true knowledge isn’t in the candidates. We had that problem before and thought we got them all.

Yes, there are multiple such cases, I wouldn't have encountered them when running my code otherwise. I'm just dropping those examples from training for now, but you should probably fix this in the public version of the dataset.

stephenroller commented 4 years ago

Agreed, thanks for reporting

I think what we did was just add it to the list. Can you say which specific teacher you’re using?

As for the truncation, we do that in our agents. End2end uses the argument --knowledge-truncate I think.
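The "add it to the list" approach mentioned above could be sketched as follows. This is an illustration under the assumption that labels and candidates are plain strings compared by exact match; `ensure_label_in_cands` is a hypothetical name, not something from the ParlAI codebase.

```python
from typing import List, Tuple


def ensure_label_in_cands(label: str, label_cands: List[str]) -> Tuple[List[str], int]:
    """Return (candidates, index of label), appending the label if it is absent."""
    cands = list(label_cands)  # copy so the caller's list is not mutated
    if label not in cands:
        cands.append(label)
    return cands, cands.index(label)
```

Compared with dropping the example, this keeps every wizard turn in the training set, at the cost of a candidate list that the wizard may not have seen in exactly that form.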

g-karthik commented 4 years ago

I wasn't using a specific teacher when I found these issues, this was using some code I wrote that was based off the WizardDialogKnowledgeTeacher.

g-karthik commented 4 years ago

In the code snippet here, shouldn't lines 306 and 309 also check self.knowledge_separator and add TOKEN_KNOWLEDGE if true, just like lines 313-314?
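To illustrate why this matters: if the label and the candidates are formatted by different code paths, enabling the separator for only one of them makes exact-match lookup fail. The sketch below is not the code at the linked lines; the function names are hypothetical, and the token values are assumed placeholders that should be checked against the constants in the ParlAI source.

```python
from typing import Dict, List, Tuple

# Assumed values, for illustration only.
TOKEN_KNOWLEDGE = "__knowledge__"
TOKEN_NOCHOSEN = "no_passages_used"


def format_knowledge(title: str, sentence: str, use_separator: bool) -> str:
    """Single formatting path shared by the label and every candidate."""
    if use_separator:
        return f"{title} {TOKEN_KNOWLEDGE} {sentence}"
    return f"{title} {sentence}"


def build_label_and_cands(
    chosen_title: str,
    chosen_sent: str,
    knowledge_dict: Dict[str, List[str]],
    use_separator: bool,
) -> Tuple[str, List[str]]:
    """Format the gold label and all candidates with the same rule."""
    label = format_knowledge(chosen_title, chosen_sent, use_separator)
    # The "nothing chosen" option is always offered as a candidate.
    cands = [format_knowledge(TOKEN_NOCHOSEN, TOKEN_NOCHOSEN, use_separator)]
    for title, sentences in knowledge_dict.items():
        for sent in sentences:
            cands.append(format_knowledge(title, sent, use_separator))
    return label, cands
```

Routing both the label and the candidates through one helper means the separator flag cannot be applied to one and forgotten for the other.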

g-karthik commented 4 years ago

We truncate the knowledge, I think to 128 tokens.

You mean you truncate each candidate down to 128 tokens as a dataset-wide rule?

There might be a case where the true knowledge isn’t in the candidates. We had that problem before and thought we got them all.

Yes, there are multiple such cases, I wouldn't have encountered them when running my code otherwise. I'm just dropping those examples from training for now, but you should probably fix this in the public version of the dataset.

@stephenroller actually unless I'm missing something, there's a lot of such cases. I was working with a much smaller sample of the dataset before.

stephenroller commented 4 years ago

I think something is wrong. We fixed this, and I’m certain it only occurs at most once or twice in the entire dataset. Maybe @klshuster remembers the details of his fix.

g-karthik commented 4 years ago

Hmm it's possible that the version of the dataset I'm working with is outdated, although I'm not sure how I would be able to tell. Did you have a version associated with each update to the public tgz file?

On Fri, Feb 21, 2020, 5:24 PM Stephen Roller notifications@github.com wrote:

I think something is wrong. We fixed this, and I’m certain it only occurs at most once or twice in the entire dataset. Maybe @klshuster https://github.com/klshuster remembers the details of his fix.

klshuster commented 4 years ago

We haven't updated the raw data itself so that wouldn't be the issue. For what it's worth, the default label is the response, not the chosen knowledge. I am not sure what command you are running but if you are using the WizardDialogKnowledgeTeacher you'll need to specify --label-type chosen_sent to get the knowledge in the label.

I ran python parlai/scripts/verify_data.py -t wizard_of_wikipedia:WizardDialogKnowledge --label-type chosen_sent to double check that everything is correct and found that only 1 example had a label missing from the candidate set:

{"missing_text": 0, "missing_labels": 0, "missing_label_candidates": 0, "empty_string_label_candidates": 0, "label_candidates_with_missing_label": 1, "did_not_return_message": 0, "exs": 74092, "%done": "100.00%", "time_left": "0s"}

g-karthik commented 4 years ago

@klshuster like I mentioned earlier in this thread, I wasn't running any particular ParlAI script when I found these issues. I wrote some code for knowledge selection outside the ParlAI framework that was based off the WizardDialogKnowledgeTeacher. I think perhaps like @stephenroller mentioned earlier, missing labels are added to the candidate list before the verification part of verify_data.py kicks in? Although on the flip side, if that were the case, you shouldn't find even 1 example like you do above. Either way, I'll dig into my code a little more and probably just add the missing label to the candidate list.

In the code snippet here, shouldn't lines 306 and 309 also check self.knowledge_separator and add TOKEN_KNOWLEDGE if true, just like lines 313-314?

could either of you address this question as well?

klshuster commented 4 years ago

Regarding your question about lines 306/309: yes, I believe you are right, it should have the same check for self.knowledge_separator.

Regarding the missing label in the candidate set: are you sure that you are populating the full candidate set correctly? The wizard has access to knowledge from three sources: 1. knowledge retrieved from the previous apprentice utterance; 2. knowledge retrieved from the previous wizard utterance; and 3. knowledge retrieved from the chosen_topic. Nothing is being done to the example beyond what is being done in WizardDialogKnowledgeTeacher.get before verifying the data in verify_data.py.

g-karthik commented 4 years ago

@klshuster yeah I'm doing something like this:

# data: the list of dialogs loaded from a file such as train.json
for element in data:
    dialog = element["dialog"]
    chosen_topic = element.get("chosen_topic", "")
    chosen_topic_passage = element["chosen_topic_passage"]

    for idx, turn in enumerate(dialog):
        speaker = turn["speaker"]
        if "wizard" in speaker.lower():
            # create examples only for wizard turns
            apprentice_ret_passages, wizard_ret_passages = [], []
            if idx != 0:
                # passages retrieved for the previous (apprentice) turn
                apprentice_entry = dialog[idx - 1]
                apprentice_ret_passages = apprentice_entry["retrieved_passages"]
            if idx - 2 >= 0:
                # passages retrieved for the previous wizard turn
                wizard_prev_entry = dialog[idx - 2]
                wizard_ret_passages = wizard_prev_entry["retrieved_passages"]

            # the chosen topic comes first; later duplicates of a title are ignored
            knowledge_dict = {chosen_topic: chosen_topic_passage}
            for ret_passes in [apprentice_ret_passages, wizard_ret_passages]:
                for passage in ret_passes:
                    for k, v in passage.items():
                        if k not in knowledge_dict:
                            knowledge_dict[k] = v

            wizard_entry = turn
            # more code below to get chosen title and sent, as well as populate label_cands from knowledge_dict

klshuster commented 4 years ago

could you please flag some examples where you find that the gold knowledge label is missing from the knowledge candidates? and also could you give a rough estimate to the percentage of the data with which you are experiencing this situation?

g-karthik commented 4 years ago

@klshuster I think I figured out the issue. I took just the test_random_split.json file and ran the code, with some counters this time:

# of examples with label in the candidate list = 4110
# of examples without label in the candidate list = 246
Total # of examples = 4110 + 246 = 4356
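For reference, counters like these, plus the rough percentage asked for above, could be produced with something like the sketch below. It assumes each example has already been reduced to a (label, label_cands) pair; `label_coverage` is a hypothetical helper.

```python
from typing import Iterable, List, Tuple


def label_coverage(examples: Iterable[Tuple[str, List[str]]]) -> Tuple[int, int, float]:
    """Return (found, missing, percent_missing) over (label, label_cands) pairs."""
    found = missing = 0
    for label, cands in examples:
        if label in cands:
            found += 1
        else:
            missing += 1
    total = found + missing
    pct = 100.0 * missing / total if total else 0.0
    return found, missing, pct


if __name__ == "__main__":
    # Plug in the counts reported above for test_random_split.json.
    found, missing = 4110, 246
    print(f"{100.0 * missing / (found + missing):.2f}% missing")
```

With the counts above, 246 of 4356 examples works out to roughly 5.65% of the test split.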

I picked the first example that belonged to the above 246 from my logs: Label not in list of candidates, hence skipping example! See dialog corresponding to chosen_topic The Rolling Stones, persona i like the group the rolling stones., wizard_eval=5, turn idx=4

I checked out the dialog in the file and the 5th turn (corresponding to turn idx=4 above):

                "checked_sentence": {
                    "no_passages_used": "no_passages_used"
                },
                "checked_passage": {
                    "no_passages_used": "no_passages_used"
                }

The issue was that I had created a special token called <no_passages_used>, not no_passages_used. And I was using this special token as-is inside the _get_chosen_title_and_sent() method, without accounting for the fact that the data files actually don't have the <> tags surrounding the token. I removed the <> tags from my special token and all examples in the test set seem to have the label in the candidate list now!
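A small sketch of the corrected lookup, assuming (from the JSON fragment above) that checked_sentence and checked_passage are single-entry dicts whose values hold the sentence and the title. `chosen_title_and_sent` is a simplified, hypothetical stand-in for the helper named above, not ParlAI's implementation.

```python
from typing import Dict, Tuple

# Must match the raw data exactly: no surrounding angle brackets.
NO_PASSAGES_USED = "no_passages_used"


def chosen_title_and_sent(turn: Dict) -> Tuple[str, str]:
    """Return (title, sentence) for a wizard turn, using the sentinel if absent."""
    checked_sentence = turn.get("checked_sentence", {})
    checked_passage = turn.get("checked_passage", {})
    # Take the single value of each dict, falling back to the sentinel when empty.
    sentence = next(iter(checked_sentence.values()), NO_PASSAGES_USED)
    title = next(iter(checked_passage.values()), NO_PASSAGES_USED)
    return title, sentence
```

Because label lookup is an exact string match, any decoration added to the sentinel on one side only (such as the <> tags) is enough to make the label disappear from the candidate list.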

klshuster commented 4 years ago

Great, I'm glad you identified the problem. I'll go ahead and close this issue as it appears everything is resolved, feel free to reopen if you are still stuck.

g-karthik commented 4 years ago

In the code snippet here, shouldn't lines 306 and 309 also check self.knowledge_separator and add TOKEN_KNOWLEDGE if true, just like lines 313-314?

if you guys plan to fix this, feel free to reference this issue and then close it!

g-karthik commented 4 years ago

@klshuster whoops, looks like you closed the issue at the same time I posted the above -- and I can't seem to reopen the issue.

klshuster commented 4 years ago

Closing, as fixed in #2437

g-karthik commented 4 years ago

@klshuster in the raw data, did you fix that 1 example you found with verify_data.py that had the label missing from the candidate set?

The issue was that I had created a special token called <no_passages_used>, not no_passages_used. And I was using this special token as-is inside the _get_chosen_title_and_sent() method, without accounting for the fact that the data files actually don't have the <> tags surrounding the token. I removed the <> tags from my special token and all examples in the test set seem to have the label in the candidate list now!

With the above fix I'm referring to, I tried the train.json file this time (I tried test_random_split.json the last time and it was fine!), and found 3 examples with label missing from the candidate set:

INFO: Label not in list of candidates, hence skipping example! See dialog corresponding to chosen_topic Ford Motor Company, persona i drive a ford truck., wizard_eval=4, turn idx=8
INFO: Label not in list of candidates, hence skipping example! See dialog corresponding to chosen_topic Hamburger, persona i love hamburgers., wizard_eval=3, turn idx=6
INFO: Label not in list of candidates, hence skipping example! See dialog corresponding to chosen_topic Hamburger, persona i love hamburgers., wizard_eval=3, turn idx=8

I tried to look for the dialog corresponding to the first log above in the train.json and I actually see 8 dialogs satisfying the constraints chosen_topic Ford Motor Company, persona i drive a ford truck., wizard_eval=4. I didn't bother checking each of the 8 dialogs, but perhaps you should take a look.

Also, for the remaining two logs, I see there are 3 dialogs satisfying those constraints.

klshuster commented 4 years ago

Hi @g-karthik, given the small number of examples relative to the size of the dataset in which this issue occurs, we currently have no plans to modify the raw data.