Also, I tried searching for the ground-truth label (which I presume is this) in the label_cands list, by doing label_cands.index(labels[0]) within a try-except block, and I found that there are cases where labels[0] does not actually exist in the list of label_cands.

How do you handle such cases where the ground-truth label does not exist in the set of candidates?
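For concreteness, the check I'm running is roughly the following (a minimal sketch; labels and label_cands are the fields from the example):

# Look up the ground-truth label among the candidates; list.index raises
# ValueError when labels[0] is missing from label_cands entirely.
try:
    label_idx = label_cands.index(labels[0])
except ValueError:
    label_idx = None  # ground-truth label not found among the candidates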
We truncate the knowledge, I think to 128 tokens.
There might be a case where the true knowledge isn’t in the candidates. We had that problem before and thought we got them all.
> We truncate the knowledge, I think to 128 tokens.

You mean you truncate each candidate down to 128 tokens as a dataset-wide rule?
> There might be a case where the true knowledge isn’t in the candidates. We had that problem before and thought we got them all.

Yes, there are multiple such cases; I wouldn't have encountered them when running my code otherwise. I'm just dropping those examples from training for now, but you should probably fix this in the public version of the dataset.
Agreed, thanks for reporting
I think what we did was just add it to the list. Can you say which specific teacher you’re using?
As for the truncation, we do that in our agents. End2end uses the argument --knowledge-truncate I think.
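Illustratively, the effect is along these lines (a sketch, not the agent's actual code; a whitespace split stands in here for the agent's own tokenizer):

def truncate_knowledge(candidates, max_tokens=128):
    # Keep only the first max_tokens tokens of each knowledge candidate,
    # roughly what --knowledge-truncate 128 does inside the agent.
    return [" ".join(cand.split()[:max_tokens]) for cand in candidates]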
I wasn't using a specific teacher when I found these issues; this was using some code I wrote that was based off the WizardDialogKnowledgeTeacher.

In the code snippet here, shouldn't lines 306 and 309 also check self.knowledge_separator and add TOKEN_KNOWLEDGE if true, just like lines 313-314?
@stephenroller actually, unless I'm missing something, there are a lot of such cases. I was working with a much smaller sample of the dataset before.
I think something is wrong. We fixed this, and I’m certain it only occurs at most once or twice in the entire dataset. Maybe @klshuster remembers the details of his fix.
Hmm it's possible that the version of the dataset I'm working with is outdated, although I'm not sure how I would be able to tell. Did you have a version associated with each update to the public tgz file?
We haven't updated the raw data itself, so that wouldn't be the issue. For what it's worth, the default label is the response, not the chosen knowledge. I am not sure what command you are running, but if you are using the WizardDialogKnowledgeTeacher you'll need to specify --label-type chosen_sent to get the knowledge in the label.
I ran python parlai/scripts/verify_data.py -t wizard_of_wikipedia:WizardDialogKnowledge --label-type chosen_sent to double check that everything is correct, and found that only 1 example had a label missing from the candidate set:
{"missing_text": 0, "missing_labels": 0, "missing_label_candidates": 0, "empty_string_label_candidates": 0, "label_candidates_with_missing_label": 1, "did_not_return_message": 0, "exs": 74092, "%done": "100.00%", "time_left": "0s"}
@klshuster like I mentioned earlier in this thread, I wasn't running any particular ParlAI script when I found these issues. I wrote some code for knowledge selection outside the ParlAI framework that was based off the WizardDialogKnowledgeTeacher. I think perhaps, like @stephenroller mentioned earlier, missing labels are added to the candidate list before the verification part of verify_data.py kicks in? Although on the flip side, if that were the case, you shouldn't find even 1 example like you do above. Either way, I'll dig into my code a little more and probably just add the missing label to the candidate list.
> In the code snippet here, shouldn't lines 306 and 309 also check self.knowledge_separator and add TOKEN_KNOWLEDGE if true, just like lines 313-314?

Could either of you address this question as well?
regarding your question about lines 306/309, yes i believe you are right, it should have the same check for self.knowledge_separator
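roughly, the fixed branches would mirror lines 313-314, e.g. (a sketch; the variable names here are made up, only self.knowledge_separator and TOKEN_KNOWLEDGE come from the teacher):

# sketch: build the candidate the same way on every branch
if self.knowledge_separator:
    cand = title + ' ' + TOKEN_KNOWLEDGE + ' ' + sentence
else:
    cand = title + ' ' + sentence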
regarding the missing label in the candidate set - are you sure that you are populating the full candidate set correctly? the wizard has access to knowledge from three sources: 1. knowledge retrieved from the previous apprentice utterance; 2. knowledge retrieved from the previous wizard utterance; and 3. knowledge retrieved from the chosen_topic. Nothing is being done to the example beyond what is done in WizardDialogKnowledgeTeacher.get before the data is verified in verify_data.py
@klshuster yeah I'm doing something like this:
for element in data:
    dialog = element["dialog"]
    chosen_topic = element.get("chosen_topic", "")
    chosen_topic_passage = element["chosen_topic_passage"]
    for idx, turn in enumerate(dialog):
        speaker = turn["speaker"]
        if "wizard" in speaker.lower():
            # create examples only for wizard turns
            apprentice_ret_passages = wizard_ret_passages = {}
            if idx != 0:
                apprentice_entry = dialog[idx - 1]
                apprentice_ret_passages = apprentice_entry["retrieved_passages"]
            if idx - 2 >= 0:
                wizard_prev_entry = dialog[idx - 2]
                wizard_ret_passages = wizard_prev_entry["retrieved_passages"]
            knowledge_dict = {chosen_topic: chosen_topic_passage}
            for ret_passes in [apprentice_ret_passages, wizard_ret_passages]:
                for passage in ret_passes:
                    for k, v in passage.items():
                        if k not in knowledge_dict.keys():
                            knowledge_dict[k] = v
            wizard_entry = turn
            # more code below to get chosen title and sent, as well as populate label_cands from knowledge_dict
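The elided part is roughly along these lines (a simplified sketch, not the exact code; note the bare no_passages_used string, which is where the mismatch described below crops up):

NO_PASSAGES = "no_passages_used"  # bare string in the raw data, no <> tags

# checked_sentence is a single-entry dict, e.g.
# {"self_TheRollingStones_0": "some sentence"} or
# {"no_passages_used": "no_passages_used"}
checked = wizard_entry.get("checked_sentence", {})
chosen_sent = next(iter(checked.values()), NO_PASSAGES)

# every sentence of every passage is a candidate, plus the explicit
# "no passages used" option
label_cands = [NO_PASSAGES]
for _title, sentences in knowledge_dict.items():
    label_cands.extend(sentences)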
could you please flag some examples where you find that the gold knowledge label is missing from the knowledge candidates? and also, could you give a rough estimate of the percentage of the data for which you are experiencing this situation?
@klshuster I think I figured out the issue. I took just the test_random_split.json file and ran the code, with some counters this time:
# of examples with label in the candidate list = 4110
# of examples without label in the candidate list = 246
Total # of examples = 4110 + 246 = 4356
I picked the first example that belonged to the above 246 from my logs:
Label not in list of candidates, hence skipping example! See dialog corresponding to chosen_topic The Rolling Stones, persona i like the group the rolling stones., wizard_eval=5, turn idx=4
I checked out the dialog in the file and the 5th turn (corresponding to turn idx=4 above):
"checked_sentence": {
"no_passages_used": "no_passages_used"
},
"checked_passage": {
"no_passages_used": "no_passages_used"
}
The issue was that I had created a special token called <no_passages_used>, not no_passages_used. And I was using this special token as-is inside the _get_chosen_title_and_sent() method, without accounting for the fact that the data files actually don't have the <> tags surrounding the token. I removed the <> tags from my special token, and all examples in the test set seem to have the label in the candidate list now!
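In other words, the mismatch boiled down to something like:

# what my special token looked like vs. what the raw JSON contains
"<no_passages_used>" in ["no_passages_used"]  # False -> label looks missing
"no_passages_used" in ["no_passages_used"]    # True once the <> tags are dropped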
Great, I'm glad you identified the problem. I'll go ahead and close this issue as it appears everything is resolved; feel free to reopen if you are still stuck.
> In the code snippet here, shouldn't lines 306 and 309 also check self.knowledge_separator and add TOKEN_KNOWLEDGE if true, just like lines 313-314?

if you guys plan to fix this, feel free to reference this issue and then close it!
@klshuster whoops, looks like you closed the issue at the same time I posted the above -- and I can't seem to reopen the issue.
Closing, as fixed in #2437
@klshuster in the raw data, did you fix that 1 example you found with verify_data.py that had the label missing from the candidate set?
With the <no_passages_used> fix I described above, I tried the train.json file this time (I tried test_random_split.json last time and it was fine!) and found 3 examples with the label missing from the candidate set:
INFO: Label not in list of candidates, hence skipping example! See dialog corresponding to chosen_topic Ford Motor Company, persona i drive a ford truck., wizard_eval=4, turn idx=8
INFO: Label not in list of candidates, hence skipping example! See dialog corresponding to chosen_topic Hamburger, persona i love hamburgers., wizard_eval=3, turn idx=6
INFO: Label not in list of candidates, hence skipping example! See dialog corresponding to chosen_topic Hamburger, persona i love hamburgers., wizard_eval=3, turn idx=8
I tried to look for the dialog corresponding to the first log above in train.json and I actually see 8 dialogs satisfying the constraints chosen_topic Ford Motor Company, persona i drive a ford truck., wizard_eval=4. I didn't bother checking each of the 8 dialogs, but perhaps you should take a look.
Also, for the remaining two logs, I see there are 3 dialogs satisfying those constraints.
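For reference, the lookup I'm doing is along these lines (a sketch; field names as they appear in the raw JSON):

import json

with open("train.json") as f:
    data = json.load(f)

# count dialogs matching the constraints from the first log line above
matches = [
    d for d in data
    if d["chosen_topic"] == "Ford Motor Company"
    and d["persona"].strip() == "i drive a ford truck."
    and d["wizard_eval"] == 4
]
print(len(matches))  # 8 for these constraints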
Hi @g-karthik, given the small number of examples in which this issue occurs relative to the size of the dataset, we currently have no plans to modify the raw data.
Hello!
I'm trying to parse examples to train a knowledge selection model with the Wizard of Wikipedia dataset. I'm creating label candidates by following the code here.
I am finding that there are some cases where some candidates fetched using the code above have > 4000 tokens. Here are a couple of sample candidates I printed from the label_cands list in the above linked code. Search for Index of philatelic articles in train.json and you will find the below in the list of candidates. Another case: search for TRL's Number Ones in train.json and you will find the below in the list of candidates.

Were such candidates rendered to the Wizard during the conversation? Seems like a lot of content to render on their screen.
Also, how do you handle such candidates when training the knowledge selection model? It seems some of these candidates have upwards of 4000 tokens! Would you pad all candidates to the maximum sequence length (which would be like 4000 based on this example)?
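For reference, a quick way to surface such oversized candidates, using a rough whitespace token count over the label_cands built by the linked code:

# flag candidates whose rough token count is very large
for cand in label_cands:
    n_tokens = len(cand.split())
    if n_tokens > 4000:
        print(n_tokens, cand[:100])  # length and a short preview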