Step5 KeyError occurred

JasonCen-sweetdreams commented 2 years ago

When I run simplify_dataset.py, it turns out that there are some irregular keys in the JSON file train.json. The data in this file was split automatically from CWQ_step1_01.json using train_test_split from sklearn.

The strange keys include ' royalty', which is supposed to be 'royalty', and ' Using various tricks of light, perspective and erasure...' (both are looked up via entity2id[obj['text']]).

I guess there might be some errors in CWQ/subgraph/subgraph_hop2.txt or preprocess_step1.py. If needed, I can provide the code I used to split the data from CWQ_step1_01.json.

The output is here:

Traceback (most recent call last):
  File "simplify_dataset.py", line 70, in <module>
    simplify_data(input_file, output_file, entity2id, relation2id)
  File "simplify_dataset.py", line 42, in simplify_data
    tp_dict["subgraph"]["tuples"] = simplify_tuples(tp_dict["subgraph"]["tuples"], entity2id, relation2id)
  File "simplify_dataset.py", line 27, in simplify_tuples
    tail = entity2id[obj['text']]
KeyError: '       Using various tricks of light, perspective and erasure, the artworks in Shadows, Disappearances and Illusions each short-circuit the connection between the eye and the brain.'

RichardHGL commented 2 years ago

You can first check the vocabulary (the entity2id dict) to see what is missing. I think the code will work if the vocabulary is correct. Meanwhile, there are some differences between Meta preprocessing and Freebase preprocessing. I think for CWQ (Freebase), it's step 7. Did you miss step 6?
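
For anyone who wants to run the vocabulary check suggested above, here is a minimal sketch. The helper name find_missing_keys and the toy data are hypothetical and are not part of the repository; the only assumption taken from the thread is that entity2id is a dict keyed by entity text. It reports each text that is absent from the vocabulary and whether a whitespace-stripped version of it is present.

```python
def find_missing_keys(texts, entity2id):
    """Return (text, stripped_is_known) for every text absent from entity2id.

    Hypothetical helper for illustration; not part of simplify_dataset.py.
    """
    missing = []
    for text in texts:
        if text not in entity2id:
            # True means the key only differs by leading/trailing whitespace.
            missing.append((text, text.strip() in entity2id))
    return missing


# Toy vocabulary and lookups (made-up data, not taken from CWQ).
entity2id = {"royalty": 0, "painting": 1}
texts = ["royalty", " royalty", "painting", "sculpture"]

report = find_missing_keys(texts, entity2id)
for text, fixable in report:
    status = "found after strip()" if fixable else "not in vocabulary"
    print(repr(text), "->", status)
```

Printing with repr() makes leading spaces visible, which is how a key such as ' royalty' can be told apart from 'royalty'.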

JasonCen-sweetdreams commented 2 years ago

In simplify_dataset.py, you use strip() on line 49, but in the first method, simplify_entities(entity_list, entity2id), you don't use strip(), which causes the error. I think the correction is entity_text = entity['text'].strip()
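
A minimal sketch of the mismatch described above, assuming the vocabulary keys were stripped when the dict was built while the lookup used the raw text. The function names ending in _sketch and the toy data are hypothetical; this is not the repository's actual code, only an illustration of why the proposed strip() call resolves the KeyError.

```python
def simplify_entities_raw_sketch(entity_list, entity2id):
    """Look up entities by raw text (reproduces the failure mode)."""
    return [entity2id[entity["text"]] for entity in entity_list]


def simplify_entities_sketch(entity_list, entity2id):
    """Look up entities after stripping whitespace (the proposed correction)."""
    ids = []
    for entity in entity_list:
        entity_text = entity["text"].strip()
        ids.append(entity2id[entity_text])
    return ids


# Toy data (made up): the vocabulary holds the stripped key only.
entity2id = {"royalty": 0}
entities = [{"text": " royalty"}, {"text": "royalty"}]

try:
    simplify_entities_raw_sketch(entities, entity2id)
    raw_failed = False
except KeyError:
    # ' royalty' with a leading space is not a key in the vocabulary.
    raw_failed = True

ids = simplify_entities_sketch(entities, entity2id)
```

The general point is that the same normalization has to be applied both when the vocabulary is built and when it is queried; stripping on only one side leaves keys that can never match.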

RichardHGL commented 2 years ago

Yeah, that may cause this error. Thanks for your feedback; I have fixed it.

RichardHGL / WSDM2021_NSM

Step5 KeyError occurred #13