IDEA-FinAI / ToG

This is the official GitHub repo of Think-on-Graph. If you are interested in our work or willing to join our research team in Shenzhen, please feel free to contact us by email (xuchengjin@idea.edu.cn).

Clarification and Guidance Request on CWQ Dataset Preprocessing #17

Open FUTUREEEEEE opened 7 months ago

FUTUREEEEEE commented 7 months ago

Thank you for sharing the processed CWQ dataset in your repository. I've observed that the test set contains 1,203 samples, contrary to the 3,531 samples mentioned in the readme. Could you clarify this discrepancy?

I'm also seeking guidance on the dataset's preprocessing steps, particularly on extracting entities from questions and converting them into Freebase IDs.

Additionally, I noticed many duplicate questions with the same WebQSP_ID in the dataset. Could you explain the reason behind this?

Your insights on these matters would be greatly appreciated.

GasolSun36 commented 7 months ago

Hi,

1. If I checked the CWQ dataset file correctly, there are actually 3,531 samples in it.

2. Here is the pipeline for preprocessing the dataset: First, prompt the LLM to extract the entities from the question. Second, use the Wikidata API we defined in the repo's Wikidata code to convert each entity name into a QID (label2qid). Third, use the same Wikidata API to convert the QID into a Freebase MID (qid2mid).

3. This is because some samples in the CWQ test set may come from WebQSP. However, that is a property of how the dataset was constructed and has nothing to do with our algorithm; please refer to their paper for more details.
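
For reference, the three preprocessing steps in point 2 could be chained roughly as below. This is only a minimal sketch, not the repo's actual code: `link_question_entities` and the toy stand-ins are hypothetical names, the real `label2qid` and `qid2mid` signatures may differ, and the IDs are placeholders rather than real Wikidata or Freebase identifiers.

```python
from typing import Callable, Dict, List, Optional


def link_question_entities(
    question: str,
    extract_entities: Callable[[str], List[str]],
    label2qid: Callable[[str], Optional[str]],
    qid2mid: Callable[[str], Optional[str]],
) -> Dict[str, str]:
    """Map each entity name found in `question` to a Freebase MID.

    The three callables are passed in because their real implementations
    (an LLM prompt and two Wikidata lookups) are assumptions here.
    """
    linked: Dict[str, str] = {}
    for name in extract_entities(question):   # step 1: LLM entity extraction
        qid = label2qid(name)                  # step 2: name -> Wikidata QID
        if qid is None:
            continue                           # no Wikidata match, skip entity
        mid = qid2mid(qid)                     # step 3: QID -> Freebase MID
        if mid is None:
            continue                           # QID has no Freebase ID, skip
        linked[name] = mid
    return linked


# Toy stand-ins so the sketch runs offline; all values are placeholders.
TOY_QIDS = {"Alice": "Q_PLACEHOLDER_1", "Bob": "Q_PLACEHOLDER_2"}
TOY_MIDS = {"Q_PLACEHOLDER_1": "m.placeholder1"}  # Bob has no MID on purpose


def toy_extract(question: str) -> List[str]:
    """Pretend LLM: return known names that literally appear in the question."""
    return [name for name in ("Alice", "Bob", "Carol") if name in question]


if __name__ == "__main__":
    print(link_question_entities(
        "Did Alice work with Bob and Carol?",
        toy_extract, TOY_QIDS.get, TOY_MIDS.get,
    ))
```

In this sketch, entities that fail either lookup are dropped silently; whether to drop them, log them, or fall back to another linker is a design choice the thread does not specify.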