RichardHGL / WSDM2021_NSM

Improving Multi-hop Knowledge Base Question Answering by Learning Intermediate Supervision Signals. WSDM 2021.

subgraph coverage for webqsp #6

Closed dinani65 closed 3 years ago

dinani65 commented 3 years ago

Hello, I am using the GraftNet preprocessing to generate question subgraphs for the WebQSP dataset. The number of questions is 4737 and the number of subgraphs that cover the answer entity is 4392, so the recall is around 0.89. Could you please explain how to achieve the coverage of around 94% that you report in the table?
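For reference, coverage can be computed along these lines from GraftNet-style per-question subgraph files; the exact field names ("entities", "answers", "kb_id") are assumptions and may differ from your files:

```python
import json

def answer_coverage(subgraph_file):
    """Fraction of questions whose retrieved subgraph contains at least
    one answer entity. Assumes one JSON object per line with 'entities'
    (subgraph entity ids) and 'answers' fields (field names assumed)."""
    covered = total = 0
    with open(subgraph_file) as f:
        for line in f:
            q = json.loads(line)
            subgraph_entities = set(q["entities"])
            answer_entities = {a["kb_id"] for a in q["answers"]}
            total += 1
            if subgraph_entities & answer_entities:
                covered += 1
    return covered / total

# e.g. answer_coverage("webqsp_subgraphs.json")
```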

RichardHGL commented 3 years ago

You can check the preprocessing folder. We use CVT nodes to include additional triples, which makes a difference.
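Roughly, the idea is to follow one extra hop whenever a neighbor is a CVT node, so that Freebase facts reified through CVT nodes are kept as full triples. A minimal sketch (the `kb_adjacency` structure and the `is_cvt` predicate are hypothetical, not functions from this repo):

```python
def expand_with_cvt(seed_entities, kb_adjacency, is_cvt):
    """Collect triples around the seed entities, taking one extra hop
    whenever the neighbor is a CVT node so that n-ary facts are included.

    kb_adjacency: dict entity -> list of (relation, object) pairs (hypothetical)
    is_cvt:       predicate telling whether an id is a CVT node (hypothetical)
    """
    triples = set()
    for ent in seed_entities:
        for rel, obj in kb_adjacency.get(ent, []):
            triples.add((ent, rel, obj))
            if is_cvt(obj):
                # one more hop through the CVT node to recover the full fact
                for rel2, obj2 in kb_adjacency.get(obj, []):
                    triples.add((obj, rel2, obj2))
    return triples
```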

dinani65 commented 3 years ago

Thanks for your reply. Where can I find the .json files for the datasets (such as CWQ_step0.json) so that I can create the seed files and follow the next steps?

RichardHGL commented 3 years ago

See preprocessing/Freebase/preprocess_step0.py after downloading the data from https://github.com/lanyunshi/KBQA-GST. Running preprocess_step0.py produces CWQ_step0.json.
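For orientation, a plausible (unverified) shape for one line of CWQ_step0.json; the later steps mainly need the question text, the topic entities and the answers. Check preprocess_step0.py for the actual keys:

```python
import json

# Assumed (not verified) shape of one line in CWQ_step0.json.
example = {
    "id": "example-question-id",                  # placeholder question id
    "question": "which country ... ?",            # question text
    "entities": ["m.placeholder_topic_entity"],   # topic entity mid(s)
    "answers": [{"kb_id": "m.placeholder_answer", "text": "answer name"}],
}
print(json.dumps(example))
```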

dinani65 commented 3 years ago

Thanks a bunch. What about the links to download the data for WebQSP and MetaQA-1hop? Could you please upload the .json output of step_0 for all datasets? Then we could use them to run PPR or any other ranking algorithm. Thanks in advance,

RichardHGL commented 3 years ago

Actually, I suggest downloading the final version from the link provided on our page instead of re-running the process. I don't remember exactly how I obtained the step_0 file for WebQSP; I may have simply taken the question, topic entities, and answers from GraftNet or from the original dataset. You may try to produce it yourself. For MetaQA, I think you'd better first have a look at the preprocessing/MetaQA folder; it is much simpler than the Freebase-based process.
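If you do build the WebQSP step_0 file yourself, a rough sketch could look like the following; the field names (QuestionId, RawQuestion, Parses, TopicEntityMid, AnswerArgument) are taken from the original WebQSP release and should be verified against your copy of the data, and the output format is only assumed to match what the later steps expect:

```python
import json

def webqsp_to_step0(webqsp_json, out_path):
    """Sketch: build a step0-style file (question, topic entities, answers)
    from the original WebQSP release. Field names assumed from the official
    WebQSP json; verify them against your data."""
    with open(webqsp_json) as f:
        data = json.load(f)["Questions"]
    with open(out_path, "w") as out:
        for q in data:
            entities, answers = set(), set()
            for parse in q.get("Parses", []):
                if parse.get("TopicEntityMid"):
                    entities.add(parse["TopicEntityMid"])
                for a in parse.get("Answers", []):
                    answers.add(a["AnswerArgument"])
            out.write(json.dumps({
                "id": q["QuestionId"],
                "question": q["RawQuestion"],
                "entities": sorted(entities),
                "answers": sorted(answers),
            }) + "\n")
```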

dinani65 commented 3 years ago

Many thanks for your reply. I intend to use a different algorithm instead of PPR to retrieve the question subgraphs, so I have to re-run the process from step_1. Since using CVT nodes requires a lot of memory, I had asked you about uploading the files required for all the steps that must run before calling retrieve_subgraph. One more question: after running step 2 (extract the 2-hop neighborhood for the topic entities in the questions), three files are created: ent_hop1.txt, subgraph_hop1.txt and subgraph_hop2.txt. Which one should be used as input for the next step? I guess we only need to set the path; in your documentation it is set to "CWQ/subgraph/CWQ_subgraph.txt".

RichardHGL commented 3 years ago

Yes, CWQ/subgraph/CWQ_subgraph.txt is the file that stores all triples kept from the 2-hop neighborhood.
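If that file is simply the union of the per-hop triple files (an assumption, not confirmed here), it can be produced with something like the following, assuming one tab-separated (head, relation, tail) triple per line:

```python
def merge_hop_files(hop_files, out_path):
    """Merge per-hop triple files (e.g. subgraph_hop1.txt, subgraph_hop2.txt)
    into one deduplicated triple file such as CWQ/subgraph/CWQ_subgraph.txt."""
    seen = set()
    with open(out_path, "w") as out:
        for path in hop_files:
            with open(path) as f:
                for line in f:
                    triple = line.rstrip("\n")
                    if triple and triple not in seen:
                        seen.add(triple)
                        out.write(triple + "\n")

# merge_hop_files(["subgraph_hop1.txt", "subgraph_hop2.txt"],
#                 "CWQ/subgraph/CWQ_subgraph.txt")
```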

dinani65 commented 3 years ago

Is preprocess_step1.py supposed to take this long? It has been running on CWQ for 15 hours and still has not finished!

RichardHGL commented 3 years ago

That does seem a little long; the PPR algorithm can take a lot of time.
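For context, the PPR step is personalized PageRank seeded at the topic entities, which is iterative and can indeed be slow on large neighborhoods. An illustrative power-iteration version (not the repo's actual implementation, which follows GraftNet's code):

```python
import numpy as np
from scipy.sparse import csr_matrix

def personalized_pagerank(triples, entity2id, seed_entities,
                          alpha=0.85, iters=50):
    """Power-iteration PPR over an undirected entity graph built from the
    retrieved triples, restarting at the topic entities. Illustrative only."""
    n = len(entity2id)
    rows, cols = [], []
    for h, _, t in triples:
        hi, ti = entity2id[h], entity2id[t]
        rows += [hi, ti]
        cols += [ti, hi]
    adj = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    # column-normalize so each entity spreads its score over its neighbors
    deg = np.asarray(adj.sum(axis=0)).ravel()
    deg[deg == 0] = 1.0
    trans = adj.multiply(1.0 / deg)

    restart = np.zeros(n)
    for e in seed_entities:
        restart[entity2id[e]] = 1.0 / len(seed_entities)
    scores = restart.copy()
    for _ in range(iters):
        scores = alpha * trans.dot(scores) + (1 - alpha) * restart
    return scores  # rank entities by score and keep the top-k per question
```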

dinani65 commented 3 years ago

I would greatly appreciate it if you could check the code that retrieves the neighbors. I am trying to create a separate fact file for each question using your code, but it is too slow even when running on an HPC cluster.

dinani65 commented 3 years ago

I am still waiting for your answer. The code that extracts the 2-hop neighborhood for the topic entities in the questions does not work well because of its long running time, even on HPC clusters. Please let me know if there is any way to address this!
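One common way to cut that runtime (not something provided by this repo) is to load the merged triple file once into an entity-to-triples index, so that each question only needs dictionary lookups instead of rescanning the whole file; a rough sketch, assuming tab-separated triples:

```python
from collections import defaultdict

def build_entity_index(subgraph_file):
    """Load the merged triple file once and index triples by head and tail,
    so per-question neighborhood extraction becomes dictionary lookups."""
    index = defaultdict(list)
    with open(subgraph_file) as f:
        for line in f:
            h, r, t = line.rstrip("\n").split("\t")
            index[h].append((h, r, t))
            index[t].append((h, r, t))
    return index

def two_hop_facts(index, topic_entities):
    """Collect triples within two hops of the topic entities."""
    facts, frontier = set(), set(topic_entities)
    for _ in range(2):
        next_frontier = set()
        for ent in frontier:
            for h, r, t in index.get(ent, []):
                facts.add((h, r, t))
                next_frontier.update((h, t))
        frontier = next_frontier - set(topic_entities)
    return facts
```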

RichardHGL commented 3 years ago

You'd better try to solve it yourself; I don't have time to look into it at the moment.

dinani65 commented 3 years ago

I apologize for taking up your time, and I would greatly appreciate it if you could answer one more question :) I have studied the PPR paper that you mention in your paper, and I am looking at the code that implements PPR (the same as the corresponding function in the GraftNet source code). My question is: does PPR itself make a difference in the results (i.e., increase recall)? (In a previous comment you mentioned that using CVT nodes to include triples makes a difference.)

Many thanks in advance,

RichardHGL commented 3 years ago

Actually, we use the same PPR algorithm; only the triples were obtained with extra CVT processing.