PhilippChr / CLOCQ

Code for our WSDM 2022 paper. CLOCQ is a framework which allows efficient access to knowledge bases (KB) for functionalities related to question answering (QA). CLOCQ can retrieve a set of relevant facts from the KB for a given user question. Further, it provides efficient retrieval of KB-neighborhoods, KB-connectivities and labels, aliases etc.
https://clocq.mpi-inf.mpg.de
MIT License

stuck in Dynamic usage of CLOCQ - within a single script: #4

Closed sarthakgupta-sg closed 1 year ago

sarthakgupta-sg commented 1 year ago

Hi. I followed the steps in the README to set everything up. But when I run the code from the "Dynamic usage of CLOCQ" section, I get: Dictionaries successfully loaded. KB loading started.

It has been stuck there for 2 hours. Does this loading take a lot of time initially, or am I missing something?

PhilippChr commented 1 year ago

Hi, yes, if you choose to load CLOCQ dynamically, this can lead to loading times of up to 2 hours, since the KB index is constructed on-the-fly. Runtimes will afterwards be faster than with the public API. But if the starting time is a bottleneck for you, going with the public API might be preferable. Regards, Philipp

sarthakgupta-sg commented 1 year ago

Will loading take this long even when fetching triples for a single question, as in the README example:

from clocq import CLOCQ

TAGME_TOKEN = "<INSERT YOUR TOKEN HERE>"
cl = CLOCQ(tagme_token=TAGME_TOKEN)

res = cl.get_label("Q5")
print(res)

res = cl.get_search_space("who was the screenwriter for Crazy Rich Asians?")
print(res.keys())

PhilippChr commented 1 year ago

Unfortunately, yes.

The problem is that even for a single question, you need to load the full KB, i.e. the full KB index. Loading the index once takes 2-3 hours, depending on the exact KB dump you are using.

Regards, Philipp

sarthakgupta-sg commented 1 year ago

Oh, okay. Also, since the Wikidata knowledge base is updated frequently, do I need to make changes to the KB, or will the 2020 version used here work fine?

PhilippChr commented 1 year ago

This depends on what you want to do with the KB. The KB linked in this repo is from 2020, but we also have a dump from 2022 here: https://github.com/PhilippChr/wikidata-core-for-QA.

If you need the latest available data, the 2022 dump might be the better choice. If you want to compare against an existing method, the right choice might be different, e.g. the KB version that method was evaluated on. The 2022 dump is also a bit larger, and might take more time and RAM to load.

sarthakgupta-sg commented 1 year ago

So, does this link https://github.com/PhilippChr/wikidata-core-for-QA have precomputed triples for every question in LC-QuAD 2.0? My primary aim is to fetch as many triples as possible for every question in the Mintaka dataset. The README of the repository you linked says that you form n-triples; what exactly does that mean?

sarthakgupta-sg commented 1 year ago

Also, do I need to load the KB every time I run the script? If yes, is there an alternative? I need to fetch triples for every question in Mintaka (train, dev, and test), so reloading the KB each time would be very expensive in both time and compute.

PhilippChr commented 1 year ago

No, the link only has the KB.

If your primary goal is to retrieve relevant KB facts for every question in LC-QuAD 2.0, I would suggest simply using our public API. You do not need to set up CLOCQ on your end, and you can directly use the 2022 dump. Further, you can control the number of facts with the parameters k and p, as outlined in the CLOCQ paper.

See example code below:

from clocq.interface.CLOCQInterfaceClient import CLOCQInterfaceClient

clocq = CLOCQInterfaceClient(host="https://clocq.mpi-inf.mpg.de/api", port="443")

question = "who plays Viserys in GRRM's latest HBO series?"

# default parameters
res = clocq.get_search_space(question)
print(f'Retrieved {len(res["search_space"])} facts for the question with default params')

# k=5
res = clocq.get_search_space(question, parameters={"k": 5})
print(f'Retrieved {len(res["search_space"])} facts for the question, with k=5')

# k=10
res = clocq.get_search_space(question, parameters={"k": 10})
print(f'Retrieved {len(res["search_space"])} facts for the question, with k=10')

question = "who is the coach of Barcelona?"

# default parameters
res = clocq.get_search_space(question)
print(f'Retrieved {len(res["search_space"])} facts for the question with default params')

# k=5, p=10
res = clocq.get_search_space(question, parameters={"k": 5, "p_setting": 10})
print(f'Retrieved {len(res["search_space"])} facts for the question, with k=5, p=10')

# k=5, p=10k
res = clocq.get_search_space(question, parameters={"k": 5, "p_setting": 10000})
print(f'Retrieved {len(res["search_space"])} facts for the question, with k=5, p=10000')

# k=10, p=10
res = clocq.get_search_space(question, parameters={"k": 10, "p_setting": 10})
print(f'Retrieved {len(res["search_space"])} facts for the question, with k=10, p=10')

# k=10, p=10k
res = clocq.get_search_space(question, parameters={"k": 10, "p_setting": 10000})
print(f'Retrieved {len(res["search_space"])} facts for the question, with k=10, p=10000')

PhilippChr commented 1 year ago

You can load the data and the KB once, and then process the full dataset in a single run.
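
A minimal sketch of that pattern, assuming a client object that exposes get_search_space(question, parameters=...) as in the example above, and assuming the returned facts are JSON-serializable. The names StubClient, retrieve_all, and search_spaces.jsonl are hypothetical and only used for illustration; the stub stands in for the real client so the snippet runs without network access or a loaded KB.

```python
import json
import os
import tempfile


class StubClient:
    """Hypothetical stand-in for a CLOCQ client; always returns no facts."""

    def get_search_space(self, question, parameters=None):
        return {"search_space": []}


def retrieve_all(client, questions, out_path, parameters=None):
    """Query the client once per question; write one JSON line per question."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for question in questions:
            if parameters is None:
                res = client.get_search_space(question)
            else:
                res = client.get_search_space(question, parameters=parameters)
            record = {"question": question, "facts": res["search_space"]}
            f.write(json.dumps(record) + "\n")
            count += 1
    return count


# The client is created once, outside the loop, and reused for every question.
client = StubClient()
questions = [
    "who is the coach of Barcelona?",
    "who was the screenwriter for Crazy Rich Asians?",
]
out_path = os.path.join(tempfile.gettempdir(), "search_spaces.jsonl")
written = retrieve_all(client, questions, out_path, parameters={"k": 5})
print(f"Wrote {written} search spaces to {out_path}")
```

To use it for real, replace StubClient() with the client you already constructed; the expensive setup then happens only once per run, and later experiments can read the facts back from the file instead of querying again.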

sarthakgupta-sg commented 1 year ago

Thanks. I'll just try out the above code and reach back to you if I need further help. Thank you for taking out time.

sarthakgupta-sg commented 1 year ago

Where can I see the output of the code above? It's not in the result directory.

PhilippChr commented 1 year ago

I am not sure I understand your question. You can just copy the code and access the output via the variables, store it on disk, etc.

E.g. you can iterate through the first 5 KB facts via:

for fact in res["search_space"][:5]:
    print(fact)
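
If a more readable form is useful, here is a small sketch that turns one fact into a string. It assumes each fact is a list of dicts with 'id' and 'label' keys; that is an assumption about the response format, and fact_to_text and the sample fact are only illustrative.

```python
def fact_to_text(fact):
    """Join the labels of all items in a fact into one readable string."""
    return ", ".join(item["label"] for item in fact)


# Illustrative fact in the assumed format (list of {'id', 'label'} dicts).
sample_fact = [
    {"id": "Q31043671", "label": "2018 FIFA World Cup Final"},
    {"id": "P31", "label": "instance of"},
    {"id": "Q12708896", "label": "FIFA World Cup final"},
]
print(fact_to_text(sample_fact))
```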

sarthakgupta-sg commented 1 year ago

Hi. So, I was trying to retrieve facts for the question as mentioned in the paper:

Q: Who scored in the 2018 final between France and Croatia?

Facts mentioned in the paper: {⟨ 2018 FIFA World Cup Final, goal scored by, Paul Pogba; for team, France football team ⟩, ⟨ 2018 FIFA World Cup Final, goal scored by, Ivan Perisic; for team, Croatia football team ⟩, . . .}

Facts that I am able to generate for different parameters:

Retrieved 495 facts for the question with default params
[{'id': 'Q31043671', 'label': '2018 FIFA World Cup Final'}, {'id': 'P31', 'label': 'instance of'}, {'id': 'Q12708896', 'label': 'FIFA World Cup final'}]
[{'id': 'Q31043671', 'label': '2018 FIFA World Cup Final'}, {'id': 'P585', 'label': 'point in time'}, {'id': '"2018-07-15T00:00:00Z"', 'label': '15 July "2018'}]
[{'id': 'Q31043671', 'label': '2018 FIFA World Cup Final'}, {'id': 'P1923', 'label': 'participating team'}, {'id': 'Q43249937', 'label': 'France at the 2018 FIFA World Cup\u200E'}]
[{'id': 'Q31043671', 'label': '2018 FIFA World Cup Final'}, {'id': 'P1363', 'label': 'points/goal scored by'}, {'id': 'Q455462', 'label': 'Antoine Griezmann'}]
[{'id': 'Q31043671', 'label': '2018 FIFA World Cup Final'}, {'id': 'P641', 'label': 'sport'}, {'id': 'Q2736', 'label': 'soccer'}]

Retrieved 2977 facts for the question, with k=5
[{'id': 'Q31043671', 'label': '2018 FIFA World Cup Final'}, {'id': 'P31', 'label': 'instance of'}, {'id': 'Q12708896', 'label': 'FIFA World Cup final'}]
[{'id': 'Q31043671', 'label': '2018 FIFA World Cup Final'}, {'id': 'P585', 'label': 'point in time'}, {'id': '"2018-07-15T00:00:00Z"', 'label': '15 July "2018'}]
[{'id': 'Q31043671', 'label': '2018 FIFA World Cup Final'}, {'id': 'P1923', 'label': 'participating team'}, {'id': 'Q43249937', 'label': 'France at the 2018 FIFA World Cup\u200E'}]
[{'id': 'Q31043671', 'label': '2018 FIFA World Cup Final'}, {'id': 'P1363', 'label': 'points/goal scored by'}, {'id': 'Q455462', 'label': 'Antoine Griezmann'}]
[{'id': 'Q31043671', 'label': '2018 FIFA World Cup Final'}, {'id': 'P641', 'label': 'sport'}, {'id': 'Q2736', 'label': 'soccer'}]

Retrieved 5730 facts for the question, with k=10
[{'id': 'Q31043671', 'label': '2018 FIFA World Cup Final'}, {'id': 'P31', 'label': 'instance of'}, {'id': 'Q12708896', 'label': 'FIFA World Cup final'}]
[{'id': 'Q31043671', 'label': '2018 FIFA World Cup Final'}, {'id': 'P585', 'label': 'point in time'}, {'id': '"2018-07-15T00:00:00Z"', 'label': '15 July "2018'}]
[{'id': 'Q31043671', 'label': '2018 FIFA World Cup Final'}, {'id': 'P1923', 'label': 'participating team'}, {'id': 'Q43249937', 'label': 'France at the 2018 FIFA World Cup\u200E'}]
[{'id': 'Q31043671', 'label': '2018 FIFA World Cup Final'}, {'id': 'P1363', 'label': 'points/goal scored by'}, {'id': 'Q455462', 'label': 'Antoine Griezmann'}]
[{'id': 'Q31043671', 'label': '2018 FIFA World Cup Final'}, {'id': 'P641', 'label': 'sport'}, {'id': 'Q2736', 'label': 'soccer'}]

PhilippChr commented 1 year ago

The facts are not sorted in the list. In the paper, we just show the relevant results given the limited space. So you would need to process these facts in the remainder of your QA pipeline. Also, note that the examples in the paper are for illustration and motivation purposes. The exact results may be (slightly) different.
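
As one possible way to do that processing, the sketch below filters an unsorted fact list by its main predicate. It assumes each fact is a list of {'id', 'label'} dicts with the subject at index 0, the predicate at index 1, and the object at index 2 (any qualifiers would follow); the helper name facts_with_predicate is hypothetical. The sample facts are taken from the output posted above.

```python
def facts_with_predicate(facts, predicate_id):
    """Keep only facts whose main predicate (index 1) has the given id."""
    return [
        fact for fact in facts
        if len(fact) >= 3 and fact[1]["id"] == predicate_id
    ]


final = {"id": "Q31043671", "label": "2018 FIFA World Cup Final"}
facts = [
    [final, {"id": "P31", "label": "instance of"},
     {"id": "Q12708896", "label": "FIFA World Cup final"}],
    [final, {"id": "P1363", "label": "points/goal scored by"},
     {"id": "Q455462", "label": "Antoine Griezmann"}],
    [final, {"id": "P641", "label": "sport"},
     {"id": "Q2736", "label": "soccer"}],
]

# P1363 is the "points/goal scored by" predicate seen in the output above.
scorers = [fact[2]["label"] for fact in facts_with_predicate(facts, "P1363")]
print(scorers)
```

On the full search space, the same filter would surface every "points/goal scored by" fact that was retrieved, regardless of its position in the list.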

sarthakgupta-sg commented 1 year ago

One more thing I wanted to ask: I have to retrieve facts for questions from another dataset (Mintaka), but the public API is not returning good triples. Any thoughts?

sarthakgupta-sg commented 1 year ago

Hey, any idea?

sarthakgupta-sg commented 1 year ago

When I try to load CLOCQ for the new dataset, I run the command nohup bash initialize.sh &. The output I get:

[2] 37944
(env) gupta@ltgpu2:~/mintaka/CLOCQ/clocq$ nohup: ignoring input and appending output to 'nohup.out'

What does this mean?

PhilippChr commented 1 year ago

You can increase the values of k and p, and check whether the answer presence improves.
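
A rough sketch of how answer presence could be measured, assuming it is defined as the fraction of questions for which at least one gold answer id occurs anywhere in the retrieved facts. That definition, the helper name answer_presence, and the toy data (including the gold answer ids) are assumptions for illustration, not the exact evaluation code from the paper.

```python
def answer_presence(search_spaces, gold_answers):
    """Fraction of questions with at least one gold answer id among the retrieved facts."""
    if not search_spaces:
        return 0.0
    hits = 0
    for facts, answers in zip(search_spaces, gold_answers):
        retrieved_ids = {item["id"] for fact in facts for item in fact}
        if retrieved_ids & set(answers):
            hits += 1
    return hits / len(search_spaces)


# Toy data: one search space per question, in the list-of-dicts fact format.
search_spaces = [
    [[{"id": "Q31043671", "label": "2018 FIFA World Cup Final"},
      {"id": "P1363", "label": "points/goal scored by"},
      {"id": "Q455462", "label": "Antoine Griezmann"}]],
    [],  # nothing retrieved for the second question
]
gold_answers = [["Q455462"], ["Q42"]]  # hypothetical gold answer ids

presence = answer_presence(search_spaces, gold_answers)
print(f"Answer presence: {presence:.2f}")
```

Computing this once per (k, p) setting on a held-out split gives a direct way to compare settings before running the rest of the pipeline.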

Regarding the console message you shared: this is just the standard output line of the "nohup" command. The log is then written to "nohup.out".

PhilippChr commented 1 year ago

I guess this issue can be closed. Please let me know if you face any other problems!