Closed: sarthakgupta-sg closed this issue 1 year ago
Hi, yes, if you choose to load CLOCQ dynamically, this can lead to loading times of up to 2 hours, since the KB index is constructed on-the-fly. Runtimes will afterwards be faster than with the public API. But if the starting time is a bottleneck for you, going with the public API might be preferable. Regards, Philipp
Will loading take this long even when trying to fetch triples for just one question, as in the README example:
```python
from clocq import CLOCQ

TAGME_TOKEN = "<INSERT YOUR TOKEN HERE>"
cl = CLOCQ(tagme_token=TAGME_TOKEN)

res = cl.get_label("Q5")
print(res)

res = cl.get_search_space("who was the screenwriter for Crazy Rich Asians?")
print(res.keys())
```
Unfortunately, yes.
The problem is that even for a single question, you need to load the full KB, i.e. the full KB index. Loading the index once takes 2-3 hours, depending on the exact KB dump you are using.
Regards, Philipp
Oh, okay. Also, since the Wikidata knowledge base is updated frequently: do I need to make changes to the KB, or will the 2020 timestamp used here work fine?
This depends on what you want to do with the KB. The KB linked in this repo is from 2020, but we also have a dump from 2022 here: https://github.com/PhilippChr/wikidata-core-for-QA.
If you need the latest data (available), then this might be better. If you want to compare with some method or so, this might be different. The 2022 dump is also a bit larger, and might take more time and RAM to load.
So, does this link https://github.com/PhilippChr/wikidata-core-for-QA have precomputed triples for every question in LC-QuAD 2.0? My primary aim is to fetch as many triples as possible for every question in the Mintaka dataset. The README of the GitHub link you shared says that you form n-triples; what exactly does that mean?
Also, do I need to load the KB every time I run the script? If yes, is there an alternative? I need to fetch triples for every question in Mintaka for train, dev, and test, which would otherwise be very time- and compute-intensive.
No, the link only has the KB.
If your primary goal is to retrieve relevant KB facts for every question in LC-QuAD 2.0, I would suggest simply using our public API. You do not need to take care of setting up CLOCQ at your end, and can directly use the 2022 dump. Further, you can control the number of facts with the parameters k and p, as outlined in the CLOCQ paper.
See example code below:
```python
from clocq.interface.CLOCQInterfaceClient import CLOCQInterfaceClient

clocq = CLOCQInterfaceClient(host="https://clocq.mpi-inf.mpg.de/api", port="443")

question = "who plays Viserys in GRRM's latest HBO series?"

# default parameters
res = clocq.get_search_space(question)
print(f'Retrieved {len(res["search_space"])} facts for the question with default params')

# k=5
res = clocq.get_search_space(question, parameters={"k": 5})
print(f'Retrieved {len(res["search_space"])} facts for the question, with k=5')

# k=10
res = clocq.get_search_space(question, parameters={"k": 10})
print(f'Retrieved {len(res["search_space"])} facts for the question, with k=10')

question = "who is the coach of Barcelona?"

# default parameters
res = clocq.get_search_space(question)
print(f'Retrieved {len(res["search_space"])} facts for the question with default params')

# k=5, p=10
res = clocq.get_search_space(question, parameters={"k": 5, "p_setting": 10})
print(f'Retrieved {len(res["search_space"])} facts for the question, with k=5, p=10')

# k=5, p=10k
res = clocq.get_search_space(question, parameters={"k": 5, "p_setting": 10000})
print(f'Retrieved {len(res["search_space"])} facts for the question, with k=5, p=10000')

# k=10, p=10
res = clocq.get_search_space(question, parameters={"k": 10, "p_setting": 10})
print(f'Retrieved {len(res["search_space"])} facts for the question, with k=10, p=10')

# k=10, p=10k
res = clocq.get_search_space(question, parameters={"k": 10, "p_setting": 10000})
print(f'Retrieved {len(res["search_space"])} facts for the question, with k=10, p=10000')
```
You can load the whole dataset and the KB once, and then process the full data in a single run.
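For that batch setting, a minimal sketch of once-per-dataset retrieval might look as follows. Note that `fetch_search_spaces` and the JSON cache file are illustrative helpers, not part of CLOCQ; only the client import and `get_search_space` call match the example above:

```python
def fetch_search_spaces(questions, client, k=5, p=1000):
    """Query the search space once per question and collect the raw facts."""
    results = {}
    for question in questions:
        res = client.get_search_space(question, parameters={"k": k, "p_setting": p})
        results[question] = res["search_space"]
    return results

# Usage (hits the public API, so shown as comments):
# from clocq.interface.CLOCQInterfaceClient import CLOCQInterfaceClient
# import json
# clocq = CLOCQInterfaceClient(host="https://clocq.mpi-inf.mpg.de/api", port="443")
# results = fetch_search_spaces(mintaka_questions, clocq)
# with open("search_spaces.json", "w") as f:  # cache so each question is queried only once
#     json.dump(results, f)
```

Caching the results to disk this way means the train, dev, and test questions each hit the API exactly once.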
Thanks. I'll try out the above code and reach back to you if I need further help. Thank you for taking the time.
For the above-mentioned code, where can the output be seen? It's not in the results directory.
I am not sure I understand your question. You can just copy the code and access the output via the variables, store it on disk, etc.
E.g. you can iterate through the first 5 KB facts via:
```python
for fact in res["search_space"][:5]:
    print(fact)
```
Hi. So, I was trying to retrieve facts for the question mentioned in the paper:
Q: Who scored in the 2018 final between France and Croatia?
Facts mentioned in the paper: {⟨ 2018 FIFA World Cup Final, goal scored by, Paul Pogba; for team, France football team ⟩, ⟨ 2018 FIFA World Cup Final, goal scored by, Ivan Perisic; for team, Croatia football team ⟩, . . .}
Facts that I am able to generate for different parameters:
```
Retrieved 495 facts for the question with default params
[{'id': 'Q31043671', 'label': '2018 FIFA World Cup Final'}, {'id': 'P31', 'label': 'instance of'}, {'id': 'Q12708896', 'label': 'FIFA World Cup final'}]
[{'id': 'Q31043671', 'label': '2018 FIFA World Cup Final'}, {'id': 'P585', 'label': 'point in time'}, {'id': '"2018-07-15T00:00:00Z"', 'label': '15 July "2018'}]
[{'id': 'Q31043671', 'label': '2018 FIFA World Cup Final'}, {'id': 'P1923', 'label': 'participating team'}, {'id': 'Q43249937', 'label': 'France at the 2018 FIFA World Cup\u200E'}]
[{'id': 'Q31043671', 'label': '2018 FIFA World Cup Final'}, {'id': 'P1363', 'label': 'points/goal scored by'}, {'id': 'Q455462', 'label': 'Antoine Griezmann'}]
[{'id': 'Q31043671', 'label': '2018 FIFA World Cup Final'}, {'id': 'P641', 'label': 'sport'}, {'id': 'Q2736', 'label': 'soccer'}]

Retrieved 2977 facts for the question, with k=5
(first five facts identical to the above)

Retrieved 5730 facts for the question, with k=10
(first five facts identical to the above)
```
The facts are not sorted in the list. In the paper, we just show the relevant results given the limited space. So you would need to process these facts in the remainder of your QA pipeline. Also, note that the examples in the paper are for illustration and motivation purposes. The exact results may be (slightly) different.
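Such downstream processing could, for example, keep only facts whose relation matches what the question asks about. A small sketch, assuming the fact format shown in the output above (a fact is a list of `{'id', 'label'}` items; `filter_facts_by_relation` is a hypothetical helper, not a CLOCQ function):

```python
def filter_facts_by_relation(facts, keyword):
    """Keep only facts whose relation (second item) label contains the keyword."""
    matched = []
    for fact in facts:
        relation = fact[1]  # fact items are [subject, predicate, object, ...]
        if keyword.lower() in relation["label"].lower():
            matched.append(fact)
    return matched

# example with two facts from the output above
facts = [
    [{"id": "Q31043671", "label": "2018 FIFA World Cup Final"},
     {"id": "P1363", "label": "points/goal scored by"},
     {"id": "Q455462", "label": "Antoine Griezmann"}],
    [{"id": "Q31043671", "label": "2018 FIFA World Cup Final"},
     {"id": "P641", "label": "sport"},
     {"id": "Q2736", "label": "soccer"}],
]
print(filter_facts_by_relation(facts, "goal scored by"))  # keeps only the Griezmann fact
```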
One more thing I wanted to ask: I have to retrieve facts for questions from another dataset (Mintaka), but the public API is not returning good triples. Any thoughts?
Hey, any idea?
When I try to load CLOCQ for the new dataset, I run the command:

```shell
nohup bash initialize.sh &
```

The output I get is:

```
[2] 37944
(env) gupta@ltgpu2:~/mintaka/CLOCQ/clocq$ nohup: ignoring input and appending output to 'nohup.out'
```

What does this mean?
You can increase the values of k and p, and see if the answer presence improves.
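To make "answer presence" concrete: one common reading is to check whether a gold answer entity appears anywhere in the retrieved search space. A minimal sketch under that assumption (the sweep loop, the `dataset` variable, and its `"question"`/`"answer_ids"` keys are illustrative, not a fixed Mintaka format):

```python
def answer_presence(search_space, answer_ids):
    """True if any gold answer entity id appears somewhere in the retrieved facts."""
    answer_ids = set(answer_ids)
    return any(item["id"] in answer_ids for fact in search_space for item in fact)

# Sweep over (k, p) settings and measure the fraction of questions whose
# answer is retrieved (shown as comments, since it hits the public API):
# for k, p in [(5, 1000), (10, 1000), (10, 10000)]:
#     hits = 0
#     for instance in dataset:
#         res = clocq.get_search_space(instance["question"],
#                                      parameters={"k": k, "p_setting": p})
#         hits += answer_presence(res["search_space"], instance["answer_ids"])
#     print(k, p, hits / len(dataset))
```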
Regarding the message in the console you shared: this is just the standard output line of the `nohup` command. The log is then written to `nohup.out`.
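To illustrate with a harmless stand-in command (the echoed messages mimic the CLOCQ log; this is a demo, not the real initialization):

```shell
# Stand-in for "nohup bash initialize.sh &": nohup detaches the command from the
# terminal so it survives logout. When launched from a terminal without a redirect,
# nohup appends output to nohup.out by itself; here we redirect explicitly so the
# demo is deterministic.
nohup bash -c 'echo "Dictionaries successfully loaded."; echo "KB loading started."' \
    > nohup.out 2>&1

# Inspect the log; use "tail -f nohup.out" for a live view while the KB loads.
cat nohup.out
```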
I guess this issue can be closed. Please let me know if you face any other problems!
Hi. So, I followed the steps mentioned in the README file to set everything up. But when I run the "Dynamic usage of CLOCQ" section, I get: "Dictionaries successfully loaded. KB loading started."
And it's been stuck there for 2 hours. Does this loading take a lot of time initially, or am I missing something?