dbpedia / GSoC

Google Summer of Code organization

A Neural QA Model for DBpedia #1

Closed mgns closed 5 years ago

mgns commented 6 years ago

Description

In recent years, the Linked Data Cloud has grown to over 100 billion facts spanning a multitude of domains. The DBpedia knowledge base alone describes 4.58 million things. However, accessing this information is challenging for lay users, who cannot use SPARQL as a query language without extensive training. Recently, Deep Learning architectures based on Neural Networks, called seq2seq, have been shown to achieve state-of-the-art results at translating sequences into sequences. In this direction, we suggest a GSoC topic around Neural Networks that translate any natural language expression into a sentence encoding a SPARQL query. Our preliminary work on Question Answering with Neural SPARQL Machines (NSpM) shows promising results, but it is restricted to selected DBpedia classes. In this GSoC project, the candidate will extend NSpM to cover more classes of DBpedia and to enable high-quality Question Answering. The source code can be found here; however, we will use this repository as a workspace.
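To illustrate what "sentences encoding SPARQL queries" means, here is a minimal sketch of the idea: the query is flattened into a word sequence that a seq2seq decoder can emit token by token. The specific token names and rewrite rules below (brack_open, var_x, ...) are illustrative assumptions, not NSpM's actual vocabulary.

```python
import re

def encode(sparql):
    # Rewrite structural symbols and prefixed names as plain word tokens,
    # so the query becomes an ordinary token sequence for a seq2seq model.
    sparql = sparql.replace("{", "brack_open").replace("}", "brack_close")
    sparql = re.sub(r"\?(\w+)", r"var_\1", sparql)  # ?x -> var_x
    sparql = sparql.replace("dbo:", "dbo_").replace("dbr:", "dbr_")
    return sparql

print(encode("select ?x where { dbr:Edward_VII_Monument dbo:location ?x }"))
# -> select var_x where brack_open dbr_Edward_VII_Monument dbo_location var_x brack_close
```

Decoding simply inverts the same rewrites, turning the generated sequence back into an executable SPARQL query.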

Goals

mjohenneken commented 6 years ago

Sounds interesting! I will have a look at this after my exams.

abhinavralhan commented 6 years ago

@mgns Hi. This sounds very interesting to me. I have a few questions, though.

RicardoUsbeck commented 6 years ago

You are welcome to ask them here.

abhinavralhan commented 6 years ago

@RicardoUsbeck Yeah, so I was attempting the warm-up tasks. Can you please explain how I should approach the second task?

RicardoUsbeck commented 6 years ago

@mommi84 is the better person to explain that :)

mommi84 commented 6 years ago

@abhinavralhan Hi! I added a link in the project description.

abhinavralhan commented 6 years ago

@mommi84 I have a small problem with the last inference step. I got past most issues, but this one I cannot figure out. I ran the command below, but for some reason the vocab file is not being generated. I'm guessing it has to do with something in the ask.sh file.

sudo sh ask.sh data/monument_300_model "where is edward vii monument located in?"

    Traceback (most recent call last):
      File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
        "__main__", fname, loader, pkg_name)
      File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
        exec code in run_globals
      File "/home/abhinavralhan/Desktop/os/NSpM/nmt/nmt/nmt.py", line 495, in <module>
        tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 124, in run
        _sys.exit(main(argv))
      File "/home/abhinavralhan/Desktop/os/NSpM/nmt/nmt/nmt.py", line 488, in main
        run_main(FLAGS, default_hparams, train_fn, inference_fn)
      File "/home/abhinavralhan/Desktop/os/NSpM/nmt/nmt/nmt.py", line 452, in run_main
        hparams = create_or_load_hparams(out_dir, default_hparams, flags.hparams_path)
      File "/home/abhinavralhan/Desktop/os/NSpM/nmt/nmt/nmt.py", line 418, in create_or_load_hparams
        hparams = extend_hparams(hparams)
      File "/home/abhinavralhan/Desktop/os/NSpM/nmt/nmt/nmt.py", line 350, in extend_hparams
        unk=vocab_utils.UNK)
      File "nmt/utils/vocab_utils.py", line 66, in check_vocab
        raise ValueError("vocab_file does not exist.")
    ValueError: vocab_file does not exist.
    train_prefix=None
    dev_prefix=None
    test_prefix=None
    out_dir=../data/monument_300_model_model

ANSWER IN SPARQL SEQUENCE: cat: output.txt: No such file or directory

gyanesh-m commented 6 years ago

@abhinavralhan Hi, I faced a similar issue. The problem is due to an incorrect data directory being passed to ask.sh. Currently it is data/monument_300_model; you need to change the input directory for ask.sh to data/monument_300.
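For context, the traceback above ends in a simple existence check in nmt/utils/vocab_utils.py: the model output directory holds checkpoints rather than vocabulary files, so inference aborts before it starts. A simplified sketch of that check (the vocab file name used here is an assumption for illustration):

```python
import os

def check_vocab(vocab_file):
    # Simplified from check_vocab in nmt/utils/vocab_utils.py: a missing
    # vocabulary file aborts the run with the error seen in the traceback.
    if not os.path.exists(vocab_file):
        raise ValueError("vocab_file does not exist.")
    return vocab_file

# The model directory has no vocab files, hence the error; the data
# directory (data/monument_300) is where they actually live.
try:
    check_vocab("data/monument_300_model/vocab.en")  # hypothetical path
except ValueError as err:
    print(err)
```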

mommi84 commented 6 years ago

Thank you @abhinavralhan for sharing the bug and thanks @gyanesh-m for fixing it.

gyanesh-m commented 6 years ago

@mommi84 Hi, I have been trying to build some queries for the dbo:EducationalInstitution class, and I have a doubt about the query templates. In all the examples in annotations_monument.csv, the initial statement

    ?a a dbo:Monument 

is never used. Is that because the domain for <A> is already specified in the first column? Also, if I use the initial statement in the query template for my example class, will that be fine?

pijusch commented 6 years ago

Hey @gyanesh-m, no, you don't need to add that explicitly. prepare_generator_query adds the "initial statement" to the generator queries.

gyanesh-m commented 6 years ago

@piyush96chawla Oh, thanks.

mommi84 commented 6 years ago

Added an example of a successful project proposal. Once ready, please invite my username at gmail dot com to your proposal document.

mommi84 commented 6 years ago

The GSoC 2018 student applications are officially open! Please elaborate your proposal in a Google Doc. When you're done, share it with my username at gmail.com so I can also invite the other mentors. Deadline: March 27.

srtarun commented 6 years ago

Interesting project! I have started working on it already.

mommi84 commented 6 years ago

Only 6 days to go!

Please share your document with us now if you would like feedback from the mentors before the final submission to the GSoC console.

amanmehta-maniac commented 6 years ago

@mommi84 I have completed my warm-up task by training the NSpM model on the class dbo:Garden. I have experience in QA and deep learning, and I really appreciate and like the idea of a Neural QA model. What do you suggest I do next to get ready to write a proposal? Sorry for such a late request for help.

mommi84 commented 6 years ago

@amanmehta-maniac Great. Please elaborate your view of the project and your findings in a Google document and share it with me at gmail.com. Use this one as a reference for a good proposal. Send it out by Sunday the 25th at the latest, so we can give you some feedback.

amanmehta-maniac commented 6 years ago

@mommi84, there are around 770 DBpedia classes in total, so what would be a realistic X, where X is the number of classes I would train this NN on during the GSoC tenure? Another concern: will I get access to a server? This project involves training a NN, which is highly CPU-intensive and takes on the order of hours even on a machine with a 64-core CPU and 512 GB of RAM.

mommi84 commented 6 years ago

@amanmehta-maniac We expect to deliver one final QA model, not one model per class. Moreover, these 770 classes are organized in a taxonomy. We will likely be able to grant the student access to our servers, subject to availability, for a limited time (up to 14 days in a row). Of course, additional hardware provided by the student's institution is welcome.