dbpedia / GSoC

Google Summer of Code organization

A Multilingual Neural RDF Verbalizer #31

Closed DiegoMoussallem closed 4 years ago

DiegoMoussallem commented 5 years ago

Description:

Natural Language Generation (NLG) is the process of generating coherent natural language text from non-linguistic data (Reiter and Dale, 2000). Despite community agreement on the actual text and speech output of these systems, there is far less consensus on what the input should be (Gatt and Krahmer, 2017). A wide range of inputs has been used for NLG systems, including images (Xu et al., 2015), numeric data (Gkatzia et al., 2014), semantic representations (Theune et al., 2001) and Semantic Web (SW) data (Ngonga Ngomo et al., 2013; Bouayad-Agha et al., 2014). Presently, the generation of natural language from SW data, more precisely from RDF data, has gained substantial attention (Bouayad-Agha et al., 2014; Staykova, 2014). Challenges have been proposed to investigate the quality of automatically generated texts from RDF (Colin et al., 2016). Moreover, RDF has demonstrated a promising ability to support the creation of NLG benchmarks (Gardent et al., 2017). However, English is the only language that has been widely targeted. Even though there are studies which explore the generation of content in languages other than English, to the best of our knowledge, no work has been proposed to train a multilingual neural model for generating texts in different languages from RDF data.

Goals:

In this GSoC project, the candidate is expected to train a multilingual neural model capable of generating natural language sentences from DBpedia RDF triples.
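
For illustration, a minimal sketch of what the model's input could look like, assuming a simple delimiter-based linearization of the triple (the delimiter tokens are an assumption here, not a fixed format):

```python
# Minimal sketch: flatten a DBpedia triple into a source sequence for a
# seq2seq model. The <S>/<P>/<O> delimiters are illustrative assumptions.

def linearize_triple(subj, pred, obj):
    """Turn an <s, p, o> triple into a flat token sequence."""
    return f"<S> {subj} <P> {pred} <O> {obj}"

print(linearize_triple("Albert_Einstein", "birthPlace", "Ulm"))
# <S> Albert_Einstein <P> birthPlace <O> Ulm
```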

Impact:

The project may allow users to automatically generate short summaries from triples for entities that do not have a human-written abstract.

Warm-up tasks:

Mentors

Diego Moussallem

Keywords

NLG, Semantic Web, NLP

SilentFlame commented 5 years ago

This problem statement really interests me. I'll look through the papers mentioned and get back with queries if I get stuck.

DwaraknathT commented 5 years ago

This problem is similar to what I am working on right now, and it's very interesting. I have created a repo and will upload summaries of the papers mentioned here along with my own ideas for improvements; please go through them and give feedback. Also, porting the NeuralREG code to PyTorch might be really helpful, and pytext is a boon for data preprocessing. https://github.com/DwaraknathT/NLG-
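
As context for such a port, a minimal PyTorch encoder-decoder skeleton of the kind a port could start from; this is only an illustrative sketch, not the NeuralREG architecture itself:

```python
# Rough PyTorch skeleton of a seq2seq encoder-decoder; hyperparameters and
# structure are placeholders, NOT the NeuralREG model.
import torch
import torch.nn as nn


class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len) token ids of the linearized triples
        _, hidden = self.rnn(self.embed(src))
        return hidden  # (1, batch, hid_dim)


class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tgt, hidden):
        # tgt: (batch, tgt_len) token ids of the target sentence
        output, hidden = self.rnn(self.embed(tgt), hidden)
        return self.out(output), hidden


# Smoke test with random token ids.
enc, dec = Encoder(vocab_size=1000), Decoder(vocab_size=1000)
src = torch.randint(0, 1000, (2, 7))
tgt = torch.randint(0, 1000, (2, 5))
logits, _ = dec(tgt, enc(src))
print(logits.shape)  # torch.Size([2, 5, 1000])
```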

DiegoMoussallem commented 5 years ago

@DwaraknathT I have been following your summaries; let me know if you need any help. We can have a talk as well. Looking forward to the next summaries.

DiegoMoussallem commented 5 years ago

@SilentFlame How about you? How is it going with the papers?

DwaraknathT commented 5 years ago

@DiegoMoussallem Thank you, I appreciate any comments or suggestions you have for me. Right now, I'm trying to reimplement the NeuralREG code but with Transformers, to see how much of a gain we might be able to get. The fundamental algorithm is the same, but it was just a hunch and a way to understand the code properly. May I contact you on Slack for further discussions?
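
As a rough illustration of what that swap could look like (placeholders only, not the actual reimplementation), PyTorch's built-in `nn.Transformer` can stand in for the recurrent encoder-decoder in the skeleton above:

```python
# Illustrative sketch: a Transformer encoder-decoder over embedded token ids.
# All sizes are placeholders; this is not the NeuralREG reimplementation.
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 256
embed = nn.Embedding(vocab_size, d_model)
model = nn.Transformer(
    d_model=d_model, nhead=8,
    num_encoder_layers=2, num_decoder_layers=2,
    batch_first=True,
)
generator = nn.Linear(d_model, vocab_size)

src = torch.randint(0, vocab_size, (2, 7))   # linearized triples
tgt = torch.randint(0, vocab_size, (2, 5))   # target sentence prefix
tgt_mask = model.generate_square_subsequent_mask(tgt.size(1))  # causal mask

out = model(embed(src), embed(tgt), tgt_mask=tgt_mask)
logits = generator(out)
print(logits.shape)  # torch.Size([2, 5, 1000])
```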

DiegoMoussallem commented 5 years ago

Of course, you may contact me. I got your point about checking with the Transformer, but it might not be so necessary at the moment.

aditya-malte commented 5 years ago

I find this problem interesting. We could improve the state of the art by implementing a Transformer architecture (Vaswani et al.) for these same papers.

DiegoMoussallem commented 5 years ago

Hi @aditya-malte It sounds good. Could you send me a message with the details?

ovshake commented 5 years ago

Hi @DiegoMoussallem! I feel this problem statement aligns well with my current research interests and would love to work on it. I agree with @aditya-malte that a Transformer architecture should show an improvement, but I feel the extra computational complexity it introduces could be compensated for by efficient attention models such as the lightweight and dynamic convolutional attention mentioned here. Quasi-RNNs could also be swapped in for vanilla LSTMs. This would allow us to generate even longer and more descriptive sentences. I would love to work on this if it is not already taken.

DiegoMoussallem commented 5 years ago

@ovshake Thanks for being interested in this project. Your plan sounds good, I am looking forward to seeing your proposal. Feel free to contact me.

ovshake commented 5 years ago

@DiegoMoussallem Does "multilingual" mean that the model should be able to generate the referring expression in a language other than the one in which the RDF is given, or that the model should be able to handle any language as the input RDF (other than the language it was trained on) and give the referring expression in that same language?

DiegoMoussallem commented 5 years ago

Hi @ovshake, "multilingual" means being able to generate a natural language sentence in multiple languages from a given RDF input <s,p,o>. It is not only about referring expression generation; it must also include a complete verbalization, i.e., with articles, verbs, etc.
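
To make the expected behaviour concrete, a small sketch of one triple paired with full sentences in several languages, using a target-language tag on the source side (a common trick in multilingual NMT; the tags and example sentences are assumptions for illustration only):

```python
# Illustrative only: one <s, p, o> input, complete verbalizations in
# several languages, selected via a language tag prepended to the source.
triple = "<S> Albert_Einstein <P> birthPlace <O> Ulm"

references = {
    "en": "Albert Einstein was born in Ulm.",
    "de": "Albert Einstein wurde in Ulm geboren.",
    "pt": "Albert Einstein nasceu em Ulm.",
}

for lang, sentence in references.items():
    source = f"<2{lang}> {triple}"  # the tag tells the model which language to generate
    print(f"{source}\t{sentence}")
```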

ovshake commented 5 years ago

Hi @DiegoMoussallem, I have some questions regarding my proposal. Which platform should I use to reach out to you; which is most convenient for you?

DiegoMoussallem commented 5 years ago

Hey @ovshake, feel free to contact me on Skype: diegomoussallem

ovshake commented 5 years ago

I have messaged you on Skype.

ovshake commented 5 years ago

@DiegoMoussallem I have shared my proposal on the GSoC platform. Do review it at your convenience.