deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) into pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search, or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Can we use the pre-trained model on any other dataset for SPARQL query generation? #1194

Closed · Subhashree-Tripathy closed 3 years ago

Subhashree-Tripathy commented 3 years ago

Question: In Question Answering on Knowledge Graphs, can we use the pre-trained model on any other dataset for SPARQL query generation?

julian-risch commented 3 years ago

Hi @Subhashree-Tripathy, the pre-trained model can only answer questions about resources it has seen during training. Otherwise, it cannot translate the name of a resource into the identifier used in the knowledge graph. For example, it can translate "Harry" to "hp:Harry_potter" only because we trained it to do so. We don't support training a custom model for Text2SPARQL in Haystack, but the documentation gives some hints on how to do it: https://haystack.deepset.ai/docs/latest/knowledgegraphmd#Trying-Question-Answering-on-Knowledge-Graphs-with-Custom-Data
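For context, using the pre-trained model looked roughly like this in the tutorial-era Haystack API (a minimal sketch; the import paths, index name, and local model path are assumptions based on the Knowledge Graph QA tutorial and changed across Haystack versions):

```python
# Minimal sketch of Knowledge Graph QA with the pre-trained Text2SPARQL model.
# Import paths follow the Haystack 1.x tutorial and may differ in your version.
from haystack.document_stores import GraphDBKnowledgeGraph
from haystack.nodes import Text2SparqlRetriever

# Connect to a running GraphDB instance that holds the Harry Potter triples.
kg = GraphDBKnowledgeGraph(index="tutorial_10_index")

# Load the pre-trained seq2seq model. It can only produce identifiers for
# resources it learned to encode during training.
retriever = Text2SparqlRetriever(
    knowledge_graph=kg,
    model_name_or_path="saved_models/hp_v3.4",  # assumed local path from the tutorial
)

# The model translates the question into SPARQL and runs it against the graph.
result = retriever.retrieve(query="In which house is Harry Potter?")
print(result)
```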

julian-risch commented 3 years ago

@Subhashree-Tripathy Do you have any other questions regarding Question Answering on Knowledge Graphs? If not, I will close this issue. Feel free to open a new one if anything else comes up.

ilseva commented 2 years ago

Hi @julian-risch, I'm reopening this issue to ask for some clarification about the pre-trained model you used in the example. I looked at the model and at the seq2seq example for summarization with BART in transformers, but I didn't understand the process you followed to create your model. Could you give me more details about it? For example, the Hugging Face page mentions six files:

train.source
train.target
val.source
val.target
test.source
test.target

but I didn't find them in your model. Thanks, Sevastian

julian-risch commented 2 years ago

Hi @ilseva, we didn't publish these files. To train your own seq2seq model, you would need to create training, validation, and test data yourself, with data from your domain and with identifiers as in your own knowledge graph. Essentially, the format of a training sample is the input sequence you give to the model (a natural language question) and the output sequence you would like to get from the model (a SPARQL query). So the input could be "What is the actor's name who portrays the main character in the Harry Potter movies?" and the expected output could be "SELECT ?obj WHERE { wd:Q3244512 p:P2868 ?s . ?s ps:P2868 ?obj . ?s pq:P642 wd:Q8337 }" in SPARQL format with identifiers from Wikidata.

However, we wanted to allow the model to come up with identifiers (e.g., wd:Q8337) without needing to memorize them from the training data. To this end, we made the identifiers encode the name of the entity. The model can then come up with an identifier regardless of whether it has seen that exact identifier at training time (and, as a downside, regardless of whether it exists in your knowledge graph). An input with our identifiers could be "In which house is Nicola Dodworth?" and the expected output would be "SELECT ?uri WHERE { <https://deepset.ai/harry_potter/Nicola_dodworth> <https://deepset.ai/harry_potter/house> ?uri }".

Each such pair of a natural language question and the corresponding SPARQL query with the identifiers from your knowledge graph forms one training/validation/test sample.
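To make the file format concrete, here is a hedged sketch of how such pairs could be written into the six files the transformers seq2seq fine-tuning scripts expect (the second question/query pair and the split assignment are purely illustrative, not from the actual training set):

```python
# Sketch: write aligned (question, SPARQL query) pairs into the six files
# used by the transformers seq2seq fine-tuning scripts: one question per
# line in *.source and the corresponding query on the same line in *.target.
pairs = [
    (
        "In which house is Nicola Dodworth?",
        "SELECT ?uri WHERE { <https://deepset.ai/harry_potter/Nicola_dodworth> "
        "<https://deepset.ai/harry_potter/house> ?uri }",
    ),
    (  # hypothetical second pair, for illustration only
        "Who is the head of Gryffindor?",
        "SELECT ?uri WHERE { <https://deepset.ai/harry_potter/Gryffindor> "
        "<https://deepset.ai/harry_potter/head> ?uri }",
    ),
]

def write_split(name: str, samples: list[tuple[str, str]]) -> None:
    # Line i of {name}.source is the input; line i of {name}.target is the output.
    with open(f"{name}.source", "w") as src, open(f"{name}.target", "w") as tgt:
        for question, query in samples:
            src.write(question + "\n")
            tgt.write(query + "\n")

# In practice, use disjoint subsets of your data for each split.
write_split("train", pairs)
write_split("val", pairs)
write_split("test", pairs)
```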

ilseva commented 2 years ago

Thanks @julian-risch for the hints. We will work on our data!