Recreate model with different dataset

SteinerPascal commented 1 year ago

Hi all, first, thanks for your work and for publishing your code on GitHub!

Now, I would like to train a new model with my own dataset. I would like to mimic the Monument dataset with my own data. I had a look at the code and started with the following:

1. Running extract_templates.py, which with a couple of tweaks gave me templates of this type:

{
        "_id": "1",
        "question_example": "which room has a temperature sensor ?",
        "interm_sparql_template": "select var_a where brack_open var_a rdf_type brick_Room sep_dot var_a brick_has_Point brick_Zone_Air_Temperature_Sensor brack_close",
        "pure_sparql_template": "select ?a where {?a rdf:type brick_Room . ?a brick_has_Point brick_Zone_Air_Temperature_Sensor }"
 },

2. Running extract_templates.py with the --step=2 flag. But it complains because I need to transform the question_example into a question_template. Now I'm not sure how to tackle that. I looked at one of your examples:

    {
        "_id": 1,
        "question_template": "is <resource> a <ontology> ?",
        "uri_question_template": "is <resource> a <ontology> ?",
        "interm_sparql_template": "ask where brack_open <resource> rdf_type <ontology> brack_close",
        "uri_interm_sparql_template": "ask where brack_open <resource> rdf_type <ontology> brack_close",
        "pure_sparql_template": "ask where { <resource> rdf:type <ontology> }",
        "question_regex": "is (.*?) a (.*?) \\?$",
        "interm_sparql_regex": "ask where brack_open (.*?) rdf_type (.*?) brack_close$",
        "pure_sparql_regex": "ask where {(.*?) rdf:type (.*?) }$"
    },

Now two questions:

1. Did you annotate them all by hand, or is there a helper script of some sort?
2. What is the meaning of <resource> and <ontology>? Does the naming matter, or could it be something else? Sometimes I see other placeholder names ...
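For readers following along, the relationship between question_template and question_regex in the example above can be sketched in a few lines. This is a minimal illustration, not the repository's script: the helper name template_to_regex is hypothetical, and it assumes every placeholder has the form <name> and should become a lazy capture group. In this sketch the placeholder name itself is ignored; whether the repository's own scripts depend on specific names is not established here.

```python
import re

# Assumption: a placeholder is any <name> token, e.g. <resource> or <ontology>.
PLACEHOLDER = re.compile(r"<[^<>]+>")


def template_to_regex(template: str) -> str:
    """Turn a question template into an end-anchored regex with one group per placeholder."""
    # Split the template into its literal parts, dropping the placeholders.
    literal_parts = PLACEHOLDER.split(template)
    # Escape the literal text (so "?" is matched literally) and put a lazy
    # capture group where each placeholder was.
    return "(.*?)".join(re.escape(part) for part in literal_parts) + "$"


regex = template_to_regex("is <resource> a <ontology> ?")
match = re.match(regex, "is eiffel tower a monument ?")
print(match.groups())
```

The generated pattern may differ textually from the hand-written one (re.escape also escapes spaces), but it captures the same two slots.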

SteinerPascal commented 1 year ago

@rooose & @Lama-West Okay, I managed to do the second step as well (scratch the questions from my previous comment). The output for each template is now the following:

    {
        "_id": "4",
        "question_template": "which <ontology> has a <resource> ?",
        "interm_sparql_template": "select var_a where brack_open var_a rdf_type brick_Room sep_dot var_a brick_has_Point brick_Zone_Temparatur_Sensor brack_close",
        "pure_sparql_template": "select ?a where {?a rdf:type brick_Room . ?a brick_has_Point brick_Zone_Temparatur_Sensor }",
        "question_regex": "which (.*?) has a (.*?) \\?$",
        "interm_sparql_regex": "select var_a where brack_open var_a rdf_type brick_Room sep_dot var_a brick_has_Point brick_Zone_Temparatur_Sensor brack_close$",
        "pure_sparql_regex": "select \\?a where {\\?a rdf:type brick_Room \\. \\?a brick_has_Point brick_Zone_Temparatur_Sensor }$"
    }

But now if I try to run src/monument/build_dataset.py, it complains here: https://github.com/Lama-West/SPARQL_Query_Generation_aacl-ijcnl2022/blob/3b0a0f5ef014ff48c71539ea04b06899e46ea0d1/Data/src/classes/monument_dataset.py#L99 because my dict does not contain the uri_interm_sparql_template key/value. (I also saw that I'm missing uri_question_template.) Did you generate those by hand? I'm not able to find any part of the code that could be doing this. If yes, then I have a question. Looking at your example I can see the following:

And a general question: what is actually needed from the dataset for the models? It looks like, from a dataset entry such as:

{
        "_id": "4526",
        "template_id": 6,
        "question": {
            "question": "where is haji gayib\u2019s bathhouse located in ?",
            "uri_question_only_resources": "where is dbr:Haji_Gayib\u2019s_bathhouse located in ?",
            "uri_question_rest_no_resources": "where is haji gayib\u2019s bathhouse dbo:location ?",
            "uri_question_all": "where is dbr:Haji_Gayib\u2019s_bathhouse dbo:location ?"
        },
        "query": {
            "interm_sparql": "select var_a where brack_open dbr_Haji_Gayib\u2019s_bathhouse dbo_location var_a brack_close",
            "uri_interm_sparql_only_resources": "select var_a where brack_open dbr:Haji_Gayib\u2019s_bathhouse dbo_location var_a brack_close",
            "uri_interm_sparql_rest_no_resources": "select var_a where brack_open dbr_Haji_Gayib\u2019s_bathhouse dbo:location var_a brack_close",
            "uri_interm_sparql_all": "select var_a where brack_open dbr:Haji_Gayib\u2019s_bathhouse dbo:location var_a brack_close",
            "pure_sparql": "select ?a where { dbr:Haji_Gayib\u2019s_bathhouse dbo:location ?a }"
        },
        "set": "train",
        "original_data": {
            "question": "where is haji gayib\u2019s bathhouse located in",
            "interm_sparql": "select var_a where brack_open dbr_Haji_Gayib\u2019s_bathhouse dbo_location var_a brack_close"
        }
    }

only these two keys are used within 1-ConvS2S.ipynb: train_examples = [(entry['original_data']['lcquad']['intermediary_question'].lower().replace('<','').replace('>',''), entry['query']['interm_sparql']

rooose commented 1 year ago

Hi 😄 Basically, the model only needs one question and one query + the template (we generate many formats to run tests, but you really only need one of each). I think it works best if the query is in intermediary SPARQL, but it should work with pure SPARQL too. We had to do some processing of the templates by hand to ensure they are all in the correct format.
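To illustrate what "intermediary SPARQL" means here, the following is a rough sketch inferred only from the examples pasted in this thread (?a becomes var_a, braces become brack_open / brack_close, the triple separator becomes sep_dot, and prefix:name becomes prefix_name). The function name to_intermediary is hypothetical and this is not the repository's converter; it does not handle full IRIs, literals, or other SPARQL syntax.

```python
import re


def to_intermediary(pure_sparql: str) -> str:
    """Rewrite a simple pure SPARQL query into the token format seen in this thread."""
    # Braces become word tokens; pad with spaces so "{?a" splits cleanly.
    s = pure_sparql.replace("{", " brack_open ").replace("}", " brack_close ")
    # Variables: ?a -> var_a
    s = re.sub(r"\?(\w+)", r"var_\1", s)
    # Prefixed names: rdf:type -> rdf_type (assumes no full IRIs are present)
    s = re.sub(r"(\w+):(?=\S)", r"\1_", s)
    # A standalone "." separates triples; normalise whitespace on the way.
    tokens = ["sep_dot" if tok == "." else tok for tok in s.split()]
    return " ".join(tokens)


print(to_intermediary("ask where { <resource> rdf:type <ontology> }"))
```

Applied to the pure_sparql_template strings shown earlier, this reproduces the corresponding interm_sparql_template strings.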

To make it work with your stuff, you need to change the line you found to train_examples = [ (entry['your_question'], entry['your_query']) for entry in dataset ] (and the following lines that set the validation and test examples as well)
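A slightly fuller sketch of that change, for anyone adapting the notebook: it groups (question, query) pairs by the "set" field that appears in the dataset entry quoted above. The helper split_examples and the accessor lambdas are hypothetical, and the split labels other than "train" ("valid", "test") are assumptions; substitute whatever keys and labels your own dataset uses.

```python
from collections import defaultdict


def split_examples(dataset, get_question, get_query):
    """Group (question, query) pairs by each entry's 'set' label."""
    examples = defaultdict(list)
    for entry in dataset:
        examples[entry["set"]].append((get_question(entry), get_query(entry)))
    return examples


# Toy entries shaped like the dataset entry quoted earlier in the thread.
dataset = [
    {"set": "train",
     "question": {"question": "which room has a temperature sensor ?"},
     "query": {"interm_sparql": "select var_a where brack_open var_a rdf_type brick_Room brack_close"}},
    {"set": "test",
     "question": {"question": "is room_1 a room ?"},
     "query": {"interm_sparql": "ask where brack_open room_1 rdf_type brick_Room brack_close"}},
]

examples = split_examples(
    dataset,
    get_question=lambda e: e["question"]["question"].lower(),
    get_query=lambda e: e["query"]["interm_sparql"],
)
train_examples = examples["train"]
valid_examples = examples["valid"]  # assumed label; empty for this toy data
test_examples = examples["test"]    # assumed label
```

Passing accessors instead of hard-coding keys keeps the same helper usable whichever question and query format you settle on.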

SteinerPascal commented 1 year ago

Hi! Thank you for taking the time to answer my questions. Forgive me for being a bit slow-witted but I'm an amateur in ML.

1. I thought the template is only needed for data generation? Is it also used in the model? If so, for what?
2. You said it works best with the "intermediary query". I guess you are then referring to "uri_interm_sparql_all", since the copy vocabulary needs to be built in def abstract_KB_elems(data)?

rooose commented 1 year ago

1. Sorry, you are right, it's been a while since I worked on this. The templates are only used for data generation. We were using them in the tagging step, which is why I got mixed up.
2. Exactly!

SteinerPascal commented 1 year ago

Okay thank you very much!
