awslabs / gap-text2sql

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training
https://arxiv.org/abs/2012.10309
Apache License 2.0

Where is the pre-training data stored? What is the format of the input data? #17

Open DHms2020 opened 3 years ago

DHms2020 commented 3 years ago

In the relogic folder, for tabart-pretraining.py, I couldn't find a specific config file like the xx.jsonnet files in the rat-sql part. Are the paths of all input data and configuration files specified by the user? How can I find out more about the input data? I have a project in which I want to use the GAP method to train on a Chinese dataset, so it's important for me to know the format of the original pre-training input data. I would be grateful if anyone could tell me. Thanks.

Impavidity commented 3 years ago

For the column prediction and recovery, here is one data example.

{"entities": {"Award": [], "Designer": [], "Publisher": ["'Ravensburger'"]}, "control_code": [], "question": "What are the award and designer for the books whose publisher is not \"Ravensburger\"?", "table_info": {"caption": ["Spiel des Jahres", "2008 awards", "Game Of The Year"], "header": ["Game", "Designer", "Publisher", "Award"], "table": [["Stone Age", "Michael Tummelhofer", "Hans im Gl\u00fcck", "Nominee"], ["Keltis", "Reiner Knizia", "Kosmos", "Winner"], ["Witch's Brew", "Andreas Pelikan", "alea / Ravensburger", "Nominee"], ["Blox", "Wolfgang Kramer , J\u00fcrgen P.K. Grunau , Hans Raggan", "Ravensburger", "Nominee"], ["Suleika", "Dominique Ehrhard", "Zoch Spiele", "Nominee"]], "_id": "29364-12", "column_type": ["text", "text", "text", "text"], "table_name": "Game"}, "with_value_entity": ["Publisher"], "entity_to_value": {"Game": ["Stone Age", "Keltis", "Witch's Brew", "Blox", "Suleika"], "Designer": ["Michael Tummelhofer", "Reiner Knizia", "Andreas Pelikan", "Dominique Ehrhard"], "Publisher": ["Hans im Gl\u00fcck", "Kosmos", "alea / Ravensburger", "Ravensburger", "Zoch Spiele"], "Award": ["Nominee", "Winner"]}}
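
For anyone preparing their own data in this shape, here is a rough sketch (not taken from the repository) of a sanity check on one such record. The helper names check_record, columns_with_values and to_json_line are hypothetical, and the checks only encode what can be seen in this single example: column_type has one entry per header column, every row has one cell per header column, the keys of entities are header columns, and with_value_entity lists the entities whose value list is non-empty. Writing one JSON object per line is also an assumption; the on-disk layout is not stated here.

```python
import json

# The record from the example above, written as a Python dict.
record = {
    "entities": {"Award": [], "Designer": [], "Publisher": ["'Ravensburger'"]},
    "control_code": [],
    "question": 'What are the award and designer for the books whose publisher is not "Ravensburger"?',
    "table_info": {
        "caption": ["Spiel des Jahres", "2008 awards", "Game Of The Year"],
        "header": ["Game", "Designer", "Publisher", "Award"],
        "table": [
            ["Stone Age", "Michael Tummelhofer", "Hans im Glück", "Nominee"],
            ["Keltis", "Reiner Knizia", "Kosmos", "Winner"],
            ["Witch's Brew", "Andreas Pelikan", "alea / Ravensburger", "Nominee"],
            ["Blox", "Wolfgang Kramer , Jürgen P.K. Grunau , Hans Raggan", "Ravensburger", "Nominee"],
            ["Suleika", "Dominique Ehrhard", "Zoch Spiele", "Nominee"],
        ],
        "_id": "29364-12",
        "column_type": ["text", "text", "text", "text"],
        "table_name": "Game",
    },
    "with_value_entity": ["Publisher"],
    "entity_to_value": {
        "Game": ["Stone Age", "Keltis", "Witch's Brew", "Blox", "Suleika"],
        "Designer": ["Michael Tummelhofer", "Reiner Knizia", "Andreas Pelikan", "Dominique Ehrhard"],
        "Publisher": ["Hans im Glück", "Kosmos", "alea / Ravensburger", "Ravensburger", "Zoch Spiele"],
        "Award": ["Nominee", "Winner"],
    },
}


def check_record(rec):
    """Return a list of structural problems in one record (empty if none).

    Hypothetical helper: the rules are inferred from the single example
    in this thread, not from the training code.
    """
    problems = []
    info = rec["table_info"]
    header = info["header"]
    if len(info["column_type"]) != len(header):
        problems.append("column_type length differs from header length")
    for i, row in enumerate(info["table"]):
        if len(row) != len(header):
            problems.append(f"row {i} has {len(row)} cells, expected {len(header)}")
    for column in rec["entities"]:
        if column not in header:
            problems.append(f"entity column {column!r} not in header")
    return problems


def columns_with_values(rec):
    """Entities that carry at least one value mentioned in the question."""
    return sorted(col for col, values in rec["entities"].items() if values)


def to_json_line(rec):
    """Serialize one record as a single line, keeping non-ASCII text readable."""
    return json.dumps(rec, ensure_ascii=False)


print(check_record(record))
print(columns_with_values(record))
print(to_json_line(record)[:80])
```

For a non-English dataset such as Chinese, ensure_ascii=False keeps the text human-readable in the file; json.loads reads the record back identically either way.
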
DHms2020 commented 3 years ago

@Impavidity It's very helpful, thank you very much! Besides that, could you provide one data example for the SQL generation task? In your updated code, in the class "QuerySchema2SQLDataset", I can hardly tell the difference between <example["extra"]> and <example["negative"]> in line 48. Looking forward to your reply. Thanks again!

shivashankarrs commented 3 years ago

@Impavidity

Is it possible to share the pre-training data separately (or is it already shared somewhere, in case I missed it)?

Thanks