Mestway / Scythe

Synthesizing SQL queries from input / output examples

Machine learning: build a dataset using StackOverflow questions, and use StackOverflow answers as a source of hints for `constraint` #4

Open ghost opened 7 years ago

ghost commented 7 years ago

This is a follow-up idea of https://github.com/Mestway/Scythe/issues/3, an even crazier idea.

Some StackOverflow posts contain input / output examples for SQL queries, for example: https://stackoverflow.com/questions/4672523/sql-server-2000-how-do-i-rotate-the-results-of-a-join-in-the-final-results-of

Table A

ID | Name
---+-----
 1 | John
 2 | Jane
 3 | Bob
 4 | Doug

Table B

ID | NameID | Information
---+--------+------------
 1 |    1   | Apples
 2 |    1   | Apples
 3 |    2   | Pears
 4 |    2   | Grapes
 5 |    3   | Kiwi

Desired Result

ID | Name | InformationA | InformationB
---+------+--------------+-------------
 1 | John | Apples       | Apples
 2 | Jane | Pears        | Grapes
 3 | Bob  | Kiwi         | NULL
 4 | Doug | NULL         | NULL

Since StackOverflow provides open data for data mining / machine learning researchers, it wouldn't be too hard to extract all embedded tables from questions with the [sql] tag. The hard part is to decide which table is input, which is output, and which input is related to which output.

If we are able to create such a dataset, then we can try to train a joint model: the machine learning part extracts features from highly ranked answers on StackOverflow, and the Scythe part uses the extracted features as a heuristic to guide the synthesis process.

The dream is that the joint model takes only the input, the output, and a problem description in natural language, queries StackOverflow using the natural language sentences, and tries to generate an answer based on the results. It would be even nicer if it could be integrated into the StackOverflow web UI, though that would require collaboration with the StackOverflow team.

The realistic view is that even the data mining step is very difficult and expensive, since a lot of manual data cleaning and labeling is involved. Not sure if it is possible to bootstrap from a small dataset.

(Note: a LaTeX dataset was created from math.stackexchange.com posts for training Image-to-Markup, but that dataset is much simpler than the SQL case.)

ghost commented 7 years ago

I downloaded the StackOverflow dataset, which contains 36149135 lines; most lines represent a post, so it's about 30M posts.

$ wc Posts.xml

36149135 4422866242 56228456649 Posts.xml

I randomly sampled about 10% of data:

$ gshuf -n 3000000 Posts.xml > shuf3M.txt

Then extracted only questions:

$ grep AcceptedAnswerId shuf3M.txt > questions.txt

$ wc questions.txt

626749 103212011 1320954228 questions.txt

Then extracted only questions with tags matching .*sql.*:

$ grep 'Tags=.*sql.*' questions.txt > sqls.txt

$ wc sqls.txt

51607 9027389 108025350 sqls.txt

Then extracted only questions containing a table, by detecting the character sequence ---+---:

$ grep -- '---+---' sqls.txt > tables.txt

$ wc tables.txt

1286  344902 3909683 tables.txt

That means we are likely to be able to extract at least 10k SQL tables from the whole dataset. However, it's hard to classify input/output and build connections between them; I don't have good ideas for that.
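For reference, here is a minimal Python sketch of the same filtering done in a single pass over the full Posts.xml (no sampling). It assumes the dump's one-`<row .../>`-per-line format; the `AcceptedAnswerId`, `Tags`, and `Body` attributes follow the greps above, and the function and file names are just placeholders.

```python
# One-pass equivalent of the shuf/grep pipeline above (sketch, not exact).
import html
import xml.etree.ElementTree as ET

def sql_posts_with_tables(path="Posts.xml"):
    """Yield (post id, decoded body) for accepted-answer questions tagged *sql*
    whose body contains an ASCII table separator like ---+---."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line.startswith("<row"):
                continue  # skip the XML header and the <posts> wrapper lines
            row = ET.fromstring(line)
            if "AcceptedAnswerId" not in row.attrib:
                continue  # keep only questions that have an accepted answer
            if "sql" not in row.attrib.get("Tags", "").lower():
                continue
            body = html.unescape(row.attrib.get("Body", ""))
            if "---+---" in body:
                yield row.attrib["Id"], body

if __name__ == "__main__":
    print(sum(1 for _ in sql_posts_with_tables()), "candidate posts")
```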

Among those 1286 posts, there are 200+ with FavouriteCount larger than 1, and 70+ with FavouriteCount larger than 2. If we extract SQL tables not only from questions but also from answers, this number would grow by a few times, but it might be even harder to connect input tables with output tables.

ghost commented 7 years ago


~~I decoded tables.txt and uploaded it:~~
~~$ cat tables.txt | recode xml..utf8 > tables.recode.txt~~

The previous decoding was not very accurate; a new version has been uploaded:

[rows.tar.gz](https://github.com/Mestway/Scythe/files/1270084/rows.tar.gz)

Mestway commented 7 years ago

The idea is great. I looked into the Stack Overflow dump a while ago, and I also tried to develop a tool to automatically extract input-output examples together with their correct solutions from posts. I gave up at the time due to the following problems: (1) tables are formatted in many different ways (different row/column delimiters, for example); (2) tables often contain dirty values (e.g., different users may use "NULL", "None", "-", " " to refer to a null value); (3) it is hard to figure out which solution matches the problem (the top answer may solve the problem in multiple steps, and may also demonstrate a wrong solution to help the asker understand the problem better.)

I looked into the data you collected (I scanned tables.recode.txt briefly), and found that problems 1 and 2 mentioned above may not be that bad: we can simply strip out all ill-formatted posts (and the number of remaining well-formatted ones may still be pretty big). Also, I'm wondering how many of these posts have an accepted answer in which only one query is displayed. If the number of remaining posts is still not too small, we can probably start dealing with them and figure out how to handle the fuzzier ones later.
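For problem (2), something like the following normalization might be enough for a first pass; the set of null spellings below is just a guess, not a list taken from the data.

```python
# Rough sketch: map the many user spellings of "no value" to a single Python None.
NULL_SPELLINGS = {"", "null", "none", "nil", "n/a", "-", "(null)"}

def normalize_cell(cell: str):
    value = cell.strip()
    return None if value.lower() in NULL_SPELLINGS else value

def normalize_row(cells):
    return [normalize_cell(c) for c in cells]
```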

ghost commented 7 years ago

> (3) it is hard to figure out which solution matches the problem (the top answer may solve the problem in multiple steps, and may also demonstrate a wrong solution to help the asker understand the problem better.)

If we use a set of potential solutions as the source of constraints, then it doesn't matter that we fail to detect the exact solution. Once Scythe generates a solution, we could use a bot to post it automatically on StackOverflow; if it gets enough upvotes, the solution is likely correct, and if it gets downvotes, the solution is probably wrong. However, StackOverflow might ban the bot for abuse...

Mestway commented 7 years ago

One very interesting question I'm thinking about is whether we can build a fuzzy SQL engine that can execute fuzzy SQL over fuzzy tables extracted from raw Stack Overflow posts (but it sounds pretty hard..)

Anyway, I'm very interested in looking into the dataset later over the weekend. I'm quite curious how much high-quality data we can find to feed to the tool.

ghost commented 7 years ago

alltables.txt

tables_clean.tar.gz

I uploaded a cleaned-up version of the tables; hopefully that helps. If you uncompress tables_clean.tar.gz and sort the files by size, then after removing the very large and very small files, the remaining files seem to contain reasonable data.
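For example, a quick filter along those lines (the directory name and size thresholds are arbitrary and would need tuning by inspection):

```python
# Keep only extracted table files whose size falls in a "reasonable" range.
from pathlib import Path

MIN_BYTES, MAX_BYTES = 100, 20_000  # made-up cut-offs

def reasonable_files(directory="tables_clean"):
    for p in sorted(Path(directory).iterdir(), key=lambda p: p.stat().st_size):
        if MIN_BYTES <= p.stat().st_size <= MAX_BYTES:
            yield p
```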

ghost commented 7 years ago

I came up with an idea to use deep learning to guess which table is input, which table is output, and which one is unrelated:

Firstly, build a training dataset which contains data like below:

(Below is a table of tables, which is confusing, I know)

 table1 | table2 | table3 | table4 | table5 | table6 |
--------+--------+--------+--------+--------+--------+
 Ta1    | Ta2    | Ta3    | <END>  |        |        |
--------+--------+--------+--------+--------+--------+
 Tb1    | Tb2    | Tb3    | Tb4    | <END>  |        |
--------+--------+--------+--------+--------+--------+
 Tc1    | Tc2    | <END>  |        |        |        |
--------+--------+--------+--------+--------+--------+
 Td1    | Td2    | Td3    | <END>  |        |        |

(Below is a table of labels)

 label1 | label2 | label3 | label4 | label5 | label6 |
--------+--------+--------+--------+--------+--------+
 IN     | OUT    | UN     | <END>  |        |        |
--------+--------+--------+--------+--------+--------+
 UN     | IN     | OUT    | OUT    | <END>  |        |
--------+--------+--------+--------+--------+--------+
 IN     | OUT    | <END>  |        |        |        |
--------+--------+--------+--------+--------+--------+
 UN     | UN     | UN     | <END>  |        |        |

Each row of the first table represents a group of tables, and the corresponding row of the second table is its sequence of labels. IN means the corresponding table is an input table, OUT means it is an output table, and UN means it is not related to any other table in the same group.

This dataset can be built by combining real-world positive samples (tables labeled IN or OUT) with fake negative samples (tables labeled UN), where the positive samples are generated from real-world databases and real-world SQL queries, and the negative samples are generated by randomly picking unrelated real-world tables.
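Concretely, the group construction could look roughly like this; the table objects, the source of real (inputs, output) pairs, and the pool of unrelated tables are placeholders, not anything that exists in this repo.

```python
# Sketch: mix one real (input tables, output table) pair with a few random
# unrelated tables, and emit the matching IN/OUT/UN label sequence.
import random

IN, OUT, UN, END = "IN", "OUT", "UN", "<END>"

def make_group(real_pair, unrelated_pool, max_unrelated=3):
    """real_pair = (list of input tables, output table); returns (tables, labels)."""
    inputs, output = real_pair
    tagged = [(t, IN) for t in inputs] + [(output, OUT)]
    k = random.randint(0, min(max_unrelated, len(unrelated_pool)))
    tagged += [(t, UN) for t in random.sample(unrelated_pool, k)]
    random.shuffle(tagged)  # so the model cannot rely on position alone
    tables, labels = zip(*tagged)
    return list(tables), list(labels) + [END]  # <END> closes the label sequence
```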

After preparing such a dataset, we can train a CNN-LSTM seq2seq model. The CNN network takes a table as an image (*) at each step, extracts features from the table, and feeds them to the LSTM network. The LSTM network learns to predict a sequence of labels.

Given 100k training samples, this model might be able to take a group of tables and guess whether each one is IN, OUT, or UN (with enough luck, hehe)

(*) In the simplest case, given an N x M table with a token size of D, it could be seen as an image with N x M pixels and color depth D. However, the data in a cell is usually variable-length; do we want another LSTM to encode each cell into a fixed-length vector?
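As a rough illustration only (not code from this repo, and simplified to a per-table sequence tagger rather than a full encoder-decoder), the shape of such a model in PyTorch might be:

```python
# Sketch: each table is pre-encoded as a D x N x M tensor (token-embedding
# "color" depth D over N rows and M columns), padded to a fixed size; the label
# vocabulary is {IN, OUT, UN, <END>}.
import torch
import torch.nn as nn

class TableTagger(nn.Module):
    def __init__(self, depth=32, feat_dim=128, hidden=256, num_labels=4):
        super().__init__()
        # CNN "reads" one table image per step and produces a feature vector.
        self.cnn = nn.Sequential(
            nn.Conv2d(depth, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # LSTM consumes the sequence of table features for one group of tables.
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_labels)  # IN / OUT / UN / <END>

    def forward(self, tables):
        # tables: (batch, seq_len, depth, N, M) -- one group of table "images"
        b, s, d, n, m = tables.shape
        feats = self.cnn(tables.view(b * s, d, n, m)).view(b, s, -1)
        hidden_states, _ = self.lstm(feats)
        return self.out(hidden_states)  # per-table label logits

# Toy usage: a batch of 2 groups, each with 5 padded 8x6 tables.
logits = TableTagger()(torch.randn(2, 5, 32, 8, 6))
print(logits.shape)  # torch.Size([2, 5, 4])
```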

Related models can be found at: http://lstm.seas.harvard.edu/latex/ and https://github.com/karpathy/neuraltalk2

CC @Wangyida and @da03, since we have some overlap in research topics :) @da03, hopefully this topic is related to your table OCR project ^^

ghost commented 6 years ago

Off topic, but you might be interested in this paper:

https://arxiv.org/abs/1711.06061 An Encoder-Decoder Framework Translating Natural Language to Database Queries