Grand Challenge - GitHub issues

k0105 commented 8 years ago

Hi,

I just wanted to let you know that I plan to hold a competition between human contestants and "my" Yoda version in mid to late April, as a grand finale of my contributions so far. Thanks to some tricks, I currently get about twice as many correct (top-1) answers in one third of the time compared to the default configuration, so I'm cautiously optimistic that the system could win this challenge. Unless some higher power prevents it, this will take place.

It might not be against Jeopardy grand champions or covered on live TV, but if Yoda with some additions can win against people running around in a university, that would be a great milestone imho. I'm currently somewhere between fear and excitement and will keep you posted about the results.

Best wishes, Joe

pasky commented 8 years ago

Hi! That sounds really awesome! Please do keep us posted about this. :-) I'm also looking forward to learning more about your tricks as well as the details of your configuration. How many human contestants will you have? I assume they will be amateurs in the domain?

k0105 commented 8 years ago

Btw: Are any new releases planned for the near future (~6 weeks)? I currently use 1.4 and am wondering whether I should upgrade to 1.5 or wait for the next version.

pasky commented 8 years ago

I'm planning to tag the current master as 1.6 whenever I have a moment to consolidate the wiki benchmarks list etc. It's retrained on a much higher-quality dataset of questions than before; otoh, if you are retraining the models with your own datasets, I don't think there are many other improvements.

(We are focusing on developing some neural language models right now, but we already started integrating them back into YodaQA a few days ago.)

k0105 commented 8 years ago

Neural models sound great, looking forward to that.

k0105 commented 8 years ago

Grand Challenge is done. The questions were written by a colleague of mine (who is not involved in my team or project), so I couldn't influence them in any way.

Most normal people could only answer around 15 questions correctly, but one particularly strong candidate managed to get 24 right. Afterwards, I ran my system on the same questions. Yoda by itself answered 13 correctly; the complete system got 24, just like the best human contender.

So the result of my Grand Challenge of man vs. machine is: it is currently a draw. We can win against average people, but we are "only" on par with the best.

jspalink commented 8 years ago

@jbauer180266 - I'm wondering if you have results published somewhere? I'd love to take a look!

k0105 commented 8 years ago

Not yet, soon. I'll let you know.

k0105 commented 8 years ago

We have superhuman performance after all: I didn't activate the Bing backend when I did my tests, but with it, exactly one additional question is answered correctly, which puts the system one ahead of the best human: 25 out of 30. Very nice.

k0105 commented 8 years ago

Update: So far, 16 people have taken the test and the result is the same: best human 24, system 25. Bad news: This time I don't have any kids around for the evaluation, so I can't say anything about humans under 18. Good news: All other age groups are covered, there is a fair number of women (slightly under 50%), and almost all educational statuses and all English proficiency levels except for "none" are covered, so external validity is decent. Under the assumption that older people know more and that PhD-level subjects are more "dangerous" to the system, this works in our favor - you could say internal validity is increased, I guess.

pasky commented 8 years ago

Congratulations! To test pure stock YodaQA on the challenge, I have created a small JSON dataset and dusted off data/eval/rest-eval.py.

[
{"qId": "gch000000", "qText": "What is the capital of Zimbabwe?", "answers": ["Harare"]},
{"qId": "gch000001", "qText": "Who invented the Otto engine?", "answers": ["Nikolaus Otto"]},
{"qId": "gch000002", "qText": "When was Pablo Picasso born?", "answers": ["1881"]},
{"qId": "gch000003", "qText": "What is 7*158 + 72 - 72 + 9?", "answers": ["1115"]},
{"qId": "gch000004", "qText": "Who wrote the novel The Light Fantastic?", "answers": ["Terry Pratchett"]},
{"qId": "gch000005", "qText": "In which city was Woody Allen born?", "answers": ["New York"]},
{"qId": "gch000006", "qText": "Who is the current prime minister of Italy?", "answers": ["Matteo Renzi"]},
{"qId": "gch000007", "qText": "What is the equatorial radius of Earth's moon?", "answers": ["1738"]},
{"qId": "gch000008", "qText": "When did the Soviet Union dissolve?", "answers": ["1991"]},
{"qId": "gch000009", "qText": "What is the core body temperature of a human?", "answers": ["37", "98.6"]},
{"qId": "gch000010", "qText": "Who is the current Dalai Lama?", "answers": ["Tenzin Gyatso"]},
{"qId": "gch000011", "qText": "What is 2^23?", "answers": ["8388608"]},
{"qId": "gch000012", "qText": "Who is the creator of Star Trek?", "answers": ["Gene Roddenberry"]},
{"qId": "gch000013", "qText": "In which city is the Eiffel Tower?", "answers": ["Paris"]},
{"qId": "gch000014", "qText": "12 metric tonnes in kilograms?", "answers": ["12 *000"]},
{"qId": "gch000015", "qText": "Where is the mouth of the river Rhine?", "answers": ["the Netherlands"]},
{"qId": "gch000016", "qText": "Where is Buckingham Palace located?", "answers": ["London"]},
{"qId": "gch000017", "qText": "Who directed the movie The Green Mile?", "answers": ["Frank Darabont"]},
{"qId": "gch000018", "qText": "When did Franklin D. Roosevelt die?", "answers": ["1945"]},
{"qId": "gch000019", "qText": "Who was the first man in space?", "answers": ["Yuri Gagarin"]},
{"qId": "gch000020", "qText": "Where was the Peace of Westphalia signed?", "answers": ["Osnabrück", "Münster", "Westphalia"]},
{"qId": "gch000021", "qText": "Who was the first woman to be awarded a Nobel Prize?", "answers": ["Marie Curie"]},
{"qId": "gch000022", "qText": "12.1147 inches to yards?", "answers": ["0.3365194444"]},
{"qId": "gch000023", "qText": "What is the atomic number of potassium?", "answers": ["19"]},
{"qId": "gch000024", "qText": "Where is the Tiananmen Square?", "answers": ["China"]},
{"qId": "gch000025", "qText": "What is the binomial name of horseradish?", "answers": ["Armoracia Rusticana"]},
{"qId": "gch000026", "qText": "How long did Albert Einstein live?", "answers": ["76"]},
{"qId": "gch000027", "qText": "Who earned the most Academy Awards?", "answers": ["Walt Disney", "Katharine Hepburn"]},
{"qId": "gch000028", "qText": "How many lines does the London Underground have?", "answers": ["11"]},
{"qId": "gch000029", "qText": "When is the next planned German Federal Convention?", "answers": []}
]

$ data/eval/rest-eval.py data/eval/gch.json http://qa.ailao.eu:4567/
ID              Question Text                                           indicator       correct answer  found           URL
gch000000       What is the capital of Zimbabwe?                        correct         Harare          Harare          http://qa.ailao.eu:4567//q/1607764502
gch000001       Who invented the Otto engine?                           correct         Nikolaus Otto   Nikolaus Otto   http://qa.ailao.eu:4567//q/759499198
gch000002       When was Pablo Picasso born?                            correct         1881            1881            http://qa.ailao.eu:4567//q/615363092
gch000003       What is 7*158 + 72 - 72 + 9?                            incorrect       1115            78.182.71.65 78 http://qa.ailao.eu:4567//q/1320706932
gch000004       Who wrote the novel The Light Fantastic?                correct         Terry Pratchett Terry Pratchett http://qa.ailao.eu:4567//q/554560810
gch000005       In which city was Woody Allen born?                     correct         New York        New York        http://qa.ailao.eu:4567//q/2059328554
gch000006       Who is the current prime minister of Italy?             correct         Matteo Renzi    Matteo Renzi    http://qa.ailao.eu:4567//q/958822255
gch000007       What is the equatorial radius of Earth's moon?          incorrect       1738            the Moon and Su http://qa.ailao.eu:4567//q/1033514544
gch000008       When did the Soviet Union dissolve?                     correct         1991            1991            http://qa.ailao.eu:4567//q/913856166
gch000009       What is the core body temperature of a human?           incorrect       37              Bio 42 and cour http://qa.ailao.eu:4567//q/854572441
gch000010       Who is the current Dalai Lama?                          correct         Tenzin Gyatso   Tenzin Gyatso   http://qa.ailao.eu:4567//q/847711277
gch000011       What is 2^23?                                           incorrect       8388608         the Gregorian c http://qa.ailao.eu:4567//q/894392439
gch000012       Who is the creator of Star Trek?                        correct         Gene Roddenberr Gene Roddenberr http://qa.ailao.eu:4567//q/1382088961
gch000013       In which city is the Eiffel Tower?                      correct         Paris           Paris           http://qa.ailao.eu:4567//q/841767182
gch000014       12 metric tonnes in kilograms?                          incorrect       12 *000         SI              http://qa.ailao.eu:4567//q/474652669
gch000015       Where is the mouth of the river Rhine?                  correct         the Netherlands the Netherlands http://qa.ailao.eu:4567//q/519546828
gch000016       Where is Buckingham Palace located?                     correct         London          London          http://qa.ailao.eu:4567//q/1500559645
gch000017       Who directed the movie The Green Mile?                  correct         Frank Darabont  Frank Darabont  http://qa.ailao.eu:4567//q/109783463
gch000018       When did Franklin D. Roosevelt die?                     correct         1945            1945            http://qa.ailao.eu:4567//q/335174260
gch000019       Who was the first man in space?                         correct         Yuri Gagarin    Yuri Gagarin    http://qa.ailao.eu:4567//q/333732629
gch000020       Where was the Peace of Westphalia signed?               incorrect       Osnabrück       France          http://qa.ailao.eu:4567//q/1894681131
gch000021       Who was the first woman to be awarded a Nobel Priz      incorrect       Marie Curie     Elinor Ostrom   http://qa.ailao.eu:4567//q/746167664
gch000022       12.1147 inches to yards?                                incorrect       0.3365194444    CUX 570 17 577  http://qa.ailao.eu:4567//q/1117248015
gch000023       What is the atomic number of potassium?                 correct         19              19              http://qa.ailao.eu:4567//q/1563084333
gch000024       Where is the Tiananmen Square?                          correct         China           China           http://qa.ailao.eu:4567//q/846536947
gch000025       What is the binomial name of horseradish?               correct         Armoracia Rusti Armoracia Rusti http://qa.ailao.eu:4567//q/1981959830
gch000026       How long did Albert Einstein live?                      incorrect       76              Germany         http://qa.ailao.eu:4567//q/242849537
gch000027       Who earned the most Academy Awards?                     recall          Walt Disney     Jimmy Stewart   http://qa.ailao.eu:4567//q/299677332
gch000028       How many lines does the London Underground have?        incorrect       11              Soho Revue Bar  http://qa.ailao.eu:4567//q/1412006804
gch000029       When is the next planned German Federal Convention      incorrect                       1850            http://qa.ailao.eu:4567//q/1686785663
correctly answered: 18
recall: 1
incorrect: 11
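
A minimal sketch of how the three indicators in the log above could be assigned. This is not the actual rest-eval.py; `classify` is a hypothetical helper, and it assumes that the gold answers are regular expressions (which would explain "12 *000"), that "correct" means the top-ranked answer matches, and that "recall" means a matching answer appears only further down the answer list.

```python
import re
from collections import Counter


def classify(gold_patterns, ranked_answers):
    """Label one question as 'correct', 'recall' or 'incorrect'.

    Assumptions (not taken from rest-eval.py): gold_patterns are regular
    expressions, and ranked_answers is the system's answer list, best first.
    """
    def matches(answer):
        return any(re.search(p, answer, re.IGNORECASE) for p in gold_patterns)

    if ranked_answers and matches(ranked_answers[0]):
        return "correct"    # the top-1 answer matches a gold pattern
    if any(matches(a) for a in ranked_answers[1:]):
        return "recall"     # only a lower-ranked answer matches
    return "incorrect"      # nothing matches, or there is no gold answer


# Illustrative answer lists; the lower-ranked candidates are made up.
labels = [
    classify(["Harare"], ["Harare", "Bulawayo"]),
    classify(["Walt Disney", "Katharine Hepburn"], ["Jimmy Stewart", "Walt Disney"]),
    classify(["12 *000"], ["SI"]),
    classify([], ["1850"]),
]
summary = Counter(labels)
print(summary)
```

Note that `re.search` is deliberately loose (a pattern like "19" would also match "1945"); a stricter scorer could use `re.fullmatch` instead.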

It seems that, combined with Wolfram Alpha, the system could answer at least 4 more non-factoid questions and probably get help with at least 2 factoids, which would bring it to 24. It's pretty likely, though, that Wolfram Alpha knows some of the other incorrect factoids too.

For your thesis, I'd also recommend comparing this to plain Wolfram Alpha and Google QA. For the latter, we have a script you can easily use in https://github.com/brmson/google-qa (though it may not correctly extract the answer from all non-movie-related results, as there is some variability in the HTML code).
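
To make the arithmetic behind that estimate explicit, here is a rough sketch. It assumes the 4 non-factoid questions are the arithmetic and unit-conversion ones among the 11 incorrect answers; the `is_computational` heuristic is hypothetical and only illustrates how such questions might be flagged for a computational backend. The "2 factoids" figure is simply the estimate from the comment.

```python
import re

# Hypothetical heuristic: a question is "computational" if it starts with a
# number or contains an arithmetic expression such as "7*158" or "2^23".
COMPUTATIONAL = re.compile(r"^\s*\d|\d\s*[-+*/^]\s*\d")


def is_computational(question):
    return bool(COMPUTATIONAL.search(question))


# The 11 questions marked "incorrect" in the log above.
incorrect_questions = [
    "What is 7*158 + 72 - 72 + 9?",
    "What is the equatorial radius of Earth's moon?",
    "What is the core body temperature of a human?",
    "What is 2^23?",
    "12 metric tonnes in kilograms?",
    "Where was the Peace of Westphalia signed?",
    "Who was the first woman to be awarded a Nobel Prize?",
    "12.1147 inches to yards?",
    "How long did Albert Einstein live?",
    "How many lines does the London Underground have?",
    "When is the next planned German Federal Convention?",
]

stock_correct = 18       # "correctly answered" from the log
non_factoid = sum(is_computational(q) for q in incorrect_questions)
factoid_estimate = 2     # estimate quoted in the comment, not measured here

projected = stock_correct + non_factoid + factoid_estimate
print(non_factoid, projected)  # 4 24
```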

Great work!

k0105 commented 8 years ago

Thank you very much for the hints - already done from the start, though. I compared my results to both and also confirmed the great synergies between Yoda and Wolfram that I've already reported to you after the pilot study (as you probably remember). The questions are a bit treacherous, btw: for this particular set, Yoda and Wolfram together answer almost all of the questions that my full system answers correctly, but on larger test sets it became apparent that a combination of only Wolfram and Yoda still leaves some holes, which I've been able to plug at least to some degree.

Btw: The evaluation is finished after 26 (13 male, 13 female) subjects. The system still takes the top spot.

[Just to document a minor discrepancy: I had 27 subjects, but couldn't find a 14th female subject, so I dropped one male candidate to get equal numbers by gender, which should slightly increase external validity. The one I dropped was chosen at random from all males and was neither the best nor the worst. Btw: The two best human contenders are male, the third best is female.]

brmson / yodaqa

Grand Challenge #38