Closed k0105 closed 8 years ago
Hi! That sounds really awesome! Please do keep us posted about this. :-) I'm also looking forward to learning more about your tricks and the details of your configuration. How many human contestants will you have? I assume they will be amateurs in the domain?
Btw., are any new releases planned for the near future (~6 weeks)? I'm currently on 1.4 and wonder whether I should upgrade to 1.5 or wait for the next version.
I'm planning to tag the current master as 1.6 whenever I have a moment to consolidate the wiki benchmarks list etc. It's retrained on a much higher-quality dataset of questions than before; otoh, if you are retraining the models with your own datasets, I don't think there are many other improvements.
(We're focusing on developing some neural language models right now, but we began trying to integrate them back into YodaQA a few days ago.)
Neural models sound great, looking forward to that.
The Grand Challenge is done. I used questions that a colleague of mine (who is not involved in my team or project) wrote for me, so I couldn't influence them in any way.
Most normal people could only answer around 15 questions correctly, but one particularly strong candidate managed to get 24 right. Afterwards, I ran my system against the same questions. Yoda itself answered 13 correctly; the complete system got 24, just like the best human contender.
So the result of my Grand Challenge of man vs. machine is: we are currently at a draw. We can win against average people, but we are "only" on par with the best.
@jbauer180266 - I'm wondering if you have results published somewhere? I'd love to take a look!
Not yet, soon. I'll let you know.
We have superhuman performance after all: I didn't activate the Bing backend when I did my tests, but with it enabled, exactly one additional question is answered correctly, which puts the system one ahead of the best human. 25 out of 30. Very nice.
Update: So far 16 people have taken the test and the result is the same: best human 24, system 25. Bad news: this time I don't have any kids around for the evaluation, so I can't say anything about humans under 18. Good news: all other age groups are covered, there are fairly many women (slightly under 50%), and almost all educational statuses and all English proficiency levels except "none" are covered, so external validity is decent. Under the assumption that older people know more and that PhD-level subjects are more "dangerous" to the system, this works in our favor - you could say internal validity is increased, I guess.
Congratulations! To test pure stock YodaQA on the challenge, I have created a small JSON dataset and dusted off data/eval/rest-eval.py.
[
{"qId": "gch000000", "qText": "What is the capital of Zimbabwe?", "answers": ["Harare"]},
{"qId": "gch000001", "qText": "Who invented the Otto engine?", "answers": ["Nikolaus Otto"]},
{"qId": "gch000002", "qText": "When was Pablo Picasso born?", "answers": ["1881"]},
{"qId": "gch000003", "qText": "What is 7*158 + 72 - 72 + 9?", "answers": ["1115"]},
{"qId": "gch000004", "qText": "Who wrote the novel The Light Fantastic?", "answers": ["Terry Pratchett"]},
{"qId": "gch000005", "qText": "In which city was Woody Allen born?", "answers": ["New York"]},
{"qId": "gch000006", "qText": "Who is the current prime minister of Italy?", "answers": ["Matteo Renzi"]},
{"qId": "gch000007", "qText": "What is the equatorial radius of Earth's moon?", "answers": ["1738"]},
{"qId": "gch000008", "qText": "When did the Soviet Union dissolve?", "answers": ["1991"]},
{"qId": "gch000009", "qText": "What is the core body temperature of a human?", "answers": ["37", "98.6"]},
{"qId": "gch000010", "qText": "Who is the current Dalai Lama?", "answers": ["Tenzin Gyatso"]},
{"qId": "gch000011", "qText": "What is 2^23?", "answers": ["8388608"]},
{"qId": "gch000012", "qText": "Who is the creator of Star Trek?", "answers": ["Gene Roddenberry"]},
{"qId": "gch000013", "qText": "In which city is the Eiffel Tower?", "answers": ["Paris"]},
{"qId": "gch000014", "qText": "12 metric tonnes in kilograms?", "answers": ["12 *000"]},
{"qId": "gch000015", "qText": "Where is the mouth of the river Rhine?", "answers": ["the Netherlands"]},
{"qId": "gch000016", "qText": "Where is Buckingham Palace located?", "answers": ["London"]},
{"qId": "gch000017", "qText": "Who directed the movie The Green Mile?", "answers": ["Frank Darabont"]},
{"qId": "gch000018", "qText": "When did Franklin D. Roosevelt die?", "answers": ["1945"]},
{"qId": "gch000019", "qText": "Who was the first man in space?", "answers": ["Yuri Gagarin"]},
{"qId": "gch000020", "qText": "Where was the Peace of Westphalia signed?", "answers": ["Osnabrück", "Münster", "Westphalia"]},
{"qId": "gch000021", "qText": "Who was the first woman to be awarded a Nobel Prize?", "answers": ["Marie Curie"]},
{"qId": "gch000022", "qText": "12.1147 inches to yards?", "answers": ["0.3365194444"]},
{"qId": "gch000023", "qText": "What is the atomic number of potassium?", "answers": ["19"]},
{"qId": "gch000024", "qText": "Where is the Tiananmen Square?", "answers": ["China"]},
{"qId": "gch000025", "qText": "What is the binomial name of horseradish?", "answers": ["Armoracia Rusticana"]},
{"qId": "gch000026", "qText": "How long did Albert Einstein live?", "answers": ["76"]},
{"qId": "gch000027", "qText": "Who earned the most Academy Awards?", "answers": ["Walt Disney", "Katharine Hepburn"]},
{"qId": "gch000028", "qText": "How many lines does the London Underground have?", "answers": ["11"]},
{"qId": "gch000029", "qText": "When is the next planned German Federal Convention?", "answers": []}
]
$ data/eval/rest-eval.py data/eval/gch.json http://qa.ailao.eu:4567/
ID Question Text indicator correct answer found URL
gch000000 What is the capital of Zimbabwe? correct Harare Harare http://qa.ailao.eu:4567//q/1607764502
gch000001 Who invented the Otto engine? correct Nikolaus Otto Nikolaus Otto http://qa.ailao.eu:4567//q/759499198
gch000002 When was Pablo Picasso born? correct 1881 1881 http://qa.ailao.eu:4567//q/615363092
gch000003 What is 7*158 + 72 - 72 + 9? incorrect 1115 78.182.71.65 78 http://qa.ailao.eu:4567//q/1320706932
gch000004 Who wrote the novel The Light Fantastic? correct Terry Pratchett Terry Pratchett http://qa.ailao.eu:4567//q/554560810
gch000005 In which city was Woody Allen born? correct New York New York http://qa.ailao.eu:4567//q/2059328554
gch000006 Who is the current prime minister of Italy? correct Matteo Renzi Matteo Renzi http://qa.ailao.eu:4567//q/958822255
gch000007 What is the equatorial radius of Earth's moon? incorrect 1738 the Moon and Su http://qa.ailao.eu:4567//q/1033514544
gch000008 When did the Soviet Union dissolve? correct 1991 1991 http://qa.ailao.eu:4567//q/913856166
gch000009 What is the core body temperature of a human? incorrect 37 Bio 42 and cour http://qa.ailao.eu:4567//q/854572441
gch000010 Who is the current Dalai Lama? correct Tenzin Gyatso Tenzin Gyatso http://qa.ailao.eu:4567//q/847711277
gch000011 What is 2^23? incorrect 8388608 the Gregorian c http://qa.ailao.eu:4567//q/894392439
gch000012 Who is the creator of Star Trek? correct Gene Roddenberr Gene Roddenberr http://qa.ailao.eu:4567//q/1382088961
gch000013 In which city is the Eiffel Tower? correct Paris Paris http://qa.ailao.eu:4567//q/841767182
gch000014 12 metric tonnes in kilograms? incorrect 12 *000 SI http://qa.ailao.eu:4567//q/474652669
gch000015 Where is the mouth of the river Rhine? correct the Netherlands the Netherlands http://qa.ailao.eu:4567//q/519546828
gch000016 Where is Buckingham Palace located? correct London London http://qa.ailao.eu:4567//q/1500559645
gch000017 Who directed the movie The Green Mile? correct Frank Darabont Frank Darabont http://qa.ailao.eu:4567//q/109783463
gch000018 When did Franklin D. Roosevelt die? correct 1945 1945 http://qa.ailao.eu:4567//q/335174260
gch000019 Who was the first man in space? correct Yuri Gagarin Yuri Gagarin http://qa.ailao.eu:4567//q/333732629
gch000020 Where was the Peace of Westphalia signed? incorrect Osnabrück France http://qa.ailao.eu:4567//q/1894681131
gch000021 Who was the first woman to be awarded a Nobel Priz incorrect Marie Curie Elinor Ostrom http://qa.ailao.eu:4567//q/746167664
gch000022 12.1147 inches to yards? incorrect 0.3365194444 CUX 570 17 577 http://qa.ailao.eu:4567//q/1117248015
gch000023 What is the atomic number of potassium? correct 19 19 http://qa.ailao.eu:4567//q/1563084333
gch000024 Where is the Tiananmen Square? correct China China http://qa.ailao.eu:4567//q/846536947
gch000025 What is the binomial name of horseradish? correct Armoracia Rusti Armoracia Rusti http://qa.ailao.eu:4567//q/1981959830
gch000026 How long did Albert Einstein live? incorrect 76 Germany http://qa.ailao.eu:4567//q/242849537
gch000027 Who earned the most Academy Awards? recall Walt Disney Jimmy Stewart http://qa.ailao.eu:4567//q/299677332
gch000028 How many lines does the London Underground have? incorrect 11 Soho Revue Bar http://qa.ailao.eu:4567//q/1412006804
gch000029 When is the next planned German Federal Convention incorrect 1850 http://qa.ailao.eu:4567//q/1686785663
correctly answered: 18
recall: 1
incorrect: 11
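For readers who want to reproduce a tally like the one above without the original script, here is a minimal sketch in Python. It is not the code of rest-eval.py; it rests on two assumptions: that the gold "answers" entries are treated as case-insensitive regular expressions (which would explain the "12 *000" entry), and that "recall" means a gold answer was matched somewhere below the top-ranked answer. The function names `classify` and `tally` are made up for illustration.

```python
import re
from collections import Counter


def classify(gold_patterns, ranked_answers):
    """Label one question as 'correct', 'recall', or 'incorrect'.

    Assumption: gold answers are regexes, matched case-insensitively.
    Assumption: 'recall' = a gold answer matches a lower-ranked candidate.
    """
    def matches(answer):
        return any(re.search(p, answer, re.IGNORECASE) for p in gold_patterns)

    if ranked_answers and matches(ranked_answers[0]):
        return "correct"
    if any(matches(a) for a in ranked_answers[1:]):
        return "recall"
    return "incorrect"


def tally(questions, system_answers):
    """Count labels over a dataset; system_answers maps qId -> ranked answer list."""
    counts = Counter()
    for q in questions:
        counts[classify(q["answers"], system_answers.get(q["qId"], []))] += 1
    return counts


if __name__ == "__main__":
    # Three entries taken from the dataset above.
    questions = [
        {"qId": "gch000000", "answers": ["Harare"]},
        {"qId": "gch000014", "answers": ["12 *000"]},
        {"qId": "gch000027", "answers": ["Walt Disney", "Katharine Hepburn"]},
    ]
    # Top answers are from the log above; the second-ranked "Walt Disney"
    # is a hypothetical candidate added only to illustrate the 'recall' case.
    system_answers = {
        "gch000000": ["Harare"],
        "gch000014": ["SI"],
        "gch000027": ["Jimmy Stewart", "Walt Disney"],
    }
    print(dict(tally(questions, system_answers)))
```

A question with an empty gold list, like gch000029, can never be labelled correct under this scheme, which is consistent with it showing up as incorrect in the log.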
It seems that, combined with Wolfram Alpha, the system could answer at least 4 more of the non-factoid questions, and Wolfram Alpha could probably help with at least 2 factoids - which would bring it to 24. It's also pretty likely that Wolfram Alpha knows some of the other incorrectly answered factoids.
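The arithmetic behind that estimate can be spelled out in a few lines of Python. All numbers come from this thread; the "at least" figures are lower bounds, so the combined value is a lower bound as well, and the variable names are only for illustration.

```python
TOTAL_QUESTIONS = 30

stock_correct = 18         # stock YodaQA, from the rest-eval.py summary
wolfram_non_factoid = 4    # "at least 4" non-factoid questions
wolfram_factoid = 2        # "at least 2" factoids

# Lower bound for stock YodaQA combined with Wolfram Alpha.
combined_lower_bound = stock_correct + wolfram_non_factoid + wolfram_factoid

best_human = 24            # best human contender reported in this thread
full_system = 25           # full system with the Bing backend enabled

for label, score in [
    ("stock YodaQA", stock_correct),
    ("YodaQA + Wolfram Alpha (lower bound)", combined_lower_bound),
    ("best human", best_human),
    ("full system", full_system),
]:
    print(f"{label}: {score}/{TOTAL_QUESTIONS} ({score / TOTAL_QUESTIONS:.0%})")
```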
For your thesis, I'd also recommend comparing this to plain Wolfram Alpha and Google QA. For the latter, we have a script you can easily use at https://github.com/brmson/google-qa (though it may not correctly extract the answer from all non-movie-related results, since there is some variability in the HTML code).
Great work!
Thank you very much for the hints - I had already done that from the start, though. I compared my results to both and also confirmed the great synergies between Yoda and Wolfram that I reported to you after the pilot study (as you probably remember). The questions are a bit treacherous, btw. - for this particular set, Yoda and Wolfram together answer almost as many questions correctly as my full system, but on larger test sets it became apparent that relying only on a combination of Wolfram and Yoda still leaves some holes, which I've been able to plug at least to some degree.
Btw: The evaluation is finished after 26 (13 male, 13 female) subjects. The system still takes the top spot.
[Just to document a minor discrepancy: I had 27 subjects, but couldn't find a 14th female subject, so I dropped one male candidate to get equal numbers by gender, which should slightly increase external validity. The one I dropped was neither the best nor the worst and was chosen randomly from all males. Btw: The two best human contenders are male, the third best is female.]
Hi,
just wanted to let you know that I plan on having a competition between human contestants and "my" Yoda version in mid to late April as a grand finale of my contributions so far. Thanks to some tricks, I currently get about twice as many correct (top-1) answers in one third of the time compared with the default configuration, so I'm cautiously optimistic the system could win this challenge. Hence, unless some higher power prevents it, this will take place.
It might not involve Jeopardy grand champions or be covered on live TV, but if Yoda with some additions can win against people running around a university, that would be a great milestone imho. I'm currently somewhere between fear and excitement and will keep you posted about the results.
Best wishes, Joe