Quick Intro
Watsonsim works using a pipeline of operations on questions, candidate answers, and their supporting passages. In many ways it is similar to IBM's Watson and Petr's YodaQA, and much less similar to logic-based systems like OpenCog or Wolfram Alpha. Still, there are significant differences even from Watson and YodaQA:
- We don't use a standard UIMA pipeline, a product of our student-project history. This is sometimes a hindrance but typically has little impact, and we suspect it reduces the learning overhead and boilerplate code.
- Unlike YodaQA, we target Jeopardy! questions, but we do incorporate their method of Lexical Answer Type (LAT) checking, in addition to our own.
- Our framework is rather heavyweight computationally: depending on which modules are enabled, answering a question takes between about 1 second and 2 minutes. Indri improves accuracy, but it is now an optional feature, one we highly recommend. (We are investigating alternatives as well.)
- We include (relatively) large amounts of preprocessed article text from Wikipedia as our inputs. Be prepared to use about 100GB of space if you want to try it out at its full power.
Installing the Simulator
- Use git to clone this repository:

  ```shell
  git clone https://github.com/SeanTater/uncc2014watsonsim.git
  ```
- Install Java 8
- Install the libSVM machine learning library (native)
- Download Gradle (just unzip it; keep in mind it updates very often)
- Download the latest data and place them in the data/ directory
- Copy the configuration file `config.properties.sample` to `config.properties` and customize it to your liking
- Run `gradle eclipse -Ptarget` in `uncc2014watsonsim/` to download platform-independent dependencies and create an Eclipse project
- Possibly enable some Optional Features
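Condensed, the non-Eclipse parts of the setup look roughly like this. This is a sketch, not the official procedure: it uses a scratch directory to stand in for the cloned repository (so it runs without network access), and the `search.indri` key is a made-up placeholder, not a real configuration property:

```shell
#!/bin/sh
# Scratch directory standing in for the cloned uncc2014watsonsim repo
repo=$(mktemp -d)

# Step: place the downloaded data in data/ (contents omitted here)
mkdir -p "$repo/data"

# Pretend the repo ships a sample config; 'search.indri' is a placeholder key
printf 'search.indri=true\n' > "$repo/config.properties.sample"

# Step: copy the sample config, then customize it to your liking
cp "$repo/config.properties.sample" "$repo/config.properties"

# Confirm the live config exists and carries the sample's settings
grep 'search.indri' "$repo/config.properties"
```

After this, the Gradle step (`gradle eclipse -Ptarget`) runs from inside the repository directory.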
Running the Simulator
We recommend running the simulator with Gradle:

```shell
gradle run -Ptarget=WatsonSim
```
But if you prefer, you can also use Eclipse. First create a project:

```shell
gradle eclipse -Ptarget
```

Then you can run WatsonSim.java directly.
There are a few other features as well:

```shell
# Generate statistics reports for accuracy and other measurements
gradle run -Ptarget=scripts.ParallelStats

# Regenerate the Indri, Lucene, SemanticVectors, Bigram and Edge indices
gradle run -Ptarget=index.Reindex
```
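If you run these targets often, a tiny wrapper can save typing. This is a hypothetical convenience, not part of the project; it just applies the `-Ptarget` convention shown above (the `GRADLE` override exists only so the snippet can be dry-run without Gradle installed):

```shell
#!/bin/sh
# Dry-run by default so the example works without Gradle installed;
# set GRADLE=gradle in the environment to actually invoke the build.
GRADLE="${GRADLE:-echo gradle}"

# wsim CLASSNAME -- run any main class through Gradle's -Ptarget flag
wsim() {
    $GRADLE run -Ptarget="$1"
}

wsim WatsonSim            # prints: gradle run -Ptarget=WatsonSim
wsim scripts.ParallelStats
```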
Technologies Involved
This list isn't exhaustive, but it should give a good overview:
- Search
- Text search from Lucene and Indri (Terrier upcoming)
- Web search from Bing (Google is in the works)
- Relational queries using PostgreSQL and SQLite
- Linked data queries using Jena
- Sources
- Text from all the articles in Wikipedia, Simple Wikipedia, Wiktionary, and Wikiquote
- Linked data from DBPedia, used for LAT detection
- Wikipedia pageviews organized by article
- Source, target, and label from all links in Wikipedia
- Machine learning with Weka and libSVM
- Text parsing and dependency generation from CoreNLP and OpenNLP
- Parsing logic in Prolog (with TuProlog)
Notes:
- You should probably consider using PostgreSQL if you scale this project to more than a few cores, or to any distributed environment; the project supports both database engines, but SQLite is not designed for heavy concurrent access.
- The data is sizable and growing, especially the statistics reports: 154.5 GB as of this writing.
- Can't find libindri-jni? Make sure you enabled Java and SWIG and had the right dependencies when compiling Indri.
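For reference, an Indri source build that produces the JNI bindings looks roughly like the following. The flag names here are assumptions from memory, not verified against any particular Indri release; check `./configure --help` in your Indri source tree before relying on them:

```shell
# Hypothetical Indri build sketch -- the configure flags are assumptions,
# verify them with ./configure --help for your Indri version.
tar xzf indri-5.*.tar.gz && cd indri-5.*
./configure --enable-java --with-javahome="$JAVA_HOME"   # assumed flags
make
sudo make install   # should install libindri-jni where the JVM can find it
```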
Giving Back
Do you like this project? Then help make it better! We can use all kinds of help, whether you're a scientist, an engineer, or just a curious user!
Also, you may be interested to read (or to cite!) our paper:
```bibtex
@TechReport{GallagherTR2014,
  author      = {Gallagher, Sean and Zadrozny, Wlodek W. and Shalaby, Walid and Avadhani, Adarsh},
  title       = {Watsonsim: Overview of a Question Answering Engine},
  institution = {University of North Carolina at Charlotte},
  month       = {December},
  year        = {2014},
}
```