the default use-case scenario is the crawler wanting to search a node while having no info about it at all
in this case the crawler is expected to ask the various knowledge backends for info
a policy must be defined for this:
try to answer the question from nodes already in memory
first ask wolfram alpha; if it answers the question, don't do anything else
if the question type can not be answered by crawling (e.g. how-to questions), select an appropriate backend
question type how-to -> scrape the wikihow backend
if the question can be answered by crawling, ask for missing nodes as needed
limit the asking to the relevant backends depending on the question type
set a priority for each backend for when more than one is needed
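the policy steps above can be sketched as a priority table plus a small selector; a minimal sketch, assuming a question-type taxonomy and backend names that are placeholders, not a fixed api:

```python
# hypothetical priority-ordered backends per question type; the labels
# and ordering here are assumptions for illustration
BACKEND_PRIORITY = {
    "how to": ["wolfram alpha", "wikihow"],
    "what is": ["wolfram alpha", "wikipedia", "dbpedia", "wikidata"],
    "definition": ["wordnik", "wikipedia"],
}

def pick_backends(question_type, node_in_memory):
    """return the backends to ask, in priority order; empty if memory suffices"""
    if node_in_memory:
        return []  # answer from nodes already in memory, ask nothing
    return BACKEND_PRIORITY.get(question_type, ["wolfram alpha"])
```

the fallback to wolfram alpha for unknown question types mirrors the "first ask wolfram alpha" rule above.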
limiting by question type is important, so let's define what each backend is good for:
wolfram alpha:
not a structured data collection, but answers the user right away; entities can be tagged in the answer sentence to populate the node-base; the api should be explored for ways to get more data useful for populating nodes
answers most questions
parse the answer for entities (question parser) to populate nodes, but other backends are required to populate their info
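a minimal sketch of tagging entities in a wolfram alpha answer sentence to seed nodes; the capitalized-word heuristic is a stand-in for the real question parser, not how it would actually work:

```python
# naive entity tagger: treat capitalized words in the answer sentence as
# candidate entity names to populate nodes with
def tag_entities(answer_sentence):
    """return candidate entity names found in an answer string"""
    words = answer_sentence.replace(",", " ").replace(".", " ").split()
    return [w for w in words if w[:1].isupper()]
```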
wikihow:
completely unstructured data; returns a list of how-tos but is usually not exactly on point
gives a list of how-to tutorials
short step-by-step
extended step-by-step
pictures of the steps
should be used exclusively for how-to questions, and only when wolfram misses the answer
no structured data can easily be taken from there to populate nodes
wikipedia:
unstructured data, but the most useful for the user; the parsed fields will most likely be the target info to be retrieved by the crawler
picture
short description
extended description
infobox fields
should only be used for the end node of the crawl
wikidata:
this is useful especially for getting secondary node connections and some minor info; a property lookup must be put in place for the properties
get parents
get short description
parsing the properties gives a kind of infobox and useful connections
get picture
not sure which nodes to call it for, maybe start and end nodes only
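the property lookup could start as a small mapping table; P279, P31 and P18 are real wikidata property ids, but the flat claims dict below is a simplified stand-in for the actual api response:

```python
# mapped wikidata properties for the fields listed above
WIKIDATA_PROPS = {
    "parents": "P279",      # subclass of
    "instance_of": "P31",   # instance of
    "picture": "P18",       # image
}

def lookup_property(claims, field):
    """pull the values of one mapped property out of a claims dict"""
    return claims.get(WIKIDATA_PROPS[field], [])
```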
dbpedia:
useful relationship data comes from here, especially for making connections
get picture
get parents
get synonyms
get cousins
get links
short subject description
could be called for all nodes; it's the most useful
wordnik:
this is a tricky one: it mostly gets grammatical relationships, but some fields are not very factual as long as they are "spoken language compatible"; an example case we don't want: "human" as a synonym of "fallible"
get synonyms
get antonyms
get rhymes
get cousins
get word-definitions
should be called when connections are missing and on start/end nodes
should always confirm with user before saving
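the confirm-before-save rule can be sketched as a filter over candidate connections; the confirm callback is a stand-in for an actual user prompt, and the triple format is an assumption:

```python
# keep only the wordnik synonym connections the user confirms, so
# "spoken language" pairs like human -> fallible never get saved
def save_wordnik_synonyms(node, synonyms, confirm):
    """return the connections the user approved for saving"""
    return [(node, "synonym", s) for s in synonyms if confirm(node, s)]
```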
conceptnet:
this is awesome for connections, however it suffers even more from difficult parsing; no fields can easily be put into a connection
examples of "non-parents":
frog -> is a: [u'frog', u'amphibi', u'adornment', u'amphibian', u'French person', u'capture']
chicken -> is a: [u'food', u'meat']
should be called when connections are missing and maybe on start/end nodes
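given the noisy "is a" lists above, one way to salvage conceptnet parents is to cross-check them against parents reported by a more reliable backend; a sketch, assuming trusted_parents comes from wikidata/dbpedia:

```python
# accept a conceptnet "is a" candidate only if another backend also
# lists it as a parent; drops self-references and junk like "capture"
def confirm_parents(node, is_a_results, trusted_parents):
    """keep only candidates that a trusted backend confirms as parents"""
    return [c for c in is_a_results
            if c != node and c in trusted_parents]
```

on the frog example this keeps only "amphibian" and discards the self-reference, the stemming fragment, and the slang entries.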
further thought is needed