the default use-case scenario is the crawler wanting to search a node while having no info about it at all
in this case the crawler is expected to ask the various knowledge backends for info
a policy must be defined for this:
try to answer the question from nodes already in memory
first ask wolfram alpha; if it answers the question, don't do anything else
if the question type can not be answered by crawling (e.g. how-to questions), select an appropriate backend
question type how-to -> scrape the wikihow backend
if the question can be answered by crawling, ask for missing nodes as needed
limit the asking to the relevant backends depending on the question type
set a priority for each backend for when more than one is needed
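the policy steps above can be sketched as a priority table plus a small selector; a minimal sketch, assuming a question-type taxonomy and backend names that are placeholders, not a fixed api:

```python
# hypothetical priority-ordered backends per question type; the labels
# and ordering here are assumptions for illustration
BACKEND_PRIORITY = {
    "how to": ["wolfram alpha", "wikihow"],
    "what is": ["wolfram alpha", "wikipedia", "dbpedia", "wikidata"],
    "definition": ["wordnik", "wikipedia"],
}

def pick_backends(question_type, node_in_memory):
    """return the backends to ask, in priority order; empty if memory suffices"""
    if node_in_memory:
        return []  # answer from nodes already in memory, ask nothing
    return BACKEND_PRIORITY.get(question_type, ["wolfram alpha"])
```

the fallback to wolfram alpha for unknown question types mirrors the "first ask wolfram alpha" rule above.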
limiting by question type is important, so let's define what each backend is good for:
wolfram alpha:
not a structured data collection, but answers the user right away; entities can be tagged in the answer sentence to populate the node-base; the api should be explored for ways to get more data useful for populating nodes
answers most questions
parse the answer for entities (question parser) to populate nodes, but other backends are required to populate their info
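a minimal sketch of tagging entities in a wolfram alpha answer sentence to seed nodes; the capitalized-word heuristic is a stand-in for the real question parser, not how it would actually work:

```python
# naive entity tagger: treat capitalized words in the answer sentence as
# candidate entity names to populate nodes with
def tag_entities(answer_sentence):
    """return candidate entity names found in an answer string"""
    words = answer_sentence.replace(",", " ").replace(".", " ").split()
    return [w for w in words if w[:1].isupper()]
```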
wikihow:
completely unstructured data; returns a list of how-tos but is usually not exactly on point
gives a list of how-to tutorials
short step-by-step
extended step-by-step
pictures of the steps
should be used exclusively for how-to questions, and only when wolfram misses the answer
no structured data can easily be taken from there to populate nodes
wikipedia:
unstructured data, but the most useful for the user; the parsed fields will most likely be the target info to be retrieved by the crawler
picture
short description
extended description
infobox fields
should only be used for the end node of the crawl
wikidata:
this is useful especially for getting secondary node connections and some minor info; a property lookup must be put in place for the properties
get parents
get short description
parsing the properties gives a kind of infobox and useful connections
get picture
not sure which nodes to call it for, maybe start and end nodes only
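the property lookup could start as a small mapping table; P279, P31 and P18 are real wikidata property ids, but the flat claims dict below is a simplified stand-in for the actual api response:

```python
# mapped wikidata properties for the fields listed above
WIKIDATA_PROPS = {
    "parents": "P279",      # subclass of
    "instance_of": "P31",   # instance of
    "picture": "P18",       # image
}

def lookup_property(claims, field):
    """pull the values of one mapped property out of a claims dict"""
    return claims.get(WIKIDATA_PROPS[field], [])
```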
dbpedia:
useful relationship data comes from here, especially for making connections
get picture
get parents
get synonyms
get cousins
get links
short subject description
could be called for all nodes; it's the most useful
wordnik:
this is a tricky one: it mostly gets grammatical relationships, but some fields are not very factual as long as they are "spoken language compatible"; an example case we don't want: "human" as a synonym of "fallible"
get synonyms
get antonyms
get rhymes
get cousins
get word-definitions
should be called when connections are missing and on start/end nodes
should always confirm with user before saving
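the confirm-before-save rule can be sketched as a filter over candidate connections; the confirm callback is a stand-in for an actual user prompt, and the triple format is an assumption:

```python
# keep only the wordnik synonym connections the user confirms, so
# "spoken language" pairs like human -> fallible never get saved
def save_wordnik_synonyms(node, synonyms, confirm):
    """return the connections the user approved for saving"""
    return [(node, "synonym", s) for s in synonyms if confirm(node, s)]
```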
conceptnet:
this is awesome for connections, however it suffers even more from difficult parsing; no fields can easily be put into a connection
examples of "non-parents":
frog -> is a: [u'frog', u'amphibi', u'adornment', u'amphibian', u'French person', u'capture']
chicken -> is a: [u'food', u'meat']
should be called when connections are missing and maybe on start/end nodes
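given the noisy "is a" lists above, one way to salvage conceptnet parents is to cross-check them against parents reported by a more reliable backend; a sketch, assuming trusted_parents comes from wikidata/dbpedia:

```python
# accept a conceptnet "is a" candidate only if another backend also
# lists it as a parent; drops self-references and junk like "capture"
def confirm_parents(node, is_a_results, trusted_parents):
    """keep only candidates that a trusted backend confirms as parents"""
    return [c for c in is_a_results
            if c != node and c in trusted_parents]
```

on the frog example this keeps only "amphibian" and discards the self-reference, the stemming fragment, and the slang entries.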
further thought is needed