Concept Crawler - Githubissues

JarbasAI commented 7 years ago

A mechanism is needed to give relational information about nodes. The following data is available for each node:

parents - every node is an example of its parent
children - a node may or may not be an example of its child, but children are always examples of this node
generation of children and parents - relevance measurement
antonyms - this node can never connect to an antonym
synonyms - these nodes mean the same
data - arbitrary data to be retrieved about this concept

parent: mammal is parent of cow, therefore cow is a mammal
child: dolly is child of cow, therefore dolly is a cow; a cow may or may not be dolly
generation: cow is more related to mammal than to living being
data: "cows are mad!"
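
A minimal sketch of how one such node could be stored, assuming a plain Python dataclass. The name `ConceptNode` is used later in this thread, but the field layout and the `is_a` helper below are illustrative assumptions, not the LILACS implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptNode:
    """Hypothetical node layout; field names are assumptions for illustration."""
    name: str
    # parent name -> generation (1 = direct parent, 2 = grandparent, ...)
    parents: dict = field(default_factory=dict)
    # child name -> generation
    children: dict = field(default_factory=dict)
    antonyms: set = field(default_factory=set)
    synonyms: set = field(default_factory=set)
    data: dict = field(default_factory=dict)

    def is_a(self, other: str) -> bool:
        # Every node is an example of each of its parents.
        return other in self.parents


cow = ConceptNode(
    name="cow",
    parents={"mammal": 1, "living being": 2},
    children={"dolly": 1},
    data={"note": "cows are mad!"},
)
dolly = ConceptNode(name="dolly", parents={"cow": 1, "mammal": 2})
```

Storing the generation next to each parent keeps the "cow is more related to mammal than to living being" ordering available without a second lookup.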

The knowledge engine will need several tools to navigate these connections and identify the relevant nodes.

The following should be tracked to direct the crawl and extract meaning from it:

number of hops until a concept is reached (minimize node travel distance)
strength of connections between hops (choose the best connections when choosing a route)
antonyms (nodes to avoid, path from here on is invalid)
synonyms (jump to synonym for better path and conclusions)

gens: if "cow" is a "mammal", search "mammal" connections before "living being" connections
antonyms: if "dolly the cow" is a child of dead and I reached ConceptNode alive, this path is wrong
synonyms: if "trump" is a synonym of "current US president" and "current US president" isn't in the search path, add "current US president" and its connections to the search path
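
The three rules above could be combined into one "what do I look at next" helper. This is only a sketch: the function name `next_candidates` and the dict-based storage format are assumptions made for the example.

```python
def next_candidates(current, parents, synonyms, antonyms, visited):
    """Return names to search next, nearest generation first.

    parents:  name -> {parent name: generation}   (hypothetical format)
    synonyms: name -> set of synonym names
    antonyms: name -> set of antonym names
    visited:  list of names already on the search path
    """
    # antonyms rule: antonyms of anything already on the path are invalid.
    blocked = set()
    for name in visited:
        blocked |= antonyms.get(name, set())

    # synonyms rule: synonyms missing from the path are added first.
    candidates = [s for s in sorted(synonyms.get(current, set()))
                  if s not in visited]

    # gens rule: closer generations are searched before distant ones.
    by_gen = sorted(parents.get(current, {}).items(), key=lambda kv: kv[1])
    candidates += [p for p, _ in by_gen if p not in visited]

    # Drop blocked names and duplicates while keeping the order.
    return [c for c in dict.fromkeys(candidates) if c not in blocked]
```

For example, `next_candidates("cow", {"cow": {"living being": 2, "mammal": 1}}, {}, {}, ["cow"])` puts "mammal" ahead of "living being".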

Question: How do we minimize the number of hops during the crawl itself?

ConceptCrawler will be the base class responsible for:

creating a "search tree" starting at the target node
traveling along the tree to search nodes / retrieve data from nodes
giving info about tree size, changing tree size (only keep N hops, other concepts will probably be unrelated), updating the tree
retrieving the distance between nodes and other crawl data

This data should then be ready to be consumed by other applications, which can deduce meaning from it
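
One possible way to build such a tree and get hop distances at the same time is a breadth-first walk over the parent links. Breadth-first search is a choice made for this sketch, not something the thread prescribes; `build_search_tree` and the dict-of-lists format are hypothetical. The example graph is the joana graph used later in this thread.

```python
from collections import deque


def build_search_tree(start, parents, max_hops):
    """Breadth-first walk over parent links; returns {node: hops from start}.

    parents: name -> list of parent names (hypothetical storage format).
    Nodes further away than max_hops are left out as probably unrelated.
    """
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand past the configured depth
        for parent in parents.get(node, []):
            if parent not in hops:  # first time reached = fewest hops
                hops[parent] = hops[node] + 1
                queue.append(parent)
    return hops


PARENTS = {
    "joana": ["human", "female"],
    "human": ["mammal", "animal"],
    "mammal": ["animal"],
    "animal": ["alive"],
    "female": ["human"],
}
tree = build_search_tree("joana", PARENTS, max_hops=2)
```

Because a breadth-first walk reaches every node by its shortest route first, the recorded hop count is also the minimum distance from the start node, and the size of the tree is simply `len(tree)`.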

JarbasAI commented 7 years ago

Drunk_Crawl idea (so called because it just stumbles around looking for familiar things until the target is reached)

crawl up - this will answer questions of the sort "is CenterNode a TargetNode?"

- start at CenterNode and build a concept_tree with all parent (and parents of parents...) nodes and N layers/hops (depth configurable)
- while not in TargetNode
     - check if CurrentNode has synonyms, if yes prefer the synonym and the smaller gen parents of the synonym as next node
     - check if CurrentNode is antonym of any previous node, if yes go back and choose another
     - choose a random node coming out from CurrentNode, prefer higher gens
     - if no next node go back and search next guess
     - if end of tree return False
- return True
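
The steps above can be sketched roughly as follows. This is a simplified illustration, not the committed implementation: the random choice is not weighted by generation, the antonym check looks at every crawled node rather than only the current path, and the names and dict-of-lists format are assumptions.

```python
import random


def drunk_crawl(start, target, parents, synonyms=None, antonyms=None,
                max_hops=5, rng=None):
    """Answer "is start a target?" by stumbling up parent links.

    parents / synonyms / antonyms: name -> list of names (hypothetical format).
    """
    synonyms = synonyms or {}
    antonyms = antonyms or {}
    rng = rng or random.Random()

    # Step 1: concept tree = every node within max_hops links of start.
    tree, frontier = {start}, [start]
    for _ in range(max_hops):
        reached = []
        for node in frontier:
            for nxt in parents.get(node, []) + synonyms.get(node, []):
                if nxt not in tree:
                    tree.add(nxt)
                    reached.append(nxt)
        frontier = reached

    # Step 2: stumble around the tree; the stack lets us go back and
    # try the next guess when a branch runs out.
    crawled, stack = [], [start]
    while stack:
        node = stack.pop()
        if node in crawled or node not in tree:
            continue
        # Skip a node that is an antonym of something already crawled.
        if any(node in antonyms.get(seen, []) for seen in crawled):
            continue
        crawled.append(node)
        if node == target:
            return True
        options = list(parents.get(node, []))
        rng.shuffle(options)
        # Synonyms are pushed last so they are popped (tried) first.
        stack.extend(options + synonyms.get(node, []))
    return False  # end of tree reached without finding the target
```

The order in which nodes are tried is random, but the crawl is exhaustive within the tree, so without antonyms the True/False answer does not depend on the random choices.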

JarbasAI commented 7 years ago

addendum: Learning Crawl

in some cases we will reach an already visited node, example:

human has child -> human female
joana has parents -> human, human female

joana is human
human is ['mammal', 'animal']
animal is ['alive']
mammal is animal <- stronger relation to animal
joana is female
female is ['human'] <- stronger relation to human

when a node is revisited, the crawl should prefer a not yet visited node reachable from that revisited node as the next step, instead of simply ignoring the connection and moving on to the next node

keep track of the number of times we came across this node during the crawl
if we came across this node more often than the average number of times we come across each node, check if we already checked its parents and prefer this path
if we visited this node more than a threshold number of times, check its children; if any of the children is a parent of any of the visited nodes, prefer a connection to one of these nodes

an example of crawling; I will use the format current_node:[parents] - number of visits for each step

human:[mammal, animal, ape, hominid, living being] -1
animal:[living being]-1
living being: [] - 1
mammal:[animal, living being] - 1
animal:[living being] - 2 <- living being already checked, bypass but increase count - living being = 2
living being (from mammal) <- already checked, bypass but increase count - living being = 3
ape: [mammal, animal, living being, omnivore] - 1 <- all nodes already checked, bypass but increase count

we crawled:
animal -> 3 times
living being -> 4 times
mammal -> 2 times
ape -> 1 time
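
This tally can be reproduced with a small sketch, assuming each node is expanded once and every later encounter only bumps its counter. The function name and the dict-of-lists format are assumptions; `hominid` and `omnivore` get no parents here because the thread lists none.

```python
from collections import Counter, deque


def crawl_with_counts(start, parents):
    """Walk up the parent links and count how often each node is met."""
    visits = Counter()
    expanded = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        visits[node] += 1
        if node in expanded:
            continue  # already checked: bypass, but the count went up
        expanded.add(node)
        queue.extend(parents.get(node, []))
    return visits


PARENTS = {
    "human": ["mammal", "animal", "ape", "hominid", "living being"],
    "animal": ["living being"],
    "living being": [],
    "mammal": ["animal", "living being"],
    "ape": ["mammal", "animal", "living being", "omnivore"],
}
visits = crawl_with_counts("human", PARENTS)
# animal: 3, living being: 4, mammal: 2, ape: 1
# (this sketch also reaches hominid and omnivore once each)

# Nodes seen at least 3 times.
hot = sorted(n for n, count in visits.items() if count >= 3)
```

In this sketch a node's count ends up equal to the number of crawled nodes that list it as a parent, which is one way to read "how strongly is this concept tied to the start node".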

this information is useful to be consumed by questions of the kind: "talk about humans", using the above crawl we would get

humans are animals
humans, like animals, are living beings
humans are mammals
humans are apes

if during the crawl we check for a threshold of 3, living beings and animals would be above the threshold:

- check children of living beings and animals and update visit counter
       - living beings: oxygen breathing organism, animals (+1)
       - animals: herbivore, carnivore, omnivore

animals can be herbivores, carnivores or omnivores; are humans carnivores?
- update nodes on answer, and crawl this node

if, instead of asking, we are just crawling without user interaction, we could check whether any of the children is a parent of a visited node that we didn't check yet. We didn't visit omnivore from ape yet in this crawl, and it is a child of animal, so we should prefer this node for crawling next

humans are omnivores like apes
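
A sketch of that "no user interaction" choice, using hand-written data from the example above. The function name `preferred_next` and the data layout are assumptions for illustration.

```python
def preferred_next(visits, threshold, children, parents, crawled):
    """Children of often-visited nodes that are uncrawled parents of a crawled node."""
    hot = [n for n, count in visits.items() if count >= threshold]
    picks = []
    for node in hot:
        for child in children.get(node, []):
            if child in crawled or child in picks:
                continue
            # Prefer the child only if some crawled node lists it as a parent.
            if any(child in parents.get(seen, []) for seen in crawled):
                picks.append(child)
    return picks


VISITS = {"human": 1, "animal": 3, "living being": 4, "mammal": 2, "ape": 1}
CHILDREN = {
    "living being": ["oxygen breathing organism", "animal"],
    "animal": ["herbivore", "carnivore", "omnivore"],
}
PARENTS = {
    "human": ["mammal", "animal", "ape", "hominid", "living being"],
    "ape": ["mammal", "animal", "living being", "omnivore"],
}
CRAWLED = {"human", "animal", "living being", "mammal", "ape"}

suggestion = preferred_next(VISITS, 3, CHILDREN, PARENTS, CRAWLED)
```

With this data the only suggestion is "omnivore": it is a child of animal, a parent of ape, and has not been crawled yet.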

this should be good for learning

ElliotTheRobot commented 7 years ago

Very well explained. Yes the "visit count" is a good idea as it will prevent Lilacs from always returning the same standard answer.

Question: Are humans apes?
First time (with 'animal' as the only parent node shared by both), Mycroft responds:
A: Humans and apes are animals, but a human is not an ape.

Question: Are humans omnivores?
LILACS researches humans as omnivores and adds the omnivore parent node. Mycroft responds:
A: Yes, humans are omnivores.

Question: Are apes omnivores?
LILACS researches apes and adds omnivore as a parent node. Mycroft responds:
A: Yes, apes are omnivores.

Now... Same question as before: Are humans apes?
LILACS finds the ape and human nodes but now crawls the more proximal parent nodes with a lower visit count, and Mycroft responds:
A: Yes, humans and apes are omnivores, but a human is not an ape.

So in essence, the answer becomes more accurate over time. This kind of machine learning will make Mycroft provide more accurate responses as the LILACS system learns.

ElliotTheRobot commented 7 years ago

We could also use the concept of "Supernodes" (nodes with more than X children) to create 'areas' that will help speed up the crawl rate when searching for a concept.

The human brain works this way: we have different areas of our brain that deal with different types of information.

So if a search deals with 'human' and 'ape' then we can deduce that we are dealing with two nodes in the 'omnivore' supernode.

Here's an example diagram to explain: [image: supernodes]
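
A rough sketch of the supernode idea, assuming a supernode is simply any node with more than a chosen number of children. The function names, the toy graph (including "pig") and the cutoff of 2 are made up for illustration.

```python
def find_supernodes(children, min_children):
    """Nodes with more than min_children children."""
    return {n for n, kids in children.items() if len(kids) > min_children}


def shared_supernodes(a, b, parents, supernodes):
    """Supernodes that are parents of both a and b."""
    return supernodes & set(parents.get(a, [])) & set(parents.get(b, []))


# Toy data, invented for this example only.
CHILDREN = {"omnivore": ["human", "ape", "pig"], "hominid": ["human"]}
PARENTS = {"human": ["omnivore", "hominid"], "ape": ["omnivore"]}

supernodes = find_supernodes(CHILDREN, min_children=2)
area = shared_supernodes("human", "ape", PARENTS, supernodes)
```

If both search terms share a supernode, the crawl could start inside that 'area' instead of walking the whole tree.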

JarbasAI commented 7 years ago

this generic function is now implemented; other crawling strategies and improvements should be opened as new issues

commit https://github.com/ElliotTheRobot/LILACS-mycroft-core/commit/ade8f29483b6f3a6cf096cf3ffe064939be17ae5

test case outputs

2017-04-07 19:37:40,230 - CLIClient - INFO - Speak: answer to is joana a frog is False
2017-04-07 19:37:40,230 - CLIClient - INFO - Speak: answer to is joana a animal is True
2017-04-07 19:37:40,231 - CLIClient - INFO - Speak: answer to is joana a mammal is True
2017-04-07 19:37:40,240 - CLIClient - INFO - Speak: answer to is joana alive is True

with following crawl logs

2017-04-07 19:53:57,564 - Skills - INFO - start node: joana
2017-04-07 19:53:57,566 - Skills - INFO - target node: mammal
2017-04-07 19:53:57,566 - Skills - INFO - next: human
2017-04-07 19:53:57,566 - Skills - INFO - choosing next node
2017-04-07 19:53:57,571 - Skills - INFO - crawled nodes: ['joana', 'human']
2017-04-07 19:53:57,571 - Skills - INFO - uncrawled nodes: ['female', 'mammal', 'animal']
2017-04-07 19:53:57,571 - Skills - INFO - next: animal
2017-04-07 19:53:57,571 - Skills - INFO - choosing next node
2017-04-07 19:53:57,571 - Skills - INFO - crawled nodes: ['joana', 'human', 'animal']
2017-04-07 19:53:57,572 - Skills - INFO - uncrawled nodes: ['female', 'mammal', 'alive']
2017-04-07 19:53:57,572 - Skills - INFO - next: alive
2017-04-07 19:53:57,572 - Skills - INFO - choosing next node
2017-04-07 19:53:57,572 - Skills - INFO - crawled nodes: ['joana', 'human', 'animal', 'alive']
2017-04-07 19:53:57,573 - Skills - INFO - uncrawled nodes: ['female', 'mammal']
2017-04-07 19:53:57,573 - Skills - INFO - next: mammal
2017-04-07 19:53:57,573 - Skills - INFO - choosing next node
2017-04-07 19:53:57,573 - Skills - INFO - crawled nodes: ['joana', 'human', 'animal', 'alive', 'mammal']
2017-04-07 19:53:57,573 - Skills - INFO - uncrawled nodes: ['female']

ElliotTheRobot / LILACS-mycroft-core

Concept Crawler #9