chatopera / Synonyms

:herb: 中文近义词:聊天机器人,智能问答工具包
https://bot.chatopera.com/
Other
5.03k stars 902 forks source link

Use in-memory graph struct for storing and quering data #7

Closed hailiang-wang closed 4 years ago

hailiang-wang commented 7 years ago

description

目前使用python dict, list 来存储,在建立多个词之间关系的时候,效率低。

solution

将word's vector, adjacent words and their distance, 存储在Graph中。

Possible solution: 方案1: graphlite https://pypi.python.org/pypi/graphlite

方案2: projx http://davebshow.github.io/projx/getting-started/

方案3:networkx https://pypi.python.org/pypi/networkx/

image

支持 load 和 dump 文件 pickle 文件。 采用哪种格式? https://networkx.github.io/documentation/networkx-1.9.1/reference/readwrite.html

安装是否方便? 保持从 pip 安装,依赖少,跨平台

是否支持高级查询?比如 cypher

性能怎么样?

hailiang-wang commented 7 years ago

question with networkx for C codes

https://github.com/networkx/networkx/issues/2747

huyingxi commented 7 years ago

networkx graph performance: (testing on 10W vocabulary size, data structure: Graph vs. Dictionary)

1) . Directed graph vs. Dictionary : ('current vocab size : ', 125792) ('graph time 100000 : ', 0.2830028533935547) ('graph neighbors number : ', 10) ('dict time 100000 : ', 0.013113021850585938) ('dict neighbors number : ', 10)

2) .Undirected graph vs. Dictionary : ('current vocab size : ', 125792) ('graph time 100000 : ', 0.2830028533935547) ('graph neighbors number : ', 32) ('dict time 100000 : ', 0.012874603271484375) ('dict neighbors number : ', 10)

So, we can get two conclusions: (1) Graph is slower than dictionary, and the time consumption of the directed graph and the undirected graph is similar. (2) Undirected graph may get an indefinite number of synonymous.

hailiang-wang commented 7 years ago

So, maybe for the low level APIs in Synonyms, we just use dict for quick access. But networkx provides many fancy APIs for upper level tasks like textsum and NER, such as pagerank and textrank. https://networkx.github.io/documentation/latest/tutorial.html

Unless networkx can achieve better performance, we just rely on dict.

Another option is using levelDB. It may bring more inconvenience with installation, but it may help with both performance and abilities.

huyingxi commented 7 years ago

I read the source code of NetworkX where is about finding neighbors, as following:


    def successors_iter(self,n):
        """Return an iterator over successor nodes of n.

        neighbors_iter() and successors_iter() are the same.
        """
        try:
            return iter(self.succ[n])
        except KeyError:
            raise NetworkXError("The node %s is not in the digraph."%(n,))

    def predecessors_iter(self,n):
        """Return an iterator over predecessor nodes of n."""
        try:
            return iter(self.pred[n])
        except KeyError:
            raise NetworkXError("The node %s is not in the digraph."%(n,))

    def successors(self, n):
        """Return a list of successor nodes of n.

        neighbors() and successors() are the same function.
        """
        return list(self.successors_iter(n))

    def predecessors(self, n):
        """Return a list of predecessor nodes of n."""
        return list(self.predecessors_iter(n))

    # digraph definitions
    neighbors = successors
    neighbors_iter = successors_iter

So I guess there are two reasons for the slow query : (1) twice type conversion operation , (2)Multiple function calls

And then I just modified the code like this :

    def neighbors(self, n):
        try:
            return self.succ[n]
        except KeyError:
            raise NetworkXError("The node %s is not in the digraph."%(n,))

It runs faster, only two times slower than Dict. However, I think they both run fast enough.

hailiang-wang commented 7 years ago

Great job, does the modifications have side impacts for other functions in networks?

We may borrow the codes of networkx and make the changes right now and contribute back afterwards.

tyrinwu commented 7 years ago

https://github.com/snap-stanford/snap

FYI, this library has a C++ interface.

hailiang-wang commented 7 years ago

@tyrinwu thanks.