Closed hailiang-wang closed 4 years ago
networkx graph performance: (testing on 10W vocabulary size, data structure: Graph vs. Dictionary)
1) . Directed graph vs. Dictionary : ('current vocab size : ', 125792) ('graph time 100000 : ', 0.2830028533935547) ('graph neighbors number : ', 10) ('dict time 100000 : ', 0.013113021850585938) ('dict neighbors number : ', 10)
2) .Undirected graph vs. Dictionary : ('current vocab size : ', 125792) ('graph time 100000 : ', 0.2830028533935547) ('graph neighbors number : ', 32) ('dict time 100000 : ', 0.012874603271484375) ('dict neighbors number : ', 10)
So, we can get two conclusions: (1) Graph is slower than dictionary, and the time consumption of the directed graph and the undirected graph is similar. (2) Undirected graph may get an indefinite number of synonymous.
So, maybe for the low level APIs in Synonyms, we just use dict for quick access. But networkx provides many fancy APIs for upper level tasks like textsum and NER, such as pagerank and textrank. https://networkx.github.io/documentation/latest/tutorial.html
Unless networkx can achieve better performance, we just rely on dict.
Another option is using levelDB. It may bring more inconvenience with installation, but it may help with both performance and abilities.
I read the source code of NetworkX where is about finding neighbors, as following:
def successors_iter(self,n):
"""Return an iterator over successor nodes of n.
neighbors_iter() and successors_iter() are the same.
"""
try:
return iter(self.succ[n])
except KeyError:
raise NetworkXError("The node %s is not in the digraph."%(n,))
def predecessors_iter(self,n):
"""Return an iterator over predecessor nodes of n."""
try:
return iter(self.pred[n])
except KeyError:
raise NetworkXError("The node %s is not in the digraph."%(n,))
def successors(self, n):
"""Return a list of successor nodes of n.
neighbors() and successors() are the same function.
"""
return list(self.successors_iter(n))
def predecessors(self, n):
"""Return a list of predecessor nodes of n."""
return list(self.predecessors_iter(n))
# digraph definitions
neighbors = successors
neighbors_iter = successors_iter
So I guess there are two reasons for the slow query : (1) twice type conversion operation , (2)Multiple function calls
And then I just modified the code like this :
def neighbors(self, n):
try:
return self.succ[n]
except KeyError:
raise NetworkXError("The node %s is not in the digraph."%(n,))
It runs faster, only two times slower than Dict. However, I think they both run fast enough.
Great job, does the modifications have side impacts for other functions in networks?
We may borrow the codes of networkx and make the changes right now and contribute back afterwards.
https://github.com/snap-stanford/snap
FYI, this library has a C++ interface.
@tyrinwu thanks.
description
目前使用python dict, list 来存储,在建立多个词之间关系的时候,效率低。
solution
将word's vector, adjacent words and their distance, 存储在Graph中。
Possible solution: 方案1: graphlite https://pypi.python.org/pypi/graphlite
方案2: projx http://davebshow.github.io/projx/getting-started/
方案3:networkx https://pypi.python.org/pypi/networkx/
支持 load 和 dump 文件 pickle 文件。 采用哪种格式? https://networkx.github.io/documentation/networkx-1.9.1/reference/readwrite.html
安装是否方便? 保持从 pip 安装,依赖少,跨平台
是否支持高级查询?比如 cypher
性能怎么样?