Closed by dyusuf 2 years ago
Reopening issue.
@bornabesic It took around 10 minutes to complete a Neo4j query with ~3k proteins that have ~26k associations. That is too slow. Any input?
The Neo4j documentation offers some tips for performance tuning. I remember that increasing the limit for the number of open files used to help. We should also be aware that, theoretically, the maximum number of associations for 3k proteins is 4.5M. Therefore, tweaking or rewriting the query itself in a smarter way might result in a performance boost.
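The 4.5M figure follows from counting unordered protein pairs, n(n-1)/2. A quick check in Python, as a minimal sketch: the function name max_associations is made up for illustration, and the 26k value is the approximate count reported above.

```python
from math import comb


def max_associations(n_proteins: int) -> int:
    """Upper bound on undirected associations: the number of unordered pairs."""
    return comb(n_proteins, 2)


n = 3000
upper_bound = max_associations(n)  # 3000 * 2999 / 2
observed = 26_000                  # approximate count reported in this thread

print(upper_bound)
print(f"fraction of possible pairs present: {observed / upper_bound:.4%}")
```

So the result set is small compared to the space of pairs the query could end up exploring if it is planned badly.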
@bornabesic @f2010126
MATCH (source:Protein)-[association:ASSOCIATION]->(target:Protein)
WHERE source.id IN {protein_ids} AND target.id IN {protein_ids} AND association.combined >= {threshold}
RETURN source, target, association.combined AS score
The query looks pretty simple and straightforward. Any room for further tuning?
increasing the limit for the number of open files
@bornabesic Where should this be adjusted? I recall you talked about it before, but I cannot remember the details.
@f2010126 Please follow up with Borna. This performance issue is critical: it undermines the case for using a graph database over an SQL database.
@f2010126 https://reactome.org/dev/graph-database
@dyusuf
Where should this be adjusted? I recall you talked about it before, but I cannot remember the details.
All the details are explained in the link to the part of Neo4j documentation I provided.
The query looks pretty simple and straightforward. Any room for further tuning?
I think the main pitfall is the matching of protein IDs.
I can bet that the IN operator performs a linear search in WHERE source.id IN {protein_ids} AND target.id IN {protein_ids}. That means for n proteins in the query and m proteins in the database, the search time complexity is O(nm), and that is a lot of comparisons to make. I am not sure if Neo4j supports any data structure (like a set, for example) to speed up the lookup.
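To make the complexity argument concrete, here is a minimal Python sketch that contrasts a linear membership test with a hashed one. It only illustrates the O(nm) reasoning; it says nothing about how Neo4j actually evaluates IN, which remains a guess in this thread. The function names and the protein IDs are made up for the example.

```python
def filter_linear(db_ids, query_ids):
    """Keep database IDs that appear in query_ids using list membership.

    Each `in` check on a list scans up to n elements, so the whole
    filter costs O(n * m) comparisons.
    """
    return [pid for pid in db_ids if pid in query_ids]


def filter_hashed(db_ids, query_ids):
    """Same result, but with a set: O(n) to build, O(1) average per lookup."""
    wanted = set(query_ids)
    return [pid for pid in db_ids if pid in wanted]


# Made-up identifiers standing in for protein IDs.
db_ids = [f"P{i}" for i in range(2000)]  # m proteins in the database
query_ids = db_ids[::7]                  # n proteins in the query

# Both strategies select the same proteins; only the cost differs.
assert filter_linear(db_ids, query_ids) == filter_hashed(db_ids, query_ids)
```

Inside Neo4j, the usual way to get a comparable effect is an index on the property being matched. Whether such an index exists on Protein.id in this database is not stated in the thread.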
Profiling the query in Neo4j gives the following plan. The Expand and CartesianProduct operations have the most db hits and would need to be addressed.
Reference Links:
https://neo4j.com/docs/cypher-manual/current/query-tuning/
https://neo4j.com/blog/tuning-cypher-queries/
A list of 3000 proteins takes up to 10 min to display. The bulk of the time, ~9.5 min, is spent fetching data from the Neo4j database.
Time to Fetch Data: 502.1668871660004 seconds
The read operation timed out
Time to Process Gephi: 12.358108850001372 seconds
Time to Process Graph: 0.5501226920005138 seconds
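For reference, the share of each stage can be worked out from the logged numbers. This is a small sketch that uses only the three timings above; the dictionary keys and the function name stage_shares are illustrative, and any time spent outside these three stages is not accounted for.

```python
# Stage timings in seconds, copied from the log above.
timings = {
    "Fetch Data": 502.1668871660004,
    "Process Gephi": 12.358108850001372,
    "Process Graph": 0.5501226920005138,
}


def stage_shares(stage_seconds):
    """Return each stage's fraction of the summed stage time."""
    total = sum(stage_seconds.values())
    return {stage: seconds / total for stage, seconds in stage_seconds.items()}


shares = stage_shares(timings)
for stage, seconds in timings.items():
    print(f"{stage}: {seconds / 60:.1f} min ({shares[stage]:.1%} of measured time)")
```

The fetch stage accounts for well over nine tenths of the measured time, so the query is the place to optimize first.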