BackofenLab / AxoWise

An integrated graph system of biological entities, functional terms, and publications.
Apache License 2.0

Performance Improvement for displaying large number of proteins #11

Closed (dyusuf closed 2 years ago)

dyusuf commented 4 years ago

A list of 3000 proteins takes up to 10 min to display. The bulk of the time (~9.5 min) is spent fetching data from the Neo4j database.

Time to Fetch Data: 502.1668871660004 seconds
The read operation timed out
Time to Process Gephi: 12.358108850001372 seconds
Time to Process Graph: 0.5501226920005138 seconds
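To make the breakdown concrete, here is a minimal sketch that computes each stage's share of the total from the three timings in the log above. The stage names and the helper `stage_shares` are hypothetical and only serve this illustration; the numbers are copied from the log.

```python
# Timings copied from the log in this comment (seconds).
# The dictionary keys are illustrative names, not identifiers from the project.
timings = {
    "fetch_data": 502.1668871660004,
    "process_gephi": 12.358108850001372,
    "process_graph": 0.5501226920005138,
}


def stage_shares(timings):
    """Return each stage's share of the summed runtime as a fraction."""
    total = sum(timings.values())
    return {stage: seconds / total for stage, seconds in timings.items()}


shares = stage_shares(timings)
for stage, share in shares.items():
    print(f"{stage}: {timings[stage]:.1f} s ({share:.1%})")
```

With these numbers the fetch step accounts for roughly 97% of the logged time, so the database query is the place to optimize first.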

f2010126 commented 4 years ago

3kTestSet.txt

Reopening issue.

dyusuf commented 4 years ago

@bornabesic It took around 10 min to complete a Neo4j query with ~3k proteins that have ~26k associations. That is too slow. Any input?

bornabesic commented 4 years ago

Neo4j documentation offers some tips for performance tuning. I remember that increasing the limit on the number of open files helped. We should also be aware that, theoretically, the maximum number of associations for 3k proteins is about 4.5M (n(n-1)/2 pairs for n = 3000). Therefore, tweaking or rewriting the query itself in a smarter way might result in a performance boost.
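The 4.5M figure can be checked with a short calculation. This is only a sketch: the function names are hypothetical, and whether the database stores each association once or once per direction is an assumption the thread does not settle, so both counts are shown.

```python
import math


def max_undirected_pairs(n):
    """Number of unordered protein pairs: n * (n - 1) / 2."""
    return math.comb(n, 2)


def max_directed_pairs(n):
    """Number of ordered protein pairs, if each direction is stored separately."""
    return n * (n - 1)


n = 3000
print(max_undirected_pairs(n))  # 4498500, i.e. about 4.5M
print(max_directed_pairs(n))    # 8997000, i.e. about 9M
```

The ~26k associations reported for the test set are therefore a small fraction of the theoretical maximum.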

dyusuf commented 4 years ago

@bornabesic @f2010126

MATCH (source:Protein)-[association:ASSOCIATION]->(target:Protein)
WHERE source.id IN {protein_ids} AND target.id IN {protein_ids} AND association.combined >= {threshold}
RETURN source, target, association.combined AS score

The query looks pretty simple and straightforward. Any room for further tuning?

> increasing the limit for the number of open files

@bornabesic where to adjust this accordingly? I recall you talked about it before. Well, I can not remember the details.

@f2010126 Please follow up with Borna. This performance issue is critical: it undermines the case for using a graph database over an SQL database.

dyusuf commented 3 years ago

@f2010126 https://reactome.org/dev/graph-database

bornabesic commented 3 years ago

@dyusuf

> where to adjust this accordingly? I recall you talked about it before. Well, I can not remember the details.

All the details are explained in the link to the part of Neo4j documentation I provided.

> The query looks pretty simple and straightforward. Any room for further tuning?

I think the main pitfall is the matching of protein IDs. I would bet that the IN operator performs a linear search in WHERE source.id IN {protein_ids} AND target.id IN {protein_ids}. That means that for n proteins in the query and m proteins in the database, the search time complexity is O(nm), which is a lot of comparisons to make. I am not sure whether Neo4j supports any data structure (a set, for example) to speed up the lookup.
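The complexity argument can be illustrated outside the database. The sketch below applies the same filter as the query to an in-memory edge list, using a Python set so that each membership test is constant time on average instead of a scan over the ID list. It says nothing about how Neo4j actually evaluates IN; the function name and the sample data are hypothetical.

```python
def filter_associations(edges, protein_ids, threshold):
    """Keep edges whose endpoints are both requested and whose score passes.

    edges: iterable of (source_id, target_id, combined_score) tuples.
    protein_ids: requested IDs; converted to a set once, so each lookup is
    O(1) on average rather than O(n) for a list scan.
    """
    wanted = set(protein_ids)
    return [
        (source, target, score)
        for source, target, score in edges
        if source in wanted and target in wanted and score >= threshold
    ]


# Made-up sample data, for illustration only.
edges = [
    ("P1", "P2", 900),
    ("P1", "P3", 400),
    ("P2", "P4", 950),
    ("P3", "P2", 700),
]
result = filter_associations(edges, ["P1", "P2", "P3"], threshold=700)
print(result)  # [('P1', 'P2', 900), ('P3', 'P2', 700)]
```

With a set, filtering costs time proportional to the number of edges plus the number of requested IDs, rather than their product.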

f2010126 commented 3 years ago

On profiling the query in Neo4j, the following plan is obtained. The Expand and CartesianProduct operations have the most db hits and would need to be addressed. Reference links:
https://neo4j.com/docs/cypher-manual/current/query-tuning/
https://neo4j.com/blog/tuning-cypher-queries/

[Screenshot of the profiled query plan, 2020-12-16 at 18:31:22]