Mayil-AI-Sandbox / kuzudb_jan15


Counting out_neighbours crashes #63

Open meido opened 7 months ago

meido commented 7 months ago

Using the Pokec dataset (https://snap.stanford.edu/data/soc-pokec.html), which has about 1.6m nodes and 30m edges, we test loading time and how long it takes to count the degrees, out_neighbours, etc.

This query crashes:

MATCH (u:User)-[:Follows]->(n:User) RETURN u.id, COLLECT(n.id) AS out_neighbours

Obtaining the data per node, however, works:

for user_id in range(1, max_user_id + 1):
    query = f"MATCH (u:User {{id: {user_id}}})-[:Follows]->(n:User) RETURN u.id, COLLECT(n.id) AS out_neighbours"
    res = self.conn.execute(query)
    df = res.get_as_df()

mayil-ai[bot] commented 7 months ago

Summary: Query to count out_neighbours crashes when executed on a large dataset with 1.6m nodes and 30m edges.

Possible Solution

Based on the provided information, the query that collects out_neighbours for all User nodes at once likely crashes because the entire result set, one neighbour list per user covering roughly 30m edges in total, must be materialized in memory. The working snippet shows that executing the same query for each User node individually does not cause a crash.
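If the end goal is only to count the out_neighbours, as the issue title suggests, rather than to collect their ids, one workaround is to aggregate with COUNT instead of COLLECT, so each result row carries a single integer rather than a list. A minimal sketch, assuming the same `self.conn` connection used in the snippets above:

degree_query = """
MATCH (u:User)-[:Follows]->(n:User)
RETURN u.id, COUNT(n.id) AS out_degree
"""
# Each row is now (id, integer) instead of (id, list), so the full
# result set stays small even with ~1.6m users.
res = self.conn.execute(degree_query)
degrees_df = res.get_as_df()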

To address the issue, consider paginating the query so that only a bounded number of result rows is materialized at a time. Here is an example of how you might modify the query to include pagination:

page_size = 1000  # Adjust the page size based on memory constraints and performance
skip = 0
while True:
    # ORDER BY makes SKIP/LIMIT paging deterministic; without it,
    # pages may overlap or miss rows.
    paginated_query = f"""
    MATCH (u:User)-[:Follows]->(n:User)
    RETURN u.id, COLLECT(n.id) AS out_neighbours
    ORDER BY u.id
    SKIP {skip} LIMIT {page_size}
    """
    res = self.conn.execute(paginated_query)
    df = res.get_as_df()
    if df.empty:
        break
    # Process the DataFrame `df` as needed
    skip += page_size
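Note that SKIP/LIMIT paging still makes the engine evaluate and discard all earlier rows on every iteration. An alternative is to batch by id ranges, which pushes the filter into the match itself. A sketch under the same assumption the original per-node loop makes, namely that User ids are integers in [1, max_user_id]:

batch_size = 1000
for start in range(1, max_user_id + 1, batch_size):
    end = start + batch_size - 1
    # The range predicate bounds how many users (and neighbour lists)
    # are materialized per query.
    query = f"""
    MATCH (u:User)-[:Follows]->(n:User)
    WHERE u.id >= {start} AND u.id <= {end}
    RETURN u.id, COLLECT(n.id) AS out_neighbours
    """
    res = self.conn.execute(query)
    df = res.get_as_df()
    # Process the batch `df` here.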
