apache / age

Graph database optimized for fast analysis and real-time data processing. It is provided as an extension to PostgreSQL.
https://age.apache.org
Apache License 2.0

Performance issues with edge creation for larger dataset with 83K edges #1925

Open dianaoa22 opened 2 months ago

dianaoa22 commented 2 months ago

**Describe the bug**
I am creating 83K vertices and 83K edges by parsing a JSON file with Java. Vertex creation takes about 1 minute; edge creation takes around 1 hour or more.

**How are you accessing AGE (Command line, driver, etc.)?**
Via a Java program; see the steps below.

**What data setup do we need to do?**


Steps:

1. The attached program reads a JSON file, builds a graph structure in memory from the JSON content, and then generates and executes Cypher queries to create that structure in a PostgreSQL database with the Apache AGE extension.
2. It constructs the graph by recursively traversing the JSON structure and creating vertices and edges accordingly.
3. The code snippet is attached; the input JSON cannot be shared.
4. I optimized with batch execution but saw no improvement; the load still takes about 1 hour or more, whereas the same load in Neo4j takes 14 minutes. (A simplified sketch of the per-edge pattern is shown after the attachment link below.)
[Main.txt](https://github.com/user-attachments/files/15900495/Main.txt)
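
Since the input cannot be shared, here is a minimal, simplified sketch of the kind of per-edge statement the program generates (`Item` is the vertex label used later in this thread; the edge label `HAS_CHILD` and the id values are stand-ins, not the real data):

```sql
-- Hypothetical shape of one generated edge-creation statement.
-- Each statement looks up both endpoint vertices by their "id"
-- property before creating the edge; if the planner cannot use an
-- index for that lookup, each of the 83K statements scans the
-- vertex table twice.
SELECT * FROM ag_catalog.cypher('ms_graph', $$
    MATCH (a:Item {id: 'parent-uuid'}), (b:Item {id: 'child-uuid'})
    CREATE (a)-[:HAS_CHILD]->(b)
$$) AS (result ag_catalog.agtype);
```

Batching amortizes round trips and commit overhead, but it does not change this per-MATCH lookup cost.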

**What configuration info is needed?**
- 

**What is the command that caused the error?**


**Expected behavior**
Execution time should be lower, at least close to Neo4j's.

**Environment (please complete the following information):**
- Java 21; PostgreSQL 16.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10), 64-bit

**Additional context**
MironAtHome commented 2 months ago

Do you have indexes on the fields used to create the edge relationships? I tried to access the linked file to look it over, but apparently couldn't get to it.

dianaoa22 commented 2 months ago

Code.txt attached. Yes, I have created a GIN index, since the MATCH query uses `id` inside the properties:

```sql
CREATE INDEX load_item_gin_idx ON ms_graph."Item" USING gin (properties);
CREATE INDEX
```

A sample vertex looks like this:

```
{ "id": 844424930132046, "properties": { "id": "3bd6197582434a01b340b889b357f3a4", "label": "xxxx", "properties": "{path: "/c/e", name: "test", s-id: "21977", description: "test", seattr: "2000", state: "enable", customer: "13"}" } }
```

MironAtHome commented 1 month ago

Here is an index structure I found helpful:

```sql
CREATE INDEX [CONCURRENTLY] <index_name> ON "<schema/graph_name>"."<vertice_table_name>"
    (ag_catalog.agtype_access_operator(properties, '"<property_name>"'::agtype));
```

Please replace `<index_name>`, `<schema/graph_name>`, `<vertice_table_name>`, and `<property_name>` with applicable values. This way the index will assist in matching on a specific field in the property set, unless you need to match on all properties. This is first.

Secondly, I looked over the weekend and found a few tools that are fairly common across RDBMS systems, used to incrementally improve performance by looking at which part of a query performs slowly and then seeing how to improve it (usually it is still an index, but we will see). It's not something that works immediately on a good educated guess based on experience, the way applying indexes does; it's more of a journey. I was hoping this systematic approach would be of interest to explore. If this sounds acceptable, please read on.

There are a few troubleshooting approaches I learned over the last weekend. Please let me know if you are open to installing extensions that can expose detailed wait information about where the query spends its time, and to turning on some of the troubleshooting options: this, this and this. Feel free to add a comment here. Please don't just install these extensions; first check that they match your Postgres version, and see whether there are (entirely possible) extensions maintained by Postgres itself, or even built-in functions, that provide the same level of troubleshooting. This first step is just to ensure proper tooling is in place to identify the slower parts of the query before attempting a fix. These extensions will not work with Postgres in the cloud at all, so please read carefully and decide whether this is the right path to troubleshoot query performance in your case. And only if it is, please confirm, so that we can look at the next step (if you would like assistance).
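
To make the template concrete for this thread, here is a hedged instantiation (the graph name `ms_graph`, label `Item`, and property `id` are assumed from the earlier comments; the index name is made up), together with a timing check:

```sql
-- Expression index matching equality lookups on properties.id.
CREATE INDEX load_item_id_idx ON ms_graph."Item"
    (ag_catalog.agtype_access_operator(properties, '"id"'::agtype));

-- Verify the index is used and see where the time goes.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM ag_catalog.cypher('ms_graph', $$
    MATCH (a:Item) WHERE a.id = '3bd6197582434a01b340b889b357f3a4'
    RETURN a
$$) AS (a ag_catalog.agtype);
```

On the native-tooling point: pg_stat_statements ships with Postgres itself and aggregates per-statement timing, which covers much of the same ground as third-party wait-event extensions.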

dianaoa22 commented 1 month ago

This will take some time; I will let you know once I have the data. Thanks for the suggestions.

MironAtHome commented 1 month ago

No problem at all. One thing I was looking for is a sample config.json file, as it would help me get to some test data faster.

dianaoa22 commented 1 month ago

```python
import json

# Original JSON data
data = { "root": { "child1": { "name": "test", "location": "test" }, "child2": [ { "child2-class": "classname", "child2-number": 1 } ], "child3": [ { "child4": [ { "child4-pref": 1, "child4-type": "type-schild5" }, { "child4-type": "type-schild5", "child4-pref": 2 } ], "child5": [ { "child5-number": 1, "entryp": { "econtrol": { "ploc": "ploc-10001", "state": "enable" } }, "inside": { "mgmt": { "state": "enable" } } } ], "child3-type": "new-child53", "pref-number": 1 }, { "child3-type": "test-schild5-c", "pref-number": 2, "child4": [ { "child4-pref": 1 }, { "child4-pref": 2 } ], "child5": [ { "entryp": { "econtrol": { "ploc": "ploc-10001", "state": "enable" } }, "child5-number": 1, "inside": { "mgmt": { "state": "enable" } } } ] }, { "child3-type": "new-child53", "pref-number": 3, "child4": [ { "child4-type": "type-schild5", "child4-pref": 1 }, { "child4-type": "type-schild5", "child4-pref": 2 } ], "child5": [ { "inside": { "mgmt": { "state": "enable" } }, "child5-number": 1 } ] }, { "child4": [ { "child4-type": "type-schild5", "child4-pref": 1 }, { "child4-pref": 2, "child4-type": "type-schild5" } ], "child5": [ { "inside": { "mgmt": { "state": "enable" } }, "child5-number": 1 } ], "child3-type": "new-child53", "pref-number": 4 }, { "child3-type": "new-child53", "pref-number": 5, "child4": [ { "child4-pref": 1, "child4-type": "type-schild5" } ], "child5": [ { "inside": { "mgmt": { "state": "enable" }, "band": { "band-no": "NQP-10001" } }, "child5-number": 1 } ] }, { "child4": [ { "child4-type": "type-schild5", "child4-pref": 1 }, { "child4-type": "type-schild5", "child4-pref": 2 } ], "child5": [ { "inside": { "mgmt": { "state": "enable" } }, "child5-number": 1, "entryp": { "econtrol": { "state": "enable", "ploc": "ploc-10001" } } } ], "child3-type": "new-child53", "pref-number": 6 }, { "child3-type": "new-child53", "pref-number": 8, "child4": [ { "child4-type": "type-schild5", "child4-pref": 1 } ], "child5": [ { "inside": { "mgmt": { "state": "enable" }, "net": { "test": "FRL-10001" } }, "child5-number": 1 } ] } ] } }

def add_entries(data, num_entries):
    base_child3 = {
        "child3-type": "new-child53",
        "child4": [],
        "child5": [],
        "pref-number": None  # Will be set dynamically
    }

    base_child4 = {
        "child4-pref": None,  # Will be set dynamically
        "child4-type": "type-schild5"
    }

    base_child5 = {
        "child5-number": None,  # Will be set dynamically
        "entryp": {
            "econtrol": {
                "ploc": "ploc-10001",
                "state": "enable"
            }
        },
        "inside": {
            "mgmt": {
                "state": "enable"
            }
        }
    }

    # Generate new child3 entries
    new_child3s = []
    for i in range(1, num_entries + 1):
        child3 = base_child3.copy()
        child3['pref-number'] = i

        # Generate child4 entries for the child3
        child4_entries = []
        for j in range(1, 5):  # 4 child4 entries per child3 for example
            child4 = base_child4.copy()
            child4['child4-pref'] = j
            child4_entries.append(child4)
        child3['child4'] = child4_entries

        # Generate child5 entries for the child3
        child5_entries = []
        for j in range(1, 5):  # 4 child5 entries per child3 for example
            child5 = base_child5.copy()  # shallow copy: nested dicts are shared, fine for JSON output
            child5['child5-number'] = j
            child5_entries.append(child5)
        child3['child5'] = child5_entries

        new_child3s.append(child3)

    # Add new child3s to the data
    data['root']['child3'].extend(new_child3s)

# Specify the number of entries to add
num_entries = 100000

# Add the entries
add_entries(data, num_entries)

# Print the total number of child3 entries to confirm
print(f"Total child3 entries: {len(data['root']['child3'])}")

# Optional: Save the updated data to a file
with open('big_data.json', 'w') as file:
    json.dump(data, file, indent=4)

print("Data added successfully and saved to big_data.json")
```

Run this Python program to generate a big data set; I am unable to attach the JSON directly due to the 25 MB size restriction.

MironAtHome commented 1 month ago

Thank you
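
One more avenue that may be worth testing once the generated data is in hand: recent AGE versions ship CSV bulk-load helpers that bypass per-statement Cypher entirely. A rough sketch, assuming big_data.json has been flattened into CSV files whose layout follows the AGE loader documentation for your version (the graph name, file paths, and edge label `HAS_CHILD` are placeholders):

```sql
LOAD 'age';
SET search_path = ag_catalog, "$user", public;

-- Bulk-load vertices first, then edges, from pre-built CSV files.
SELECT create_graph('ms_graph_bulk');
SELECT load_labels_from_file('ms_graph_bulk', 'Item', '/tmp/items.csv');
SELECT load_edges_from_file('ms_graph_bulk', 'HAS_CHILD', '/tmp/edges.csv');
```

If these loaders are available in your build, they are usually much faster than 83K individual CREATE statements, even batched.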