Graph initial load speed via put is extremely slow

jpbrown-15 commented 1 year ago

This is a question, not an issue. I'm following the basic documentation for creating a graph using put to populate the graph:

# We have a filename and path - begin parsing the XML file
    tree = ET.parse(filename)

    root = tree.getroot()

    datagraph = Graph("graph1")
    for child in root:
        if not re.search("OpenPositions$", child.tag):  # skip elements we don't care about
            print('children of root: ',child.tag)
            tradeid = child.find("TradeID")
            marketstate = child.find("MarketState")
            strategy = child.find("Strategy")
            entrydate = child.find("EntryDate")
            exitdate = child.find("ExitDate")
            pctlossgain = child.find("GainLossPercent")
            print('Trade: ', tradeid.text, marketstate.text, strategy.text, entrydate.text, exitdate.text, pctlossgain.text)

            # populate the graph db
            datagraph.put(child.tag,"hastrade",tradeid.text)
            datagraph.put(tradeid.text,"marketstate",marketstate.text)
            datagraph.put(tradeid.text,"strategy",strategy.text)
            datagraph.put(tradeid.text,"entrydate",entrydate.text)
            datagraph.put(tradeid.text,"exitdate",exitdate.text)
            datagraph.put(tradeid.text,"pctlossgain",pctlossgain.text)

The input file for the tree is an XML document being parsed by ElementTree and contains around 3 million rows (114MB file). Of those 3 million rows, I need data from about 100,000 rows. Without populating the graph db, the code will rip through and print out the data in under 10 seconds. However, populating the graph has taken approximately 3 hours. The largest of the put statements is approximately 51 bytes in size. Are there faster ways to populate the db? I'm looking for a method that would load the graph in under a minute.

Also, once the graph is created, the data is on disk. To access it in the future (such as the next day or a week from now), do I use the same Graph(graph_name) to connect to it and begin using it again? Sorry if it is in the doc, I just couldn't find it.

Thanks for creating Cog.

arun1729 commented 1 year ago

Thank you for your interest in the project! And also for taking the time to provide a detailed explanation of the issue you are facing. For large datasets this may indeed be an issue, I will run a test to see how much time it takes for 3 million puts.

What queries are you planning to do on your data? If it is a simple (or a small chain of) key value lookups, you could probably use the low level API which is much faster than the graph API. The limitation is that you can only use key value look up for querying. Here is an example:

from cog.core import Record
from cog.database import Cog
import os
from cog import config

# Set up database path
db_path = '/tmp/cogtestdb'
os.makedirs(db_path, exist_ok=True)  # Create directory if it doesn't exist
config.CUSTOM_COG_DB_PATH = db_path  # Set the custom database path in cog's config

# Create an instance of the Cog class (establish a connection to the database)
cogdb = Cog()

# Create or load a namespace (a container for tables)
cogdb.create_or_load_namespace("my_namespace")

# Create a new table in the specified namespace
cogdb.create_table("new_db", "my_namespace")

# Store records in the active table
# Record should be an instance of the Record class from cog.core
cogdb.put(Record('A', 'val'))
cogdb.put(Record('B', 'val'))
cogdb.put(Record('key3', 'val'))
cogdb.put(Record('key3', 'val_updated'))  # Updating a value for a given key is as simple as putting a new record with that key

# Retrieve a record by its key from the active table
record = cogdb.get('key3')
print(record.key, record.value)  # Outputs: key3 val_updated

# Close the database connection when done
cogdb.close()

Also, once the graph is created, the data is on disk. To access it in the future (such as the next day or a week from now), do I use the same Graph(graph_name) to connect to it and begin using it again? Sorry if it is in the doc, I just couldn't find it.

Yes, for loading the graph again you just need to run Graph(graph_name).

jpbrown-15 commented 1 year ago

I have to say that this is a very interesting project.

I need to amend some of my original statements - I hadn't let the process complete before creating this issue.

For clarity, the 3 million rows contain fewer elements (less than 120,000 xml elements). Of those, I'm only interested in 114,278 elements. To rip through the file and not create the graph, it runs through the XML in under 10 seconds, printing out the data of interest. To create the graph, it takes 8.5 hours.

Each element of interest generates 6 puts, so in total, that is 685,668 puts being issued. 8.5 hours to issue the 685k of puts is approximately 22.4 puts per second. You'll see from the original code "tradeid" vertex has essentially 5 attributes being added to it. The initial XML file is approximately 114MB. The resulting graph is 210MB.

Here is a link to a zipped version of the data file I used.

Here's my use case. I'm building a Monte Carlo Simulation app. Monte Carlo Simulations are used to randomize events (in the world of stock trading) to assess the likelihood of success if trades happened in a different sequence. So, my queries will be to randomly select a "tradeid" from within the graph from between 2 dates, using the EntryDate or ExitDate and then using the PctLossGain for a calculation. Likely I will be pulling the subset of trades into memory that meet the criteria instead of hitting the graph for each piece of data. Once a simulation is done, I have results that I will then be adding to the graph for future reference.

I'm currently running a test with only one put to see how long it takes. Then I will change things to parse the XML into a list and then process the list to create the graph -- thereby eliminating xml parsing from the equation. While I don't expect the XML parsing to be the bottleneck, I'm trying to eliminate all other possibilities. Your performance numbers would indicate the 685k of puts should take a maximum of approximately 23 minutes. I'm far away from that number; even with the current test I'm running.

I'll try the suggestion to use the low level API as well and report back.

Thanks again for creating cogdb. I think it is a fantastic add to python.

jpbrown-15 commented 1 year ago

I completed the two tests previously mentioned:

Simplified the puts to only 1 put - the first one shown in the code. 114,278 puts took 2.75 hours as fed via the xml tree.
Removed the xml tree from the equation by storing the data in a list and the reading the list to create the put. This time it took 2.8 hours which is just a few minutes longer.

So, I conclude that the xml processing to get the data is not causing delays, rather the put itself is where the delay is occurring.

Next, I will try the low level API to see how that changes throughput.

Regarding the low level API, is there manner in which I should map the high level API's "vertex edge vertex" into the low level API's records and keys?

I'm not clear on how I would transform vertex edge vertex into the below to have any meaningful data:

cogdb.put(Record('A', 'val'))
cogdb.put(Record('B', 'val'))
cogdb.put(Record('key3', 'val'))

Thanks.

jpbrown-15 commented 1 year ago

I did some further testing to see what's happening and used cProfiler to give me a sense for where the time is being spent.

Using just the one put statement and allowing it to do inserts of 10,000 records, cProfiler is telling me I'm spending 614 seconds (10.23 minute) in cog's load_from_store.

Drilling down within load_from_store, __load_value had a total of 613 seconds. Within it, unmarshal 446 seconds while read was 107 seconds.

I'm running on a system with spinning disk instead of SSD. I did run a benchmark on my storage for read and write. The low end of the performance was 259 MB/s with average access time of 0.05 msec. That looks adequate.

I think I'll have to rethink my database design and perhaps retain the XML as my datastore, and use cogdb to hold the metadata about the XML that is useful for the simulation. More to think through.

I still love the pure python approach for cogdb -- keeps things simple for me.

arun1729 commented 1 year ago

I did some further testing to see what's happening and used cProfiler to give me a sense for where the time is being spent.

Using just the one put statement and allowing it to do inserts of 10,000 records, cProfiler is telling me I'm spending 614 seconds (10.23 minute) in cog's load_from_store.

Drilling down within load_from_store, __load_value had a total of 613 seconds. Within it, unmarshal 446 seconds while read was 107 seconds. ....

Appreciate all the testing you have been doing! I did some investigation into the issue and found that for loading a graph the performance bottleneck is a disk based linked list that is used heavily while building the graph. Inserting data into this linked list is extremely slow and gets slower as the list grows. I am working on a solution for this issue currently and hope to have it out soon.

I still love the pure python approach for cogdb -- keeps things simple for me.

Glad to hear! :)

arun1729 commented 1 year ago

So, I conclude that the xml processing to get the data is not causing delays, rather the put itself is where the delay is occurring.

Thanks for testing and narrowing down the issue.

jpbrown-15 commented 1 year ago

Inserting data into this linked list is extremely slow and gets slower as the list grows. I am working on a solution for this issue currently and hope to have it out soon.

I look forward to testing it whenever it is available!

arun1729 commented 1 year ago

@jpbrown-15, I created a new release cogdb 3.0.6; it adds a new caching feature that reduces the number of disk reads it needs to make while loading a graph, it should help with reducing graph load times, please try it out.

jpbrown-15 commented 1 year ago

@arun1729, I've been testing version 3.0.6 and am experiencing an error with the cache.move_to_end routine: `Traceback (most recent call last):

File "/home/jpb/PycharmProjects/montecarlo/montecarlo/data/importxml.py", line 108, in importFile datagraph.put(tradeid,"marketstate",marketstate)

File "/home/jpb/.local/share/virtualenvs/montecarlo-KMWo43Ck/lib/python3.10/site-packages/cog/torque.py", line 220, in put self.cog.put_node(vertex1, predicate, vertex2)

File "/home/jpb/.local/share/virtualenvs/montecarlo-KMWo43Ck/lib/python3.10/site-packages/cog/database.py", line 360, in put_node self.use_table(predicate_hashed).put_set(Record(out_nodes(vertex1), vertex2))

File "/home/jpb/.local/share/virtualenvs/montecarlo-KMWo43Ck/lib/python3.10/site-packages/cog/database.py", line 272, in put_set self.cache.move_to_end(cache_key)

KeyError: ('3323038467', '0:out:')`

When I run with version 3.0.5, I do not get this error. I'm using the same xml input file for both tests. I'll instrument my code a bit to see what record in the XML is causing the problem -- it occurs about 1 hour into the run. While an hour is still too long for my purposes, I want to get a timing on how much faster the caching changes are. With version 3.0.5, the run is 8 hours. We'll see how version 3.0.6 does when I get past the above error.

jpbrown-15 commented 1 year ago

The error is occurring when we have a duplicate key.

The data just before the error shows a recycle in the tradeid from 27260 to 0. The first occurrence of tradeid 0 was with method.tag = IntRS-J-InterferenceEntryNoFilters. As it moves to the next method.tag, the tradeid normally recycles to 0. You will see the put sequence below. It successfully performs the first put of datagraph.put(method.tag,"hastrade",tradeid) but then throws an error with the second put datagraph.put(tradeid,"marketstate",marketstate) since tradeid = 0 was previously used.

data committing to graph:  IntRS-J-InterferenceEntryNoFilters 27260 Fear Interference_v2-intRSNoFilters 2023-06-27T08:30:00-05:00 2023-06-27 -0.003085
data committing to graph:  IntRS-J-InterferencyEntry 0 Greed Interference_v2-intRS 2012-06-26T08:30:00-05:00 2012-06-29 0.08085

# populate the graph db
            print('data committing to graph: ', method.tag, tradeid, marketstate, strategy, entrydate, exitdate, pctlossgain)
            datagraph.put(method.tag,"hastrade",tradeid)
            datagraph.put(tradeid,"marketstate",marketstate)
            datagraph.put(tradeid,"strategy",strategy)
            datagraph.put(tradeid,"entrydate",entrydate)
            datagraph.put(tradeid,"exitdate",exitdate)
            datagraph.put(tradeid,"pctlossgain",pctlossgain)

I will have to explore the graph further to see what was happening under 3.0.5 -- I'm going to guess an update was occurring instead of an insert of different attributes since my key was obviously the same.

arun1729 commented 1 year ago

@jpbrown-15 thanks for testing and pointing out the issue! I have released 3.0.7 with a fix for the issue. Thanks.

jpbrown-15 commented 1 year ago

@arun1729 thanks for the fix and the improvement in performance.

I ran my tests and the new version is twice as fast. Under 3.0.5, the xml file I use for input took just over 8.5 hours to process. With the enhancements in 3.0.7, the same xml file was processed in approximately 4 hours. Nicely done!

arun1729 / cog

Graph initial load speed via put is extremely slow #45