DerwenAI / kglab

Graph Data Science: an abstraction layer in Python for building knowledge graphs, integrated with popular graph libraries – atop Pandas, NetworkX, RAPIDS, RDFlib, pySHACL, PyVis, morph-kgc, pslpython, pyarrow, etc.
https://derwen.ai/docs/kgl/
MIT License

load_parquet slow for kg with 200k nodes #202

Open davidshumway opened 2 years ago

davidshumway commented 2 years ago

Very nice library!

I'm just exploring a little and noticed that load_parquet seems to hang when loading from a saved Parquet file. At the very least, it takes much longer to read the KG from file than it did to create the original KG: generating the KG from a CSV (kg.add(...)) takes 2 minutes, while loading the file has been running for over 15 minutes and appears to be hanging. Any ideas?

The Parquet file is ~9 MB, and the KG has about 200k nodes with 4 Literal relations per node.

The code to load the file is:

import time

import kglab

kg2 = kglab.KnowledgeGraph(
  name = "...",
  base_uri = "/ex/",
  namespaces = {
    'sosa': 'http://www.w3.org/ns/sosa/'
  },
)

# Time how long it takes to reload the saved graph
t0 = time.time()
kg2.load_parquet('kg.parquet')
print('Read time: {}s'.format(round((time.time() - t0), 2)))

# Count edges and nodes in the reloaded graph
measure = kglab.Measure()
measure.measure_graph(kg2)
print("edges", measure.get_edge_count())
print("nodes", measure.get_node_count())
# edges 1018040
# nodes 203609
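
To narrow down where the time goes, it may help to time each stage separately rather than the whole load. The following is a minimal sketch using only the standard library; the `timed` helper and the `timings` dict are hypothetical names introduced for illustration and are not part of kglab.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, store):
    """Record the wall-clock seconds spent inside the `with` block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Stored even if the block raises, so a partial run still reports.
        store[label] = time.perf_counter() - start


timings = {}  # hypothetical container: stage label -> elapsed seconds

# Stand-in workload so the sketch runs without kglab installed.
with timed("demo", timings):
    sum(range(100_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.2f}s")
```

As a possible use, one block could wrap `pandas.read_parquet('kg.parquet')` and another `kg2.load_parquet('kg.parquet')`. If the raw read is fast and the second block is slow, that would suggest the time is spent rebuilding the graph rather than reading the file, though this is only a diagnostic idea and not a confirmed cause.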