c-w closed this issue 8 years ago.
I'm looking into the SQLite option.
I was wondering: can't the RDF graph just be stored in memory and pickled? (I found this.) What is the advantage of having a database?
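For concreteness, the pickle approach from the question could look roughly like the sketch below. The `save_cache`/`load_cache` names and the gzip compression are illustrative choices, not part of the gutenberg API, and a plain dict stands in for the RDF graph:

```python
import gzip
import pickle


def save_cache(graph, path):
    # Persist the whole in-memory graph in one shot; gzip keeps the
    # on-disk file small at the cost of some CPU time.
    with gzip.open(path, 'wb') as fobj:
        pickle.dump(graph, fobj, protocol=pickle.HIGHEST_PROTOCOL)


def load_cache(path):
    # Restore the graph without touching a database at all.
    with gzip.open(path, 'rb') as fobj:
        return pickle.load(fobj)
```

The trade-off is that the entire graph has to be rebuilt in memory on every load, so this only pays off if the graph comfortably fits in RAM.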
As of f055ce9, we're now using a SQLite-based cache backend when BSD-DB isn't installed on the user's machine. However, @cpeel found in #38 that the SQLite backend is slow. An investigation of the pickle-based alternative by @andreasvc is therefore still required.
I played around a bit with an in-memory cache using rdflib.IOMemory, but the resource usage is simply too large to be practical: the loaded cache used more than 6 GB of memory on my machine.
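A back-of-the-envelope calculation makes the memory blow-up plausible. The triple below is illustrative, not real catalog data:

```python
import sys

# Rough per-triple cost as plain Python objects: a tuple of three strings.
triple = (
    'http://www.gutenberg.org/ebooks/2701',
    'http://purl.org/dc/terms/title',
    'Moby Dick; Or, The Whale',
)
per_triple = sys.getsizeof(triple) + sum(sys.getsizeof(part) for part in triple)

# Even before indexing, each triple costs a few hundred bytes; an
# in-memory store like rdflib's IOMemory keeps several index
# dictionaries over the triples on top of that, so a catalog with
# millions of triples can plausibly grow into the gigabytes.
print(per_triple)
```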
Here's the implementation:
From ec2643e5f9017feb6f12802e31156db904777af5 Mon Sep 17 00:00:00 2001
From: Clemens Wolff <clemens.wolff+git@gmail.com>
Date: Fri, 19 Feb 2016 16:50:59 -0800
Subject: [PATCH] Implement in-memory cache
---
gutenberg/acquire/metadata.py | 28 ++++++++++++++++++++++++++++
tests/test_metadata_cache.py | 8 ++++++++
2 files changed, 36 insertions(+)
diff --git a/gutenberg/acquire/metadata.py b/gutenberg/acquire/metadata.py
index 565e221..251d41f 100644
--- a/gutenberg/acquire/metadata.py
+++ b/gutenberg/acquire/metadata.py
@@ -5,6 +5,7 @@
 from __future__ import absolute_import
 import abc
+import gzip
 import logging
 import os
 import re
@@ -210,6 +211,33 @@ class SqliteMetadataCache(MetadataCache):
         return self.cache_uri[len(self._CACHE_URI_PREFIX):]
+class InMemoryMetadataCache(MetadataCache):
+    def __init__(self, cache_location):
+        store = 'IOMemory'
+        MetadataCache.__init__(self, store, cache_location)
+        self.serialization_format = 'xml'
+
+    def populate(self):
+        MetadataCache.populate(self)
+        self._serialize()
+
+    def open(self):
+        self._deserialize()
+        self.is_open = True
+
+    def _serialize(self):
+        with gzip.open(self._local_storage_path, 'wb') as fobj:
+            self.graph.serialize(fobj, self.serialization_format)
+
+    def _deserialize(self):
+        try:
+            with gzip.open(self._local_storage_path, 'rb') as fobj:
+                self.graph = Graph()
+                self.graph.load(fobj, format=self.serialization_format)
+        except Exception:
+            raise InvalidCacheException('Unable to deserialize cache')
+
+
 _METADATA_CACHE = None
diff --git a/tests/test_metadata_cache.py b/tests/test_metadata_cache.py
index 3697d9b..dcd1016 100644
--- a/tests/test_metadata_cache.py
+++ b/tests/test_metadata_cache.py
@@ -13,6 +13,7 @@ from six import u
 from gutenberg._util.url import pathname2url
 from gutenberg.acquire.metadata import CacheAlreadyExistsException
 from gutenberg.acquire.metadata import InvalidCacheException
+from gutenberg.acquire.metadata import InMemoryMetadataCache
 from gutenberg.acquire.metadata import SleepycatMetadataCache
 from gutenberg.acquire.metadata import SqliteMetadataCache
 from gutenberg.acquire.metadata import set_metadata_cache
@@ -112,6 +113,13 @@ class TestSqlite(MetadataCache, unittest.TestCase):
         self.cache.catalog_source = _sample_metadata_catalog_source()
+class TestInMemory(MetadataCache, unittest.TestCase):
+    def setUp(self):
+        self.local_storage = "%s.gz" % tempfile.mktemp()
+        self.cache = InMemoryMetadataCache(self.local_storage)
+        self.cache.catalog_source = _sample_metadata_catalog_source()
+
+
 def _sample_metadata_catalog_source():
     module = os.path.dirname(sys.modules['tests'].__file__)
     path = os.path.join(module, 'data', 'sample-rdf-files.tar.bz2')
--
1.9.1
This means that, for now, the best option for the cache backend still seems to be BSD-DB with a SQLite fall-back.
Issues #26 and #28 show that there are problems with RDFlib's default BSD-DB backend under Python 3 and on Windows. This makes it worthwhile to spend some time investigating a switch away from BSD-DB to an alternative data store, e.g. SQLite via RDFlib-SQLAlchemy.
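To illustrate the idea behind a SQLite-backed triple store, here is a minimal stdlib-only sketch. This is not the schema rdflib-sqlalchemy uses (it maintains its own tables and indexes and should be preferred in practice); the table and index names are made up for illustration:

```python
import sqlite3

# One table of (subject, predicate, object) rows with a covering index:
# queries hit the index instead of loading the whole graph into memory.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE triples (s TEXT, p TEXT, o TEXT)')
conn.execute('CREATE INDEX idx_spo ON triples (s, p, o)')

conn.execute(
    'INSERT INTO triples VALUES (?, ?, ?)',
    ('http://www.gutenberg.org/ebooks/2701',
     'http://purl.org/dc/terms/title',
     'Moby Dick; Or, The Whale'))

# Look up a single object without deserializing anything else.
title, = conn.execute(
    'SELECT o FROM triples WHERE s = ? AND p = ?',
    ('http://www.gutenberg.org/ebooks/2701',
     'http://purl.org/dc/terms/title')).fetchone()
```

The appeal over BSD-DB is portability: the `sqlite3` module ships with CPython on all platforms, so there is no extra native dependency to install.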