c-w / gutenberg

A simple interface to the Project Gutenberg corpus.
Apache License 2.0

Migrate RDFlib backend away from BSD-DB #31

Closed: c-w closed this issue 8 years ago

c-w commented 8 years ago

Issues #26 and #28 show that there are some problems with RDFlib's default BSD-DB backend under Python 3 and on Windows. This makes it worthwhile to spend some time investigating a switch away from BSD-DB towards an alternative data store, e.g. SQLite via RDFlib-SQLAlchemy.

lulu-berlin commented 8 years ago

I'm looking into the SQLite option.

I was wondering: can't the RDF graph just be stored in memory and pickled? (I found this.) What is the advantage of having a database?
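For reference, here is a minimal stdlib-only sketch of the pickle idea, using a plain set of triples in place of an actual rdflib graph (the `triples` structure and file path are illustrative, not part of this project):

```python
import gzip
import pickle
import tempfile

# Stand-in for an RDF graph: a set of (subject, predicate, object) triples.
triples = {
    ('ebooks/1', 'dcterms:title', 'The Declaration of Independence'),
    ('ebooks/1', 'dcterms:creator', 'Jefferson, Thomas'),
}

# "Populate" step: pickle the whole structure to a gzipped file on disk.
path = tempfile.mktemp(suffix='.pkl.gz')
with gzip.open(path, 'wb') as fobj:
    pickle.dump(triples, fobj)

# "Open" step: the entire structure is deserialized back into memory,
# so every query pays the full graph's memory footprint up front.
with gzip.open(path, 'rb') as fobj:
    restored = pickle.load(fobj)

assert restored == triples
```

This illustrates the trade-off the thread converges on: a pickled graph gives fast lookups once loaded, but the whole graph must fit in memory, whereas a database keeps the working set on disk.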

c-w commented 8 years ago

As of f055ce9, we're now using a SQLite-based cache backend when BSD-DB isn't installed on the user's machine. However, @cpeel found in #38 that the SQLite backend is slow. An investigation of the pickle-based alternative by @andreasvc is therefore still required.

c-w commented 8 years ago

I played around a bit with an in-memory cache using rdflib.IOMemory, but the resource usage is simply too large to be practical: the loaded cache used more than 6 GB of memory on my machine.

Here's the implementation:

From ec2643e5f9017feb6f12802e31156db904777af5 Mon Sep 17 00:00:00 2001
From: Clemens Wolff <clemens.wolff+git@gmail.com>
Date: Fri, 19 Feb 2016 16:50:59 -0800
Subject: [PATCH] Implement in-memory cache

---
 gutenberg/acquire/metadata.py | 28 ++++++++++++++++++++++++++++
 tests/test_metadata_cache.py  |  8 ++++++++
 2 files changed, 36 insertions(+)

diff --git a/gutenberg/acquire/metadata.py b/gutenberg/acquire/metadata.py
index 565e221..251d41f 100644
--- a/gutenberg/acquire/metadata.py
+++ b/gutenberg/acquire/metadata.py
@@ -5,6 +5,7 @@
 from __future__ import absolute_import

 import abc
+import gzip
 import logging
 import os
 import re
@@ -210,6 +211,33 @@ class SqliteMetadataCache(MetadataCache):
         return self.cache_uri[len(self._CACHE_URI_PREFIX):]

+class InMemoryMetadataCache(MetadataCache):
+    def __init__(self, cache_location):
+        store = 'IOMemory'
+        MetadataCache.__init__(self, store, cache_location)
+        self.serialization_format = 'xml'
+
+    def populate(self):
+        MetadataCache.populate(self)
+        self._serialize()
+
+    def open(self):
+        self._deserialize()
+        self.is_open = True
+
+    def _serialize(self):
+        with gzip.open(self._local_storage_path, 'wb') as fobj:
+            self.graph.serialize(fobj, self.serialization_format)
+
+    def _deserialize(self):
+        try:
+            with gzip.open(self._local_storage_path, 'rb') as fobj:
+                self.graph = Graph()
+                self.graph.load(fobj, format=self.serialization_format)
+        except Exception:
+            raise InvalidCacheException('Unable to deserialize cache')
+
+
 _METADATA_CACHE = None

diff --git a/tests/test_metadata_cache.py b/tests/test_metadata_cache.py
index 3697d9b..dcd1016 100644
--- a/tests/test_metadata_cache.py
+++ b/tests/test_metadata_cache.py
@@ -13,6 +13,7 @@ from six import u
 from gutenberg._util.url import pathname2url
 from gutenberg.acquire.metadata import CacheAlreadyExistsException
 from gutenberg.acquire.metadata import InvalidCacheException
+from gutenberg.acquire.metadata import InMemoryMetadataCache
 from gutenberg.acquire.metadata import SleepycatMetadataCache
 from gutenberg.acquire.metadata import SqliteMetadataCache
 from gutenberg.acquire.metadata import set_metadata_cache
@@ -112,6 +113,13 @@ class TestSqlite(MetadataCache, unittest.TestCase):
         self.cache.catalog_source = _sample_metadata_catalog_source()

+class TestInMemory(MetadataCache, unittest.TestCase):
+    def setUp(self):
+        self.local_storage = "%s.gz" % tempfile.mktemp()
+        self.cache = InMemoryMetadataCache(self.local_storage)
+        self.cache.catalog_source = _sample_metadata_catalog_source()
+
+
 def _sample_metadata_catalog_source():
     module = os.path.dirname(sys.modules['tests'].__file__)
     path = os.path.join(module, 'data', 'sample-rdf-files.tar.bz2')
-- 
1.9.1

This means that, for now, the best option for the cache backend still seems to be BSD-DB with a SQLite fallback.
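The fallback logic this conclusion implies can be sketched as follows (stdlib only; the function name and the returned store labels are illustrative, not the library's actual API):

```python
import importlib.util

def pick_cache_backend():
    # Prefer the fast BSD-DB (Sleepycat) store when Berkeley DB bindings
    # are importable; otherwise fall back to the slower SQLite-based store.
    for module in ('bsddb3', 'bsddb'):
        if importlib.util.find_spec(module) is not None:
            return 'Sleepycat'
    return 'SQLite'

print(pick_cache_backend())
```

Probing for the optional dependency with `importlib.util.find_spec` (rather than a bare `import` in a `try` block) avoids actually loading the Berkeley DB bindings just to decide which backend to construct.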