c-w / gutenberg

A simple interface to the Project Gutenberg corpus.
Apache License 2.0

Enhanced metadata cache management #38

Closed cpeel closed 8 years ago

cpeel commented 8 years ago

@c-w I'm opening this pull request up mostly to generate discussion about what I've been working on. It needs more testcases as well as more thorough testing of the sqlalchemy+sqlite backend and corner cases.

Background: I'm working to incorporate the gutenberg package into OpenLibrary to highlight PG ePubs for OpenLibrary records that have PG numbers associated with them. The inability to cleanly specify the cache location, combined with the cache populating itself on first use, doesn't fit well into such a service-oriented setting. This branch encapsulates the metadata cache logic into its own class, giving the package user much more control over how it works.

As noted in the commit, this is backwards compatible with one exception: the cache must be manually populated the first time, otherwise an exception is thrown. This preserves the principle of least surprise ('cause an 18-hour return time on the first call is very surprising)!
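To make the intended usage concrete, here is a minimal sketch of the populate-before-use contract described above. The class and exception names are hypothetical stand-ins for illustration, not the actual API in this branch:

```python
import os
import tempfile


class CacheNotPopulatedError(Exception):
    """Raised when the cache is queried before being populated."""


class MetadataCache:
    """Hypothetical sketch of the proposed cache manager.

    The caller chooses the cache location and must populate the cache
    explicitly before querying it; a query against an empty cache raises
    instead of silently kicking off an hours-long catalog load.
    """

    def __init__(self, cache_uri):
        self.cache_uri = cache_uri
        self._store = None  # filled in by populate()

    @property
    def is_populated(self):
        return self._store is not None

    def populate(self):
        # A real implementation would download and parse the PG catalog
        # here; a trivial in-memory store stands in for illustration.
        self._store = {2701: "Moby Dick; Or, The Whale"}

    def query(self, etext_id):
        if not self.is_populated:
            raise CacheNotPopulatedError(
                "call populate() before querying the cache")
        return self._store.get(etext_id)


cache = MetadataCache(os.path.join(tempfile.gettempdir(), "pg-cache"))
try:
    cache.query(2701)  # not populated yet: raises instead of blocking
except CacheNotPopulatedError:
    pass
cache.populate()       # explicit, one-time step under the caller's control
print(cache.query(2701))
```

The point of the design is that the expensive population step is an explicit call the service operator schedules, rather than a surprise side effect of the first metadata lookup.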

c-w commented 8 years ago

See also #31.

c-w commented 8 years ago

Thanks for the pull request. Looks like a great start.

I'm not at all wedded to keeping compatibility with BSD-DB (makes the installation more of a hassle, issues on Windows, etc), so feel free to entirely remove BSD-DB in favour of SQLite.

cpeel commented 8 years ago

OK, this is as good as I want to make it right now. Unfortunately, after putting in all the work to enable SQLAlchemy+sqlite, it turns out the BSD backend is 3 times faster than the SQLAlchemy+sqlite backend during loading.

Even worse, cache population using SQLAlchemy+sqlite segfaults on a trimmed-down PG catalog with 2024 entries in it (BSD backend completes in 153 seconds on the same dataset). It's certainly possible that pointing SQLAlchemy towards a real DB might be faster and succeed on larger data sets, but that seems very heavy-handed for this use-case.
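For anyone wanting to dig into the slowdown: one common reason sqlite loads are slow is committing per row instead of batching inserts in a single transaction. The harness below is not the actual backends from this branch, just a self-contained sketch (using the stdlib `sqlite3` module and synthetic triples) of how to measure that effect on a catalog-sized dataset:

```python
import sqlite3
import time


def load_triples(conn, triples, batch=True):
    """Insert (subject, predicate, object) rows, either one commit per
    row or batched inside a single transaction."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS triples (s TEXT, p TEXT, o TEXT)")
    if batch:
        with conn:  # one transaction for the whole load
            conn.executemany(
                "INSERT INTO triples VALUES (?, ?, ?)", triples)
    else:
        for t in triples:  # commit after every row (pathological case)
            conn.execute("INSERT INTO triples VALUES (?, ?, ?)", t)
            conn.commit()


# Synthetic stand-in for a trimmed-down PG catalog of ~2024 entries.
triples = [(f"ebook/{i}", "dc:title", f"Title {i}") for i in range(2024)]

for batch in (False, True):
    conn = sqlite3.connect(":memory:")
    start = time.perf_counter()
    load_triples(conn, triples, batch=batch)
    elapsed = time.perf_counter() - start
    print(f"batched={batch}: {elapsed:.3f}s for {len(triples)} rows")
    conn.close()
```

If the SQLAlchemy+sqlite backend is flushing row by row, a harness like this would make the cost visible and suggest where batching could help, though it wouldn't explain the segfault.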

I think the metadata cache manager is still immensely useful for more fine-grained control over the backend cache, particularly when being used by a web service, but the use of SQLAlchemy+sqlite is not a silver bullet for making it faster. I suspect for that we'd have to move off of an RDF backend altogether and onto a backend (possibly sqlite) with a custom schema.

cpeel commented 8 years ago

@c-w, is this work still of interest to you? Getting more fine-grained control over the creation and management of the cache is blocking some of the PG integration work into OpenLibrary, so I'm hoping to get this, or something like it, merged.

If there are additional things you'd like to see in this branch, please let me know.

c-w commented 8 years ago

@cpeel: Thanks for the pull request and apologies for the delay in getting back to you. Given that all the tests pass and that you added new tests, I'm happy to merge this branch. I'll take a deeper look once I again have more time to devote to this project.