c-w / gutenberg

A simple interface to the Project Gutenberg corpus.
Apache License 2.0

Performance optimization of metadata loading #15

Closed c-w closed 9 years ago

c-w commented 9 years ago

The load_metadata function that makes the Project Gutenberg meta-data RDF graph available to the meta-data extractors takes a very long time to run because it loads ~130MB of data into memory from a flat file (~850MB uncompressed).
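
For context, the in-memory approach looks roughly like the sketch below (this is illustrative, not the library's actual load_metadata implementation; the dump path is hypothetical):

```python
from rdflib import Graph

# Hypothetical path to the extracted Project Gutenberg meta-data dump.
METADATA_DUMP = '/tmp/gutenberg-metadata.nt'


def load_metadata_in_memory():
    """Parse the whole flat file into an in-memory RDF graph.

    This re-parses ~850MB of uncompressed data on every run, which is the
    slow, memory-hungry step this issue is about.
    """
    graph = Graph()
    graph.parse(METADATA_DUMP, format='nt')
    return graph
```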

A low-cost way to address this is to investigate database-backed stores for the RDF graph instead of loading it all into memory.
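
A minimal sketch of what a database-backed store could look like using rdflib's Sleepycat (Berkeley DB) plugin; paths and function names are illustrative assumptions, not the library's API:

```python
from rdflib import Graph

# Hypothetical locations for the on-disk store and the meta-data dump.
METADATA_DB = '/tmp/gutenberg-metadata-db'
METADATA_DUMP = '/tmp/gutenberg-metadata.nt'


def create_metadata_db():
    """One-time population of a disk-backed RDF graph.

    The triples end up in a Berkeley DB store on disk, so subsequent runs
    can open the store without re-parsing the flat file.
    """
    graph = Graph(store='Sleepycat')
    graph.open(METADATA_DB, create=True)
    graph.parse(METADATA_DUMP, format='nt')  # slow, but only happens once
    return graph
```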

Alternative ways to tackle the issue include sharding the meta-data file along multiple dimensions (e.g. text identifier, author, etc.), but this would require adding shard-creation code to every MetadataExtractor, i.e. extending the library would involve more work in the future. A rough sketch of the sharding idea follows below.
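
By way of illustration, sharding by text identifier might look something like this; all names and paths here are hypothetical:

```python
import os

from rdflib import Graph

SHARD_DIR = '/tmp/gutenberg-metadata-shards'  # hypothetical shard location


def shard_path(etextno):
    """Map a Project Gutenberg text identifier to its per-text shard file."""
    return os.path.join(SHARD_DIR, '{0}.nt'.format(etextno))


def load_shard(etextno):
    """Load only the triples for a single text instead of the full dump."""
    graph = Graph()
    graph.parse(shard_path(etextno), format='nt')
    return graph
```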

c-w commented 9 years ago

This just in: with the code in 1ccab87, the first execution of load_metadata takes about 17.5 hours (the time to create the Sleepycat database from scratch from the Project Gutenberg RDF meta-data dump, measured on my Samsung Chromebook). The resulting database is about 2.7GB.

After the database is created, loading it with load_metadata is instantaneous.
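
A sketch of that fast path, assuming the store was created at a hypothetical path on a previous run:

```python
from rdflib import Graph

METADATA_DB = '/tmp/gutenberg-metadata-db'  # populated on a previous run

# Near-instant: no parsing, the triples stay on disk in the Berkeley DB store.
graph = Graph(store='Sleepycat')
graph.open(METADATA_DB, create=False)
```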

TODO: investigate query speed once the database is loaded.

c-w commented 9 years ago

The get_metadata calls are essentially instantaneous.

The get_etexts calls are pretty slow. This is likely caused by their implementation and not by the underlying database - see #17.
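
For reference, these are the kinds of calls that were timed, assuming the query functions live in gutenberg.query as in the released library; the arguments are illustrative:

```python
from gutenberg.query import get_etexts, get_metadata

# Forward lookup: meta-data for a known text number (essentially instantaneous).
print(get_metadata('title', 2701))
print(get_metadata('author', 2701))

# Reverse lookup: all text numbers matching a meta-data value (slow, see #17).
print(get_etexts('author', 'Melville, Herman'))
```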

This is enough evidence for me to integrate the Berkeley DB meta-data backend and consider this issue fixed.