c-w / gutenberg

A simple interface to the Project Gutenberg corpus.
Apache License 2.0

API v2 #11

Closed: c-w closed this issue 9 years ago

c-w commented 9 years ago

Now that we have a couple of use-cases for the Gutenberg library, I figure it would be a good time to refactor the API. The current implementation was okay to get off the ground; however, it has also turned out to be hard to test and over-complicated.

@MasterOdin, @sethwoodworth:

Currently I'm thinking of a really simple API:

type Text = String
type Attribute = (String, String)
type Uid = Integer

_text_for_uid :: Uid -> Text -- returns the text for a given UID
_attributes_for_uid :: Uid -> [Attribute] -- returns the properties of a given UID, e.g. author, year, title

texts_for_attribute :: Attribute -> [Text] -- returns all the texts for a given attribute-value combination such as ('author', 'Jules Verne'), ('year', '1996') or ('title', 'Moby Dick')

_attributes_for_uid parses the Gutenberg meta-data tarball using RDF to extract all attributes for all the texts. _text_for_uid takes care of downloading the texts (and/or storing them on disk). The last function is trivially defined in terms of the first two.
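For illustration, the third function falls out of the first two roughly like this (a Python sketch; all_uids is a hypothetical helper that enumerates the corpus and isn't part of the proposal):

def texts_for_attribute(attribute):
    # Keep only the texts whose attribute list contains the requested
    # (name, value) pair.
    return [_text_for_uid(uid)
            for uid in all_uids()
            if attribute in _attributes_for_uid(uid)]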

MasterOdin commented 9 years ago

I like the concept of the attribute-value pair as that does give you more flexibility in adding more metadata over time.

My use-case for this would be:

  1. I'd like to specify author (either partial or exact match: if I say 'United States', I may not mean 'United States Government'), language, and title, and I should be able to use as many or as few of these pairs as I want. So perhaps instead of texts_for_attribute, have something like texts_for_attributes :: [Attribute1, Attribute2, ...] -> [Text] which then appends the appropriate where clauses to the search (see the sketch after this list). Of course, I'm not sure of the best way to approach that while also allowing partial/exact searching on some attributes when specifying multiple ones.
  2. I'd probably like to only get texts (so no sounds, etc., as I'm only interested in NLP). This can be wrapped up in the above point with an Attribute for type, though. (See #9)
  3. Have some mechanism for maintaining a DB/RDF store and only adding in new RDF records (even if it's just a simple "only get UIDs that are greater than the last valid UID in the database" rule; I'm not sure whether UIDs are always assigned sequentially upward for new books).
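A rough sketch of the texts_for_attributes idea from point 1, with AND semantics across the pairs (hypothetical names: uids_for_attribute would be a single-attribute lookup returning a set of UIDs, and _text_for_uid is from the proposal above):

def texts_for_attributes(attributes):
    # Intersect the UID sets for each (name, value) pair, i.e. a text
    # must match every criterion to be returned.
    uid_sets = [uids_for_attribute(attribute) for attribute in attributes]
    matching = set.intersection(*uid_sets) if uid_sets else set()
    return [_text_for_uid(uid) for uid in sorted(matching)]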

sethwoodworth commented 9 years ago

I need to do the following:

  1. Given a book_id, get a dict containing: author, title, pub_year, language, etc.
  2. Use your header/footer parser to clean up all the PG texts in my github fork

I was intending to serialize the output of your parser to a psql database and do my own querying. But I haven't dug into what or how I want to search, and am 100% willing to be convinced otherwise.

c-w commented 9 years ago

Thanks for your inputs. I've finally found some time to work on this again.

There are really three things that this package is doing, which is what caused my issues with the initial version of the API:

  1. Get texts from Project Gutenberg.
  2. Search the meta-data of the texts.
  3. Clean up the texts.

I'll re-structure the API along those lines.

For now, I've split out the bit of code that post-processes the Project Gutenberg texts to remove the legal disclaimers and other headers/footers (use-case 3. above). This is the most mature part of the project and quite usable on its own.

The split-out package is gutenberg_cleaner. @sethwoodworth - you might want to have a look at this new package and let me know of any texts where the headers/footers are not parsed out correctly.

sethwoodworth commented 9 years ago

I'll take a look! Thank you

c-w commented 9 years ago

I'm almost code complete on a clean-room re-implementation of the library with a much clearer and more testable API. Below is a preview of the new API - feel free to leave a comment if you feel strongly about any of the things you see.

Using the new API is a lot simpler than before.

def load_etext(etextno):
    """Returns a unicode representation of the Project Gutenberg text with the given identifier."""
    pass

def get_metadata(feature_name, etextno):
    """Looks up a meta-data value for a particular e-text.

    >>> get_metadata('title', 2701)
    u'Moby Dick; or The Whale'

    >>> get_metadata('author', 2701)
    u'Melville, Herman'

    """
    pass

def get_etexts(feature_name, value):
    """Looks up all the texts that have a particular meta-data value.

    >>> get_etexts('title', 'Moby Dick; or The Whale')
    [2701]

    >>> get_etexts('author', 'Melville, Herman')
    [15, 1900, 2489, 2694, 2701, 4045, 8118, 9146, 9147, 9268, 9269, 10712, 11231, 12384, 12841, 13720, 13721, 15422, 15859, 21816, 23969, 28656, 28794, 34970]

    """
    pass

The get_{etexts,metadata} functions reflectively delegate work to MetadataExtractor objects. This means that extending the API with new meta-data features is as simple as creating a new sub-class of MetadataExtractor and loading it into the Python run-time.

from abc import abstractclassmethod  # Python 3.2+; enforcement also requires an ABCMeta metaclass

class MetadataExtractor(object):
    _metadata = load_metadata()  # this is the full Project Gutenberg meta-data RDF graph

    @abstractclassmethod
    def feature_name(cls): pass

    @abstractclassmethod
    def get_metadata(cls, etextno): pass

    @abstractclassmethod
    def get_etexts(cls, feature_value): pass
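For illustration, the reflective dispatch could work roughly like this (a sketch; discovery via __subclasses__() is an assumption, since the actual mechanism isn't spelled out here):

def _find_extractor(feature_name):
    # Walk the MetadataExtractor sub-classes currently loaded into the
    # run-time and pick the one that declares the requested feature.
    for extractor in MetadataExtractor.__subclasses__():
        if extractor.feature_name() == feature_name:
            return extractor
    raise ValueError('no extractor for feature {!r}'.format(feature_name))

def get_metadata(feature_name, etextno):
    return _find_extractor(feature_name).get_metadata(etextno)

def get_etexts(feature_name, value):
    return _find_extractor(feature_name).get_etexts(value)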

MasterOdin commented 9 years ago

I would argue that it's important for get_etexts to support not just one feature/value pair but potentially multiple.

A potential use case that illustrates this: how would I get only the German books by the author "Various"?

The solution as proposed by your API would be:

texts = get_etexts('author', 'various')
final_list = []
for text in texts:
    if get_metadata('language', text) != 'german':
        continue  # skip texts that aren't in German
    final_list.append(text)

which is somewhat weird, as I'd like the API to handle this internally (especially if I want to get even more specific with my criteria and don't want to build up that if statement!).

So I'd say maybe change get_etexts to support passing in either two strings for one feature or, probably easier, a dictionary, which would allow for any number of criteria:

texts = get_etexts({'author': 'various', 'language': 'german'})

I'm also curious how exactly load_metadata would work: is get_metadata still making some sort of SPARQL call, or is it loading the RDF data into some sort of dictionary/list structure that's easier to work with straight from Python? Knowing that would give me a better understanding of how I'd go about extending the MetadataExtractor class.

sethwoodworth commented 9 years ago

Django makes chaining query filters really easy: its queryset methods return a new queryset, which allows for chaining of .filter() calls. Something similar could work here:

texts = MetadataExtractor.get_etexts('author', 'various').get_etexts('language', 'german')

c-w commented 9 years ago

Query composition

How about having get_etexts return a set of texts instead of a list? This would be more semantically correct (we're returning a set of results that is inherently without order) and would allow simple query composition without needing to add any special mechanisms.

texts = get_etexts('author', 'various') & get_etexts('language', 'german')

Metadata

The load_metadata function returns an rdflib.Graph of all the information in the Project Gutenberg meta-data catalog. The graph supports both SPARQL queries and indexing based on triple-matching. Currently, loading the meta-data graph is quite slow: the MVP implementation de-serializes the graph from disk into memory, which means loading ~100MB of compressed RDF data (~800MB uncompressed). Going forward, there are a number of ways to improve performance (e.g. use a graph backed by an on-disk database instead of having to load the entire graph into memory).
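One possible shape for the on-disk variant, using rdflib's Sleepycat store (a sketch, not a decision: the store choice and cache path are assumptions, and Sleepycat requires the BerkeleyDB bindings):

from rdflib import Graph

def load_metadata(cache_path='/var/cache/gutenberg/metadata'):
    # Open a persistent BerkeleyDB-backed graph so the ~800MB of triples
    # don't have to be re-parsed into memory on every run.
    graph = Graph('Sleepycat')
    graph.open(cache_path, create=True)
    return graph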

Here is an example of how we'd write a MetadataExtractor that gets title information.

import os

from rdflib.namespace import DCTERMS
from rdflib.term import URIRef

def etext_to_uri(etext):
    uri = r'http://www.gutenberg.org/ebooks/{}'.format(etext)
    return URIRef(uri)

def uri_to_etext(uri):
    return int(os.path.basename(uri.toPython()))

class TitleExtractor(MetadataExtractor):
    @classmethod
    def feature_name(cls):
        return 'title'

    @classmethod
    def get_metadata(cls, etextno):
        query = cls._metadata[etext_to_uri(etextno) : DCTERMS.title : ]  # finds all the triples where the subject==etext and the predicate==title
        try:
            return next(query).toPython()
        except StopIteration:
            return None

    @classmethod
    def get_etexts(cls, feature_value):
        query = cls._metadata[ : DCTERMS.title : ]  # find all the triples where the predicate==title
        return [uri_to_etext(uri) for uri, value in query
                if value.toPython() == feature_value]

MasterOdin commented 9 years ago

I guess my primary concern would be the speed of doing multiple queries/triple lookups against the entire Gutenberg graph (as you mentioned, it's hundreds of MB of data in memory) and then doing set/list manipulations, as opposed to setting up one query within SPARQL (assuming that's possible; I'm not 100% sure, as I haven't played with it enough) which would return the wanted result directly.
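For reference, a single combined SPARQL query might look roughly like this (a sketch: the flat dcterms predicates and plain literal matching are assumptions, since the real Project Gutenberg catalog nests author and language data behind intermediate nodes; uri_to_etext is from the example above):

query = '''
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT DISTINCT ?ebook
    WHERE {
        ?ebook dcterms:creator ?author .
        ?ebook dcterms:language ?language .
        FILTER (str(?author) = "Various" && str(?language) = "de")
    }
'''
results = load_metadata().query(query)
etexts = sorted(uri_to_etext(row.ebook) for row in results)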

@sethwoodworth That would work if get_etexts(cls, feature_value) returned a new MetadataExtractor wrapping a smaller RDF graph, which then got filtered, and so on. That would be easy to do with SQL queries, but I have no idea how to do that with RDF graphs, nor what the speed impact of chaining queries into new RDF graphs like that would be.