c-w / gutenberg

A simple interface to the Project Gutenberg corpus.
Apache License 2.0

Additional RDF metadata extractors for language, subject(s), and copyright #50

Closed ikarth closed 8 years ago

ikarth commented 8 years ago

I've added extractors for more metadata fields. As an example, here are the results for Moby Dick:

get_metadata('title', 2701) : frozenset({'Moby Dick; Or, The Whale'})
get_metadata('author', 2701) : frozenset({'Melville, Hermann'})
get_metadata('rights', 2701) : frozenset({'Public domain in the USA.'})
get_metadata('subject', 2701) : frozenset({'Adventure stories', 'Sea stories', 'Whales -- Fiction', 'Psychological fiction', 'Whaling ships -- Fiction', 'Ahab, Captain (Fictitious character) -- Fiction', 'Whaling -- Fiction', 'Mentally ill -- Fiction', 'Ship captains -- Fiction', 'PS'})
get_metadata('language', 2701) : frozenset({rdflib.term.Literal('en', datatype=rdflib.term.URIRef('http://purl.org/dc/terms/RFC4646'))})

ikarth commented 8 years ago

Not sure why the integration check is failing. It looks like it may be related to a prior issue with requirements.pip?

MasterOdin commented 8 years ago

Would it make more sense to just return a string for language instead of an RDF literal, or is knowing the URIRef important?

The failing Python versions on Travis are resolved by #48 and #49.

ikarth commented 8 years ago

I think it would make more sense to return a string. I just couldn't work out a way to do it without completely rewriting SimplePredicateRelationshipExtractor.

MasterOdin commented 8 years ago

So I finally got around to looking into this, and we do not need to rewrite SimplePredicateRelationshipExtractor to get a plain string type.

RDFLib lets you pass in a datatype when you create a literal. The problem here is that the datatype Gutenberg uses for language is not recognized by RDFLib, which is causing this behavior. By default, RDFLib only supports mappings for XSD types. All of the extractors except language either give no datatype (in which case the Literal is just assumed to be a string) or use one of those XSD types. Unfortunately, if you pass in an "unknown" type (such as rdflib.term.URIRef('http://purl.org/dc/terms/RFC4646')), RDFLib does nothing when you attempt to convert the literal to a Python value, so you get the Literal itself back.

This can be resolved by adding the type to RDFLib's mapping. In gutenberg/query/api.py, right after the imports, call rdflib.term.bind(rdflib.term.URIRef('http://purl.org/dc/terms/RFC4646'), unicode) (importing bind and URIRef from rdflib.term as necessary). That gives us a string instead of a Literal:

>>> from gutenberg.query import get_metadata
INFO:rdflib:RDFLib Version: 4.2.1
>>> print(get_metadata('language', 2701))
frozenset([u'en'])

ikarth commented 8 years ago

Simple enough to fix, then.

As a side note, the extra metadata extractors make it easy to get the set of all subjects in the PG books, from 'Abolitionists -- United States -- Biography' to 'Youth -- Conduct of life -- Juvenile fiction'.
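
A rough sketch of what I mean, kept independent of the library so it runs anywhere: collect_subjects is a hypothetical helper (not part of gutenberg), the lookup argument stands in for get_metadata, and text ID 9999 with its subjects is made-up sample data.

```python
def collect_subjects(text_ids, lookup):
    """Union the 'subject' sets of every text ID in text_ids.

    lookup is any callable with the shape lookup(feature, text_id)
    that returns an iterable of subject strings.
    """
    subjects = set()
    for text_id in text_ids:
        subjects.update(lookup('subject', text_id))
    return subjects


# Stand-in data for the sketch; 9999 is an invented ID, not a real book.
_SAMPLE = {
    2701: frozenset({'Sea stories', 'Whaling -- Fiction'}),
    9999: frozenset({'Sea stories', 'Adventure stories'}),
}


def sample_lookup(feature, text_id):
    # Mimics the return shape seen above: a frozenset of strings.
    return _SAMPLE.get(text_id, frozenset())


all_subjects = collect_subjects([2701, 9999], sample_lookup)
print(sorted(all_subjects))
```

With the real library you would pass gutenberg.query.get_metadata as lookup and iterate over whichever text IDs you care about, assuming the metadata cache has already been populated.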

MasterOdin commented 8 years ago

That's what the library is there for: to make it easy to get info from the Gutenberg collection.