c-w / gutenberg

A simple interface to the Project Gutenberg corpus.
Apache License 2.0
320 stars 60 forks

How do I get the id of each book? #105

Closed iamyihwa closed 6 years ago

iamyihwa commented 6 years ago

Hello, I have been looking for an intuitive way to get the ID of each book.

Getting the ID from the web page of each book doesn't seem to work. When I run 'text = strip_headers(load_etext(17384)).strip()', it says the book doesn't exist.

One way would be to look at catalogs, e.g. http://www.gutenberg.org/dirs/GUTINDEX.1996. However, these indices are not complete, and there are too many files.

Ideally, I would like a way to search with some keywords, get a list of books, and then use that title or identifier to get the text out.

c-w commented 6 years ago

Hi @iamyihwa and thanks for reaching out. Did you take a look at the get_etexts method to search for the IDs of texts by criteria such as author, title, etc.? That looks like it might fit your use-case. There's more information on the feature in the README: https://github.com/c-w/gutenberg#looking-up-meta-data

iamyihwa commented 6 years ago

Hi @c-w thanks for your reply. I have just tried using the functions that were in the link you sent. However I receive invalid cache error. I have attached the details below.

from gutenberg.query import get_etexts
print(get_etexts('title', 'Moby Dick; Or, The Whale'))  # prints frozenset([2701, ...])


AttributeError                            Traceback (most recent call last)
~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/acquire/metadata.py in open(self)
     63             self.graph.open(self.cache_uri, create=False)
---> 64             self._add_namespaces(self.graph)
     65             self.is_open = True

~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/acquire/metadata.py in _add_namespaces(graph)
    131         """
--> 132         graph.bind('pgterms', PGTERMS)
    133         graph.bind('dcterms', DCTERMS)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/rdflib/graph.py in bind(self, prefix, namespace, override)
    917         """
--> 918         return self.namespace_manager.bind(
    919             prefix, namespace, override=override)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/rdflib/graph.py in _get_namespace_manager(self)
    330         if self.__namespace_manager is None:
--> 331             self.__namespace_manager = NamespaceManager(self)
    332         return self.__namespace_manager

~/anaconda3/envs/fastai/lib/python3.6/site-packages/rdflib/namespace.py in __init__(self, graph)
    280         self.__log = None
--> 281         self.bind("xml", "http://www.w3.org/XML/1998/namespace")
    282         self.bind("rdf", RDF)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/rdflib/namespace.py in bind(self, prefix, namespace, override, replace)
    361             prefix = ''
--> 362         bound_namespace = self.store.namespace(prefix)
    363         # Check if the bound_namespace contains a URI

~/anaconda3/envs/fastai/lib/python3.6/site-packages/rdflib/plugins/sleepycat.py in namespace(self, prefix)
    443         prefix = prefix.encode("utf-8")
--> 444         ns = self.__namespace.get(prefix, None)
    445         if ns is not None:

AttributeError: 'Sleepycat' object has no attribute '_Sleepycat__namespace'

During handling of the above exception, another exception occurred:

InvalidCacheException                     Traceback (most recent call last)
in <module>()
      1 from gutenberg.query import get_metadata
----> 2 print(get_etexts('title', 'Moby Dick; Or, The Whale'))  # prints frozenset([2701, ...])

~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/query/api.py in get_etexts(feature_name, value)
     55
     56     """
---> 57     matching_etexts = MetadataExtractor.get(feature_name).get_etexts(value)
     58     return frozenset(matching_etexts)
     59

~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/query/extractors.py in get_etexts(cls, requested_value)
     40     @classmethod
     41     def get_etexts(cls, requested_value):
---> 42         query = cls._metadata()[:cls.predicate():cls.contains(requested_value)]
     43         results = (cls._uri_to_etext(result) for result in query)
     44         return frozenset(result for result in results if result is not None)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/query/api.py in _metadata(cls)
    113
    114     """
--> 115     return load_metadata()
    116
    117     @classmethod

~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/acquire/metadata.py in load_metadata(refresh_cache)
    295
    296     if not cache.is_open:
--> 297         cache.open()
    298
    299     return cache.graph

~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/acquire/metadata.py in open(self)
     65             self.is_open = True
     66         except Exception:
---> 67             raise InvalidCacheException('The cache is invalid or not created')
     68
     69     def close(self):

InvalidCacheException: The cache is invalid or not created

c-w commented 6 years ago

Did you make sure to create the metadata cache before running the query?

from gutenberg.acquire import get_metadata_cache
cache = get_metadata_cache()
cache.populate()

This should only need to be done once since the results are cached on disk. If this doesn't work for you (due to the BerkeleyDB setup on your machine), you can also try using the SQLite cache, which works everywhere but is somewhat slower:

from gutenberg.acquire import set_metadata_cache
from gutenberg.acquire.metadata import SqliteMetadataCache

cache = SqliteMetadataCache('/my/custom/location/cache.sqlite')
cache.populate()
set_metadata_cache(cache)

There's more documentation on this here: https://github.com/c-w/gutenberg#looking-up-meta-data

iamyihwa commented 6 years ago

Hi @c-w Thanks it worked with cache trick! However it doesn't seem to work for the moby dick example but not for others.

image

I get something that says it is 'frozenset' .. any clues?

c-w commented 6 years ago

Hi @iamyihwa. The get_etexts function returns an immutable set, which is why you're seeing frozenset() for your query: there were no results. There were no results because get_etexts currently assumes that you're querying for an exact match, e.g. you know the author's name and want to find all the books they wrote, or you know the name of a book and want to find all the copies in the corpus.
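As a toy illustration of the exact-match behavior (plain Python, not the gutenberg API; the names and data here are made up):

```python
# Toy stand-in for an exact-match index: title -> set of book IDs.
# (Illustrative only; gutenberg's real metadata lives in an RDF graph.)
index = {'Moby Dick; Or, The Whale': frozenset({2701})}

hit = index.get('Moby Dick; Or, The Whale', frozenset())
miss = index.get('moby dick', frozenset())  # not an exact match

print(hit)   # frozenset({2701})
print(miss)  # frozenset() -- an empty result prints exactly like this
```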

In order to do a fuzzy search on titles, e.g. find all the books whose title contains "math", you might be able to use or adapt this snippet:

from gutenberg.acquire import get_metadata_cache
from gutenberg.query.api import MetadataExtractor

# define search parameters
search_term = 'math'
search_field = 'title'

# get a reference to the metadata graph
cache = get_metadata_cache()
cache.open()
graph = cache.graph

# execute the search
extractor = MetadataExtractor.get(search_field)
results = ((extractor._uri_to_etext(etext), value.toPython())
           for (etext, value) in graph[:extractor.predicate():]
           if search_term.lower() in value.toPython().lower())

# print the first result of the search: (25387, 'Mathematical Essays and Recreations')
result = next(results)
print(result)

iamyihwa commented 6 years ago

Thanks @c-w !! I do get results! :-) Is there any way to sort the results, like when I do a search on the Gutenberg website? (There the results are sorted according to popularity.) [screenshot]

What I would like to do eventually is get some domain-specific texts, train a classifier on them, and later use that classifier to determine the domain of unseen text. My domains of interest are math, history, etc. I would like to get, for example, math textbooks rather than matches like 'aftermath ...'.

Sorting by popularity could be one option for this; if you know of any other way, that would be nice!

iamyihwa commented 6 years ago

Hi @c-w I have just tried the function, however with the index that I get, I cannot use it to retrieve the text.

[screenshot]

I want to get the text of the book 'Four Lectures on Mathematics', which has the index 29788: (29788, 'Four Lectures on Mathematics, Delivered at Columbia University in 1911') [screenshot]

However, I get an error. What am I doing wrong? Could you have a look?

c-w commented 6 years ago

In order to get meaningful search relevance, I'd suggest doing a rough filtering of the documents using the Gutenberg library and then ingesting each document's full text into a real search engine like Elasticsearch or Azure Search. That way you'll get nice disambiguation.

If that approach is too heavy, you can also adjust the query condition in the text-search snippet that I sent earlier (if search_term.lower() in value.toPython().lower()) to add some more checks, for example a regex match to exclude words that contain 'math' but not at a word boundary.
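For example, a sketch of such a word-boundary check with Python's standard re module (the titles below are made-up sample data):

```python
import re

# Require 'math' to start at a word boundary so that titles like
# 'The Aftermath' are excluded while 'Mathematics' still matches.
pattern = re.compile(r'\bmath', re.IGNORECASE)

titles = ['Mathematical Essays and Recreations', 'The Aftermath', 'Higher Mathematics']
matches = [t for t in titles if pattern.search(t)]
print(matches)  # ['Mathematical Essays and Recreations', 'Higher Mathematics']
```

You would plug pattern.search(value.toPython()) in as the filter condition instead of the plain substring test.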

Downloading book 29788 fails since it doesn't offer a textual download. I've updated the error message to make this clearer. You can check the available formats for a book like this:

from gutenberg.query import get_metadata

print(get_metadata('formaturi', 29788))
# frozenset({
#    'http://www.gutenberg.org/files/29788/29788-t/29788-t.tex',
#    'http://www.gutenberg.org/files/29788/29788-pdf.pdf',
#    'http://www.gutenberg.org/files/29788/29788-pdf.zip',
#    'http://www.gutenberg.org/ebooks/29788.rdf',
#    'http://www.gutenberg.org/files/29788/29788-t.zip'
# })
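Once you have that frozenset of format URIs, picking out the one you want is plain string filtering. A small sketch (variable names are just for illustration, and the URI set is copied from the output above):

```python
# Format URIs as returned by get_metadata('formaturi', 29788) above.
format_uris = {
    'http://www.gutenberg.org/files/29788/29788-t/29788-t.tex',
    'http://www.gutenberg.org/files/29788/29788-pdf.pdf',
    'http://www.gutenberg.org/files/29788/29788-pdf.zip',
    'http://www.gutenberg.org/ebooks/29788.rdf',
    'http://www.gutenberg.org/files/29788/29788-t.zip',
}

# Pick the URI for the desired extension; next() raises StopIteration
# if the book doesn't offer that format.
pdf_uri = next(uri for uri in format_uris if uri.endswith('-pdf.pdf'))
print(pdf_uri)  # http://www.gutenberg.org/files/29788/29788-pdf.pdf
```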

In order to download one of these non-textual formats, you can use this snippet:

from gutenberg.acquire.text import _etextno_to_uri_subdirectory
from gutenberg.acquire.text import _GUTENBERG_MIRROR

text = 29788
extension = '-pdf.pdf'

url = '{mirror}/{path}/{text}{extension}'.format(
  mirror=_GUTENBERG_MIRROR,
  path=_etextno_to_uri_subdirectory(text),
  text=text,
  extension=extension)
print(url)

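For reference, the path that _etextno_to_uri_subdirectory produces appears to follow the usual Gutenberg mirror convention: each digit of the ID except the last becomes one directory level, followed by the ID itself. A stdlib-only sketch of that layout (my own reimplementation under that assumption; the library helper remains the authoritative version):

```python
def etext_subdirectory(etextno):
    # Assumed layout: 29788 -> '2/9/7/8/29788'. IDs are zero-padded to at
    # least two digits so that single-digit IDs land in a '0' directory.
    digits = str(etextno).zfill(2)
    return '/'.join(list(digits[:-1]) + [str(etextno)])

print(etext_subdirectory(29788))  # 2/9/7/8/29788
print(etext_subdirectory(5))      # 0/5
```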
I'll also open a pull request later to make this functionality available as a single function. Update: the snippet above is now available via the _format_download_uri_for_extension function in the gutenberg.acquire.text module on master.

iamyihwa commented 6 years ago

Thanks @c-w for the quick feedback and for the ways to work around this! Yes, I could get the URL using the snippet. I will download it using some external tool after that. Thanks!

I have also tested the new function, however _format_download_uri_for_extension didn't work even after the update.

[screenshot]

[screenshot]

c-w commented 6 years ago

As I mentioned, the new method was just published to master, but we haven't made a new PyPI release yet. This means that you'll have to install the package from GitHub, e.g. via pip install https://github.com/c-w/gutenberg/archive/d3a98dce92daf2c0cac68e142962aef8cd37b9f0.zip.

c-w commented 6 years ago

@iamyihwa Closing this issue since all of your questions seem to have been addressed. Feel free to reopen if you have any additional questions.

iamyihwa commented 6 years ago

@c-w Thanks for the support. Yes, right now all the doubts and problems have been solved. Thanks a lot again for all the help! I will surely get back when I have more issues.