c-w / gutenberg

A simple interface to the Project Gutenberg corpus.
Apache License 2.0

extracting all etext numbers, titles, and authors #120

Closed ericleasemorgan closed 5 years ago

ericleasemorgan commented 5 years ago

How can I extract all the etext numbers, titles, and authors from the cache?

In other words, how can I first create a list of all the etext identifiers, titles, or authors, and then loop through the list(s) to download the actual texts? Based on the documentation (and my reading of the actual function calls), I don't see a way to do something like this:

from gutenberg.query import get_etexts
from gutenberg.cleanup import strip_headers
from gutenberg.acquire import load_etext

keys = get_etexts('title', '*')  # wildcard lookup -- this is the part I can't find
for key in keys:
  text = strip_headers(load_etext(key)).strip()
  print(text, str(key) + '.txt')

Put yet another way, the system seems to be implemented such that one needs to know an exact etext number, title, or author value before one can successfully query the cache.

P.S. I guess I'd really like a list of all the valid etext numbers or authors rather than the titles.

hugovk commented 5 years ago

Does the code/data in https://github.com/hugovk/gutenberg-metadata help?

ericleasemorgan commented 5 years ago

@hugovk, yes, thank you. This helps. It is not exactly what I was seeking, but it is a step in the right direction. I have downloaded and am running your code. It outputs a JSON file, which I will loop through to fill an SQLite database, and from there do various types of additional indexing.
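
For reference, a minimal sketch of that JSON-to-SQLite step might look like the following. The filename and the 'title'/'author' fields are assumptions about the gutenberg-metadata dump's schema; check that repository for the exact layout.

# Rough sketch only: load the gutenberg-metadata JSON dump and fill a small
# SQLite table. The filename and field names are assumptions about the dump.
import json
import sqlite3

with open('gutenberg-metadata.json') as handle:
    records = json.load(handle)

db = sqlite3.connect('etexts.db')
db.execute('CREATE TABLE IF NOT EXISTS etexts (id INTEGER PRIMARY KEY, title TEXT, author TEXT)')

for etext_id, record in records.items():
    db.execute(
        'INSERT OR REPLACE INTO etexts VALUES (?, ?, ?)',
        (int(etext_id), '; '.join(record.get('title', [])), '; '.join(record.get('author', []))),
    )

db.commit()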

I appreciate the good work done by @c-w, and I had thought about sequentially looping through the etext identifiers starting at #1 and continuing until getting an error, but I was wondering whether some identifiers were missing for one reason or another.

Put another way, both of you (@c-w and @hugovk) saved me a lot of time. Thank you.

c-w commented 5 years ago

If I understand correctly, you're essentially looking to get a list of all the etexts and then look up some metadata for each of them? If so, you may also be able to compile a list of all the etext identifiers using a SPARQL query against the metadata graph:

from gutenberg.acquire.metadata import load_metadata

metadata = load_metadata()

results = metadata.query('''
    select ?ebook
    where {
        ?ebook a <http://www.gutenberg.org/2009/pgterms/ebook>.
    }
''')

# each result row holds a URI like <http://www.gutenberg.org/ebooks/25197>
text_ids = (
    int(row[0].toPython().replace('http://www.gutenberg.org/ebooks/', ''))
    for row in results
)

print(next(text_ids))  # 25197
print(next(text_ids))  # 25198

According to this snippet, there are 59335 etexts, with the minimum etext number being 0 and the maximum being 999999.
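
To tie this back to your original snippet, one way to pair each identifier with its metadata and text would be something like the following sketch (not tested end-to-end; get_metadata returns frozensets which may be empty, and some etexts have no plain-text download):

from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers
from gutenberg.query import get_metadata

for text_id in text_ids:
    # get_metadata returns a frozenset of values, which may be empty
    titles = get_metadata('title', text_id)
    authors = get_metadata('author', text_id)
    try:
        text = strip_headers(load_etext(text_id)).strip()
    except Exception:
        # skip etexts without a plain-text download
        continue
    print(text_id, titles, authors, len(text))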

Let me know if this helps!

ericleasemorgan commented 5 years ago

The above is interesting and good for future reference, but I'm going to go with a Personal Plan B.

More specifically, I'm going to use the JSON output of https://github.com/hugovk/gutenberg-metadata to create an SQLite database. From there I will loop through the database to index the content using Solr. To "ferment" the Solr index I will probably add the full text as well as statistically significant keywords calculated using TF-IDF or some other NLP method.
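
For the keyword step, a rough sketch using scikit-learn's TfidfVectorizer (just one option for illustration; Solr/Lucene or another toolkit could compute the same statistics):

# Rough sketch: rank the top TF-IDF terms per text. scikit-learn is assumed
# here purely for illustration; any TF-IDF implementation would do.
from sklearn.feature_extraction.text import TfidfVectorizer

texts = {
    2701: 'Call me Ishmael. Some years ago...',
    1342: 'It is a truth universally acknowledged...',
}

vectorizer = TfidfVectorizer(stop_words='english')
matrix = vectorizer.fit_transform(texts.values())
terms = vectorizer.get_feature_names_out()

for i, etext_id in enumerate(texts):
    scores = matrix[i].toarray().ravel()
    top = scores.argsort()[::-1][:10]  # ten highest-scoring terms
    print(etext_id, [terms[j] for j in top])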

In my copious spare time I will look at the repository in its canonical form -- a triple store. :-)

c-w commented 5 years ago

If you're interested in building a SQLite database, note that gutenberg does already have first-party support for a SQLite backend:

# first, set up gutenberg to use the SQLite backend

from gutenberg.acquire import set_metadata_cache
from gutenberg.acquire.metadata import SqliteMetadataCache

cache = SqliteMetadataCache('/my/custom/location/cache.sqlite')
cache.populate()  # downloads and parses the full Project Gutenberg catalog, so this can take a while
set_metadata_cache(cache)

# now you can use the library as normal

from gutenberg.query import get_metadata

print(get_metadata('title', 2701))

Perhaps the schema of the database that gets built this way will be useful to you. Note, though, that the SQLite backend is quite a bit slower than the default Berkeley DB one.
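
If you want to peek at that schema without reading the gutenberg source, standard SQLite introspection on the generated file works (the path is simply whatever was passed to SqliteMetadataCache above):

# Standard SQLite introspection: print the CREATE TABLE statements in the cache file.
import sqlite3

conn = sqlite3.connect('/my/custom/location/cache.sqlite')
for (sql,) in conn.execute("SELECT sql FROM sqlite_master WHERE type = 'table'"):
    print(sql)
conn.close()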

If you're ultimately interested in full-text search, gutenberg also supports Apache Jena Fuseki as a metadata cache backend, which integrates directly with Lucene or Elasticsearch. Given that the SQLite backend uses SQLAlchemy under the hood, it may also be possible to modify it to support Postgres as the data store and then leverage a GIN index to implement full-text search.
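
To illustrate the Postgres idea, a hypothetical sketch (it assumes psycopg2, a local database, and a made-up etexts(id, body) table holding the raw texts; this is not something gutenberg ships today):

# Hypothetical sketch of Postgres full-text search over stored etexts using a
# GIN index on a tsvector expression; table and column names are made up.
import psycopg2

conn = psycopg2.connect(dbname='gutenberg')
cur = conn.cursor()

cur.execute("""
    CREATE INDEX IF NOT EXISTS etexts_body_fts
    ON etexts USING gin (to_tsvector('english', body))
""")

cur.execute("""
    SELECT id FROM etexts
    WHERE to_tsvector('english', body) @@ plainto_tsquery('english', %s)
""", ('white whale',))
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()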

ericleasemorgan commented 5 years ago

Thank you for the prompt reply. Using the existing triple store, extracting the metadata, and then filling my own local SQL database seems robust and straightforward.