c-w / gutenberg

A simple interface to the Project Gutenberg corpus.
Apache License 2.0

Expose more meta-data #2

Closed c-w closed 9 years ago

c-w commented 10 years ago

Currently this library only makes use of the author and title meta-data exposed by Project Gutenberg and does not leverage information such as genre, publication date, etc.

Making this information usable by the library is a fairly straightforward three-step process:

  1. The TextSource.textinfo_converter method needs to be extended to parse the new meta-data attributes.
  2. The new attributes need to be wired through to the TextInfo class.
  3. A new method leveraging the new meta-data source should be added to the Corpus class (such as texts_for_genre or texts_for_year).

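
The three steps above could be sketched roughly as follows. This is only an illustration: the `TextInfo` and `Corpus` classes below are hypothetical in-memory stand-ins (the library's real classes may look quite different), and the `subjects` and `issued` field names are assumptions.

```python
from collections import namedtuple

# Step 2: hypothetical TextInfo, extended with two new meta-data fields.
TextInfo = namedtuple("TextInfo", "uid author title subjects issued")


def textinfo_converter(record):
    """Step 1: build a TextInfo from a parsed meta-data record (a plain dict here)."""
    return TextInfo(
        uid=record["uid"],
        author=record.get("author"),
        title=record.get("title"),
        subjects=tuple(record.get("subjects", ())),
        issued=record.get("issued"),  # assumed to be an ISO date string, e.g. "1996-01-01"
    )


class Corpus:
    """Hypothetical in-memory corpus, for illustration only."""

    def __init__(self, records):
        self._infos = [textinfo_converter(r) for r in records]

    # Step 3: new lookup methods built on the new fields.
    def texts_for_genre(self, genre):
        return [t for t in self._infos if genre in t.subjects]

    def texts_for_year(self, year):
        prefix = str(year)
        return [t for t in self._infos if t.issued and t.issued.startswith(prefix)]
```

Records without a subject or date are simply skipped by the lookups rather than raising an error.
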
MasterOdin commented 9 years ago

Just so I know what you mean by this:

genre - what Gutenberg calls "Subject". However, it seems most books have multiple "subject" categories: one or more LoC classifications (for which we can only get the name of the broad category, nothing more specific) and one or more "subjects". How to represent this within the database will be the biggest hurdle (perhaps another table for subjects that links back to the main table, with a row per subject?). The best example is the Magna Carta.
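
A minimal sketch of the "separate subjects table, one row per subject" idea, using an in-memory SQLite database. The table and column names (`texts`, `subjects`, `text_uid`) and the helper functions are assumptions made for illustration, not the library's actual schema.

```python
import sqlite3


def build_db(conn):
    """Create a main table plus a subjects table with one row per (text, subject)."""
    conn.executescript("""
        CREATE TABLE texts (uid INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE subjects (
            text_uid INTEGER NOT NULL REFERENCES texts(uid),
            subject  TEXT NOT NULL,
            PRIMARY KEY (text_uid, subject)
        );
    """)


def add_text(conn, uid, title, subjects):
    """Insert one text and link each of its subjects back to it."""
    conn.execute("INSERT INTO texts (uid, title) VALUES (?, ?)", (uid, title))
    conn.executemany(
        "INSERT INTO subjects (text_uid, subject) VALUES (?, ?)",
        [(uid, s) for s in subjects],
    )


def texts_for_subject(conn, subject):
    """Return the titles of all texts tagged with the given subject."""
    rows = conn.execute(
        "SELECT t.title FROM texts t "
        "JOIN subjects s ON s.text_uid = t.uid "
        "WHERE s.subject = ? ORDER BY t.uid",
        (subject,),
    )
    return [row[0] for row in rows]
```

With this layout a text that carries several subjects just gets several rows in `subjects`, and the main table does not need to change.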

publication date - what date it became available on Gutenberg (as no other date exists)

The only other useful things you could pull out of the RDF would be the language of the text and maybe the license.

c-w commented 9 years ago

Hi Matthew - since I haven't worked on this project in a while: let me take some time to look into this and I'll get back to you soon.

sethwoodworth commented 9 years ago

Can I get a few more pointers about getting this set up? I'd like to try adding SPARQL to grab the licensing terms:

        ebook = next(iter(rdf_graph.query('''
            SELECT
                ?ebook
                ?author
                ?title
                ?rights
            WHERE {
                ?ebook a pgterms:ebook.
                OPTIONAL { ?ebook dcterms:creator [ pgterms:name ?author ]. }
                OPTIONAL { ?ebook dcterms:title ?title. }
                OPTIONAL { ?ebook dcterms:rights ?rights. }
            }
            LIMIT 1
        ''')))

What is the best way to go about testing this with a range of sample RDF files? Are there structures in place for this?
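
One possible shape for such tests is a table of small inline RDF/XML fixtures paired with expected values. The sketch below is an assumption-heavy illustration: it uses the standard library's `xml.etree.ElementTree` instead of rdflib and SPARQL so that it runs without extra dependencies, the fixtures are made-up placeholders rather than real Project Gutenberg records, and `make_sample`, `extract_rights`, and `run_cases` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}

RDF_TEMPLATE = """<rdf:RDF xmlns:rdf="{rdf}" xmlns:dcterms="{dcterms}" xmlns:pgterms="{pgterms}">
  <pgterms:ebook rdf:about="ebooks/{uid}">{body}</pgterms:ebook>
</rdf:RDF>"""


def make_sample(uid, body):
    """Build a tiny made-up RDF/XML fixture around the given ebook body."""
    return RDF_TEMPLATE.format(uid=uid, body=body, **NS)


def extract_rights(rdf_xml):
    """Return the dcterms:rights text of the first ebook, or None if absent."""
    root = ET.fromstring(rdf_xml)
    ebook = root.find("pgterms:ebook", NS)
    if ebook is None:
        return None
    node = ebook.find("dcterms:rights", NS)
    return node.text if node is not None else None


# Each case pairs a fixture with the value the extractor should produce.
CASES = [
    (make_sample(1, "<dcterms:rights>Placeholder rights text</dcterms:rights>"),
     "Placeholder rights text"),
    (make_sample(2, "<dcterms:title>No rights here</dcterms:title>"), None),
]


def run_cases():
    """Check every fixture and report which cases passed."""
    return [extract_rights(xml) == expected for xml, expected in CASES]
```

The same fixture-plus-expectation layout should carry over to a SPARQL-based extractor: only `extract_rights` would need to change.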

@rdhyee and I are very interested in extending your library for GITenberg.

c-w commented 9 years ago

@sethwoodworth - I'm currently refactoring the way in which meta-data is extracted and exposed through the library: see #11. The refactor will make it a lot easier to extend the library (more testable, less convoluted, etc.), so I'd recommend holding off on this issue until the refactor is live.

My first implementation of the library wasn't particularly good - classic "works for me in this very particular situation" code, not particularly fit for general consumption. I was somewhat surprised that this library is actually useful to people. This refactor will hopefully fix the initial design flaws and enable people such as yourself to be productive with the library.

The bulk of the refactor is done - I expect the changes to go live towards the end of the week.

sethwoodworth commented 9 years ago

Fantastic. I will keep my eyes open for those changes. In the meantime, @rdhyee made a gist with the SPARQL needed for more metadata properties.

Preview:

    qres = g.query('''
                SELECT
                    ?ebook
                    ?author
                    ?author_webpage
                    ?title
                    ?friendlytitle
                    ?language
                    ?description
                    ?rights
                    ?type
                    ?issued
                    ?downloads
                WHERE {
                    ?ebook a pgterms:ebook.
                    OPTIONAL { ?ebook dcterms:creator [ pgterms:name ?author ]. }
                    OPTIONAL { ?ebook dcterms:creator [ pgterms:webpage ?author_webpage ]. }
                    OPTIONAL { ?ebook dcterms:title ?title. }
                    OPTIONAL { ?ebook dcterms:friendlytitle ?friendlytitle. }
                    OPTIONAL { ?ebook dcterms:language [rdf:value ?language] .}
                    OPTIONAL { ?ebook dcterms:description ?description.}
                    OPTIONAL { ?ebook dcterms:rights ?rights.}
                    OPTIONAL { ?ebook dcterms:type [ rdf:value ?type].}
                    OPTIONAL { ?ebook dcterms:issued ?issued.}
                    OPTIONAL { ?ebook pgterms:downloads ?downloads.}
                }
                LIMIT 1
    ''')