Closed (c-w closed this issue 9 years ago)
Just so I know what you mean by these:
genre - what Gutenberg calls "Subject". However, it seems most books have multiple subject categories: one or more LoC classifications (we can only get the name of the broad category, nothing more specific) and one or more "subjects". How to represent this within the database will be the biggest hurdle (perhaps another table for subjects that links back to the main table, with one row per subject?). The best example is the Magna Carta.
publication date - the date the text became available on Gutenberg (as no other date exists).
The only other useful things you can pull out of the RDF would be the language of the text and maybe the license.
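To make the separate-table idea concrete, here is a minimal sketch using Python's built-in sqlite3. The table and column names (texts, subjects, text_id), the helper titles_for_subject, and the sample rows are all made up for illustration; this is not the library's actual schema.

```python
import sqlite3

# Hypothetical schema: one row per text, one row per (text, subject) pair.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE texts (
        text_id INTEGER PRIMARY KEY,
        title   TEXT NOT NULL
    );
    CREATE TABLE subjects (
        text_id INTEGER NOT NULL REFERENCES texts(text_id),
        subject TEXT NOT NULL,
        PRIMARY KEY (text_id, subject)
    );
""")

# Illustrative rows only; the ids, titles and subject strings are invented.
conn.execute("INSERT INTO texts VALUES (1, 'Example Text A')")
conn.execute("INSERT INTO texts VALUES (2, 'Example Text B')")
conn.executemany(
    "INSERT INTO subjects VALUES (?, ?)",
    [(1, "Law"), (1, "History"), (2, "History")],
)


def titles_for_subject(connection, subject):
    """Return the titles of all texts tagged with the given subject."""
    rows = connection.execute(
        "SELECT t.title FROM texts t "
        "JOIN subjects s ON s.text_id = t.text_id "
        "WHERE s.subject = ? ORDER BY t.title",
        (subject,),
    )
    return [title for (title,) in rows]


history_titles = titles_for_subject(conn, "History")
```

A text with several subjects simply gets several rows in the subjects table, so the main table keeps exactly one row per text.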
Hi Matthew - I haven't worked on this project in a while, so let me take some time to look into this and I'll get back to you soon.
Can I get a few more pointers about getting this set up? I'd like to try adding SPARQL to grab the licensing terms:
ebook = next(iter(rdf_graph.query('''
    SELECT
        ?ebook
        ?author
        ?title
        ?rights
    WHERE {
        ?ebook a pgterms:ebook.
        OPTIONAL { ?ebook dcterms:creator [ pgterms:name ?author ]. }
        OPTIONAL { ?ebook dcterms:title ?title. }
        OPTIONAL { ?ebook dcterms:rights ?rights. }
    }
    LIMIT 1
''')))
What is the best way to go about testing this with a range of sample RDF files? Are there structures in place for this?
@rdhyee and I are very interested in extending your library for GITenberg.
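One possible shape for such tests is a table of small inline RDF fixtures, each paired with the value the extraction should return. The sketch below is only an illustration: the fixtures are hand-written rather than real Project Gutenberg records, extract_rights is a hypothetical helper, and it uses the standard library's ElementTree as a stand-in for the SPARQL query so that it runs without extra dependencies.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by the RDF fixtures below.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}

RDF_TEMPLATE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/">
  <pgterms:ebook rdf:about="ebooks/0">{body}</pgterms:ebook>
</rdf:RDF>"""

# Hand-written fixtures (invented, heavily simplified): (ebook body, expected rights).
FIXTURES = [
    (
        "<dcterms:title>Sample</dcterms:title>"
        "<dcterms:rights>Public domain in the USA.</dcterms:rights>",
        "Public domain in the USA.",
    ),
    # A record without any rights statement should yield None.
    ("<dcterms:title>No rights here</dcterms:title>", None),
]


def extract_rights(rdf_xml):
    """Hypothetical helper: return the dcterms:rights text of the ebook, or None."""
    root = ET.fromstring(rdf_xml)
    node = root.find("pgterms:ebook/dcterms:rights", NS)
    return None if node is None else node.text


results = [extract_rights(RDF_TEMPLATE.format(body=body)) for body, _ in FIXTURES]
expected = [want for _, want in FIXTURES]
```

Comparing results against expected covers both the "field present" and "field missing" cases, and new edge cases only need a new fixture entry.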
@sethwoodworth - I'm currently refactoring the way in which meta-data is extracted and exposed through the library: see #11. The refactor will make it a lot easier to extend the library (more testable, less convoluted, etc.), so I'd recommend holding off on this issue until the refactor is live.
My first implementation of the library wasn't particularly good - classic "works for me in this very particular situation" code, not particularly fit for general consumption. I was somewhat surprised that this library is actually useful to people. This refactor will hopefully fix the initial design flaws and enable people such as yourself to be productive with the library.
The bulk of the refactor is done - I expect the changes to go live towards the end of the week.
Fantastic. I will keep my eyes open for those changes. In the meantime, @rdhyee made a gist with the SPARQL needed for more metadata properties.
Preview:
qres = g.query('''
    SELECT
        ?ebook
        ?author
        ?author_webpage
        ?title
        ?friendlytitle
        ?language
        ?description
        ?rights
        ?type
        ?issued
        ?downloads
    WHERE {
        ?ebook a pgterms:ebook.
        OPTIONAL { ?ebook dcterms:creator [ pgterms:name ?author ]. }
        OPTIONAL { ?ebook dcterms:creator [ pgterms:webpage ?author_webpage ]. }
        OPTIONAL { ?ebook dcterms:title ?title. }
        OPTIONAL { ?ebook dcterms:friendlytitle ?friendlytitle. }
        OPTIONAL { ?ebook dcterms:language [ rdf:value ?language ]. }
        OPTIONAL { ?ebook dcterms:description ?description. }
        OPTIONAL { ?ebook dcterms:rights ?rights. }
        OPTIONAL { ?ebook dcterms:type [ rdf:value ?type ]. }
        OPTIONAL { ?ebook dcterms:issued ?issued. }
        OPTIONAL { ?ebook pgterms:downloads ?downloads. }
    }
    LIMIT 1
''')
Currently this library only makes use of the author and title meta-data exposed by Project Gutenberg and does not leverage information such as genre, publication date, etc.
Making this information usable by the library is a pretty straightforward three-step process:
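1. The TextSource.textinfo_converter method needs to be extended to parse the new meta-data attributes.
2. The new attributes need to be added to the TextInfo class.
3. Query methods for the new attributes need to be added to the Corpus class (such as texts_for_genre or texts_for_year).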
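As a rough illustration, one reading of this process is a converter that parses the extra attributes, a TextInfo record that carries them, and Corpus methods that query them. The sketch below borrows the names TextInfo, Corpus, textinfo_converter, texts_for_genre and texts_for_year from the description, but everything else is an assumption: the field names, the dict-shaped input records and the in-memory storage are invented, and this is not the library's actual implementation.

```python
from collections import namedtuple

# Step 2 (hypothetical): a TextInfo record extended with two new fields.
TextInfo = namedtuple("TextInfo", "uid title author genres year")


def textinfo_converter(record):
    """Step 1 (hypothetical): parse the new attributes from a raw record.

    The record is assumed to be a plain dict; the keys used here are invented.
    """
    return TextInfo(
        uid=record["uid"],
        title=record.get("title"),
        author=record.get("author"),
        genres=frozenset(record.get("subjects", ())),
        year=record.get("issued_year"),
    )


class Corpus:
    """Step 3 (hypothetical): query methods over the converted records."""

    def __init__(self, records):
        self._texts = [textinfo_converter(r) for r in records]

    def texts_for_genre(self, genre):
        # A text can carry several genres, so test membership.
        return [t for t in self._texts if genre in t.genres]

    def texts_for_year(self, year):
        return [t for t in self._texts if t.year == year]


# Invented sample records, for illustration only.
corpus = Corpus([
    {"uid": 1, "title": "A", "author": "X", "subjects": ["Law"], "issued_year": 2006},
    {"uid": 2, "title": "B", "author": "Y", "subjects": ["Law", "History"], "issued_year": 2010},
])
law_uids = [t.uid for t in corpus.texts_for_genre("Law")]
```

Storing genres as a set per text keeps texts_for_genre a simple membership test even when a text has multiple subjects.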