c-w / gutenberg

A simple interface to the Project Gutenberg corpus.
Apache License 2.0

Fuzzy Searches on Author #39

Closed chriswait closed 7 years ago

chriswait commented 8 years ago

It seems that at different times in this project's development, fuzzy searches on an author's name have been discussed and were even implemented at one stage (using `WHERE author LIKE` in https://github.com/c-w/Gutenberg/blob/710052ce5cab7ea45b101ab756c7f1b29091236a/gutenberg/corpus.py#L48), so I apologise if this is redundant.

My use case for this project is essentially grabbing the text content of texts written by a given author. Judging by the README, `get_etexts("author", "Melville, Hermann")` provides the file IDs, which can then be easily downloaded.

However, `get_etexts("author", "Melville, Herman")` (i.e. dropping the second "n", which is how the author's name is actually shown at http://www.gutenberg.org/ebooks/author/9) returns an empty frozenset.

Should I assume that this project currently does not support this feature? If so, I'd be happy to contribute, although I don't have any experience with the current RDF-based implementation.
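For illustration, the difference between an exact lookup and an approximate one can be sketched with the standard library alone. This is not how the project resolves authors: `AUTHOR_INDEX`, `get_etexts_exact`, and `get_etexts_fuzzy` are hypothetical names, and the index contents are toy data that only mirror the spellings discussed above.

```python
from difflib import get_close_matches

# Hypothetical stand-in for an exact-match author index (toy data, not the
# project's real metadata; the etext IDs are illustrative).
AUTHOR_INDEX = {
    "Melville, Hermann": frozenset({15, 2701}),
    "Austen, Jane": frozenset({1342}),
}


def get_etexts_exact(author):
    """Return etext IDs only when the name matches a key exactly."""
    return AUTHOR_INDEX.get(author, frozenset())


def get_etexts_fuzzy(author, cutoff=0.8):
    """Return etext IDs for every indexed name that is similar to `author`.

    Similarity is difflib's ratio; `cutoff` is an assumed threshold that
    would need tuning against real author names.
    """
    matches = get_close_matches(author, list(AUTHOR_INDEX), n=3, cutoff=cutoff)
    result = frozenset()
    for name in matches:
        result |= AUTHOR_INDEX[name]
    return result


print(get_etexts_exact("Melville, Herman"))  # no exact key, so empty
print(get_etexts_fuzzy("Melville, Herman"))  # close enough to the indexed name
```

Matching on string similarity tolerates a dropped or doubled letter, at the cost of possibly returning texts by similarly named authors when the cutoff is too low.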

c-w commented 8 years ago

Hi @chriswait. Thanks for getting in touch.

One way to implement fuzzy matching would be to provide an alternative implementation of `MetadataExtractor.get_etexts` that uses SPARQL instead of tuple-indexing for retrieval.

Let me know what you find out!

c-w commented 7 years ago

@chriswait Given that it's been over a year, I'm going to close this issue. If you're still interested in implementing this (and need some pointers on how to get started), please get in touch.

Note that we've recently made querying by author name more like what you'd expect (see #43) which should alleviate the problem you mentioned: querying for "Melville, Herman" now returns the correct results.

chriswait commented 7 years ago

@c-w thanks for following this up! It looks like #43 is indeed pretty close to what I wanted at the time.

wwymak commented 5 years ago

@c-w I'm hoping to get texts by subject with fuzzy matching, rather than having to know the exact subject strings. So does your suggestion from two years ago:

> One way to implement fuzzy matching would be to provide an alternative implementation of MetadataExtractor.get_etexts that uses SPARQL instead of tuple-indexing for retrieval.

still apply?

Thanks!

c-w commented 5 years ago

@wwymak Yes, that suggestion still applies.