Closed chriswait closed 7 years ago
Hi @chriswait. Thanks for getting in touch.
One way to implement fuzzy matching would be to provide an alternative implementation of `MetadataExtractor.get_etexts` that uses SPARQL instead of tuple-indexing for retrieval.
Let me know what you find out!
@chriswait Given that it's been over a year, I'm going to close this issue. If you're still interested in implementing this (and need some pointers on how to get started), please get in touch.
Note that we've recently made querying by author name more like what you'd expect (see #43) which should alleviate the problem you mentioned: querying for "Melville, Herman" now returns the correct results.
@c-w thanks for following this up! It looks like #43 is indeed pretty close to what I wanted at the time.
@c-w I'm hoping to get texts by subject using fuzzy matching, rather than having to know the exact subject strings. Does your suggestion from two years ago still apply?
> One way to implement fuzzy matching would be to provide an alternative implementation of MetadataExtractor.get_etexts that uses Sparql instead of tuple-indexing for retrieval.
Thanks!
@wwymak Yes, that suggestion still applies.
It seems that at different times in this project's development, fuzzy searches on an author's name have been discussed and were even implemented at one stage (using `WHERE author LIKE` in https://github.com/c-w/Gutenberg/blob/710052ce5cab7ea45b101ab756c7f1b29091236a/gutenberg/corpus.py#L48), so I apologise if this is redundant.
My use case for this project is essentially downloading the full text of works written by a given author. Judging by the README, `get_etexts("author", "Melville, Hermann")` provides the file IDs, which can then be easily downloaded.
However, `get_etexts("author", "Melville, Herman")` (i.e. dropping the second "n", which is how the author's name is actually shown at http://www.gutenberg.org/ebooks/author/9) returns an empty frozenset.
Should I assume that this project currently does not support this feature? If it doesn't, I'd be happy to contribute, but I don't currently have any experience with the RDF-based implementation.
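As a possible stop-gap on the caller's side, the misspelling case described here could be handled by snapping the query to the closest known author name before calling `get_etexts`. The sketch below uses Python's standard `difflib` for this, which is a different technique from the project's RDF-based lookup. The `closest_author` helper and the `KNOWN_AUTHORS` list are hypothetical; in practice the list of indexed names would have to come from the metadata itself.

```python
import difflib

# Hypothetical sample of author names as they appear in the index.
# This list is an assumption for illustration only.
KNOWN_AUTHORS = ["Melville, Hermann", "Austen, Jane", "Twain, Mark"]


def closest_author(query, known_authors, cutoff=0.8):
    """Return the known name most similar to `query`, or None if nothing
    reaches the similarity cutoff (hypothetical helper, not library API)."""
    matches = difflib.get_close_matches(query, known_authors, n=1, cutoff=cutoff)
    return matches[0] if matches else None


canonical = closest_author("Melville, Herman", KNOWN_AUTHORS)
# `canonical` could then be passed on, e.g. get_etexts("author", canonical)
```

The cutoff of 0.8 is an arbitrary choice: a higher value rejects more near-misses, while a lower one risks matching the wrong author.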