LibriVox / librivox-catalog

LibriVox catalog and reader workflow application
https://librivox.org
MIT License

Finding librivox recordings for a project gutenberg book #200

Closed doug-wade closed 2 months ago

doug-wade commented 4 months ago

I'm working on a tool to find the top x books from project gutenberg matching a search term or topic, and I'm having trouble with false negatives: books that appear in both the project gutenberg catalog and the librivox catalog, but that return no match when I search librivox by title. For example, when I run the tool for all shelves that contain the substring children, the first result is "A Christmas Carol in Prose; Being a Ghost Story of Christmas by Dickens, Charles (https://www.gutenberg.org/ebooks/46.html.images)". However, when I search librivox for this title, I don't get any results, I think because it's listed as "A Christmas Carol" rather than "A Christmas Carol in Prose; Being a Ghost Story of Christmas".

I would like to request a new feature, a new search param in the url called projectgutenbergid. I would be able to make a request like:

curl "https://librivox.org/api/feed/audiobooks/?projectgutenbergid=46&format=json&limit=1"

And get a response like

{
  "books": [
    {
      "id": "140",
      "title": "Christmas Carol",
      "description": "A classic tale of what comes to those whose hearts are hard. In a series of ghostly visits, Scrooge visits his happy past, sees the difficulties of the present, views a bleak future, and in the end amends his mean ways. (Summary written by Kristen McQuillin)",
      "url_text_source": "https://www.gutenberg.org/etext/46",
      "language": "English",
      "copyright_year": "1843",
      "num_sections": "5",
      "url_rss": "https://librivox.org/rss/140",
      "url_zip_file": "https://www.archive.org/download/A_Christmas_Carol/A_Christmas_Carol_64kb_mp3.zip",
      "url_project": "https://en.wikipedia.org/wiki/A_Christmas_Carol",
      "url_librivox": "https://librivox.org/a-christmas-carol-by-charles-dickens/",
      "url_other": "",
      "totaltime": "3:14:29",
      "totaltimesecs": 11669,
      "projectgutenbergid": "46",
      "authors": [
        {
          "id": "91",
          "first_name": "Charles",
          "last_name": "Dickens",
          "dob": "1812",
          "dod": "1870"
        }
      ]
    }
  ]
}

I was looking at the librivox recording details page (for example this one), and I see that the "links" section has an "online text" link pointing to project gutenberg. If I understand correctly, that means the database already has the data to support such an option, though it might not be 100% accurate, since the librivox folks may have linked to a different version of the online text. (edit: also, it's already in the api as url_text_source 🤣)

If the project is willing to support this feature, I'd be interested in contributing.

redrun45 commented 4 months ago

Well, we don't have much in terms of supporting features, and I can't say how much of a priority it would be for other volunteers, but let's talk.

Just to be sure this angle is covered: I see your edit. Is there any chance you could reasonably parse and reconstruct the url_text_source to search by? To the best of my (limited) knowledge, Gutenberg IDs don't exist as separate objects in the database; they would need to be parsed out on one end or the other. 😄

I'll note that some of our database entries point to the actual text in one of the various formats Gutenberg makes available, along these lines: https://www.gutenberg.org/cache/epub/26090/pg26090-images.html

...but most of them are supposed to link to the book's overview page, like so: https://www.gutenberg.org/ebooks/46
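For anyone doing the parsing on the client side, here is a minimal sketch in Python of pulling a Gutenberg ID out of url_text_source. It only covers the URL shapes that appear in this thread (/ebooks/46, /etext/46, /ebooks/46.html.images and /cache/epub/26090/pg26090-images.html); the catalog may well contain others. The function names gutenberg_id and books_for_gutenberg_id are made up for illustration and are not part of the LibriVox API or codebase.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Path shapes seen in this thread only; this list is an assumption,
# not a complete description of what the catalog stores.
_GUTENBERG_PATH_PATTERNS = [
    # /ebooks/46, /etext/46, /ebooks/46.html.images
    re.compile(r"^/(?:ebooks|etext)/(\d+)(?:[./].*)?$"),
    # /cache/epub/26090/pg26090-images.html
    re.compile(r"^/cache/epub/(\d+)/"),
]


def gutenberg_id(url_text_source: str) -> Optional[str]:
    """Return the Gutenberg ebook number as a string, or None if not recognised."""
    parsed = urlparse(url_text_source)
    host = (parsed.hostname or "").lower()
    # Ignore links that do not point at gutenberg.org at all.
    if host != "gutenberg.org" and not host.endswith(".gutenberg.org"):
        return None
    for pattern in _GUTENBERG_PATH_PATTERNS:
        match = pattern.match(parsed.path)
        if match:
            return match.group(1)
    return None


def books_for_gutenberg_id(books, wanted_id):
    """Filter decoded API book entries (dicts) by the ID parsed from url_text_source."""
    wanted = str(wanted_id)
    return [
        book for book in books
        if gutenberg_id(book.get("url_text_source") or "") == wanted
    ]
```

With a list of entries shaped like the "books" array in the example response above, books_for_gutenberg_id(books, 46) would keep the entry whose url_text_source is https://www.gutenberg.org/etext/46. As noted earlier in the thread, a match is only as reliable as the link a volunteer entered, so a different edition of the same text would be missed.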

redrun45 commented 3 months ago

@doug-wade - wanted to see if you still need anything from this end, or if that url_text_source is going to do the trick for you. 😃

redrun45 commented 2 months ago

Closing this issue for now. If you didn't get what you needed, do come back and say so. 😉