DobyTang / LazyLibrarian

This project isn't finished yet. Goal is to create a SickBeard, CouchPotato, Headphones-like application for ebooks. Headphones is used as a base, so there are still a lot of references to it.

Goodreads matching question #549

Closed WillowMist closed 7 years ago

WillowMist commented 7 years ago

Does LL read the metadata.db when scanning the library? If it does, would it be possible to pull the goodreads:#### id, and use that to match against Goodreads, then it could import a book even if it doesn't match the language or naming.

philborman commented 7 years ago

We get the goodreads id by asking goodreads for a list of all the author's books. Sadly the list doesn't include language. We get name, publishing date, number of pages, rating and other info, but not language.

We try to read the language from the book embedded metadata or .opf file, or if that fails we derive the language from the ISBN code.

If neither of those gives us anything we could try another call to goodreads to get the individual book page which often contains a language code, but goodreads api has strict limits so would have to be a last resort. One call per author isn't too bad, one call per book would exceed their daily limit, and also be very slow as we are limited to one call per second.

We could query LibraryThing for a list of other ISBNs for the book, one of those might provide a language code? They have similar limits, but we cache the isbn codes so that would rapidly reduce the number of hits.

I will have a play...
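
The one-call-per-second limit and the ISBN caching mentioned above could be combined roughly as follows. This is only a sketch: `Throttle` and `CachedLookup` are hypothetical names, not LazyLibrarian code, and `fetch` stands in for whatever remote request is actually made.

```python
import time


class Throttle:
    """Allow at most one call per `min_interval` seconds (hypothetical helper)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until `min_interval` has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


class CachedLookup:
    """Cache results so a repeated key never triggers a second remote call."""

    def __init__(self, fetch, throttle):
        self._fetch = fetch        # assumption: stand-in for the remote request
        self._throttle = throttle
        self._cache = {}
        self.remote_calls = 0

    def get(self, key):
        if key not in self._cache:
            self._throttle.wait()  # only throttle when we really go remote
            self._cache[key] = self._fetch(key)
            self.remote_calls += 1
        return self._cache[key]
```

Because the throttle is only consulted on a cache miss, repeated lookups of the same key cost neither time nor API quota.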

WillowMist commented 7 years ago

Sorry, I was talking about a book that is already in a calibre library. If you've used the Goodreads metadata plugin, the opf should have a tag that looks like this:

18630686

My question is, when scanning a book folder, could that file be checked, and then use that ID to pull the info from GoodReads, with (I'm guessing) https://www.goodreads.com/book/show.xml?key=&id=18630686

philborman commented 7 years ago

We check that metadata file, but it doesn't always have a language code in it. If there's no language in it we check the goodreads page using the goodreads id as you guessed, but that does not always return a language in the xml either.

WillowMist commented 7 years ago

Ah, I see where we're not connecting. Language isn't relevant to my question, I just mentioned it because you said several books don't include language. I just mean as a means of scraping a book, and adding it to the library if it doesn't exist, regardless of if it has a language set or not. If you have the goodreads id already, the book match should be essentially already done for you. You're just pulling meta from goodreads.

WillowMist commented 7 years ago

Followup: I'm guessing that currently LibraryScan doesn't pull from GoodReads except when importing the author, then book matching itself is done against the internal database, not going out to goodreads. Which makes sense... So... Is there a field that the GoodReads ID could be stored in when the author is imported? Then:

When matching a book, if there's a goodreads ID in metadata, match against that. If it doesn't find it, try pulling the book from GoodReads, adding it to the appropriate author, and linking the book. If that doesn't work, or the GoodReads ID doesn't exist, proceed as normal, start title matching.

I'm looking at the code, but I'm not super familiar with dealing with API results in python.
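
The matching order proposed above (ID in the database, then import by ID, then title matching) can be sketched like this. Everything here is an assumption for illustration: `match_book`, the plain dict used as a database, and the two callables are hypothetical, not LazyLibrarian's actual functions.

```python
def match_book(meta, db, fetch_by_id, fuzzy_match):
    """Return (bookid, how) for a scanned file.

    meta        -- dict read from the book's metadata; may contain 'gr_id'
    db          -- dict of bookid -> record, standing in for the books table
    fetch_by_id -- callable returning a record for a Goodreads ID, or None
    fuzzy_match -- callable doing the usual author/title matching
    """
    gr_id = meta.get('gr_id')        # None when the metadata has no Goodreads ID
    if gr_id:
        if gr_id in db:
            return gr_id, 'database'
        record = fetch_by_id(gr_id)  # lookup by ID, ignoring language prefs
        if record is not None:
            db[gr_id] = record       # add the book, then link to it
            return gr_id, 'goodreads'
    # No ID, or the lookup failed: proceed as normal with title matching.
    return fuzzy_match(meta, db), 'fuzzy'
```

Using `meta.get('gr_id')` means a missing ID simply falls through to title matching instead of raising an error.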

philborman commented 7 years ago

We use the goodreads author xml pages to add books to the library. This gives us a list of the author's books with their goodreads IDs and some other info, but we reject any that don't match the "user prefs", which may include the user's preferred languages. If you're getting an author imported correctly, but not all of their books, missing or incorrect language is the most likely reason.

Other common reasons, in no particular order:

Books with multiple authors (though that seems to be largely resolved by matching to the first named author like Calibre does)

Incorrect data at goodreads (eg books attributed to the wrong author)

More than one author with that name at goodreads, eg there are several Brian Cox

More than one naming of an author at goodreads, eg James Lovelock and James E Lovelock (who are the same person) so some books are under one name, some under the other name, and some under both!

Incorrect or incomplete metadata embedded in the book and/or metadata.opf

Goodreads doesn't have the book listed

I've found another site that gives more info on matching book isbn to language, which might help a bit, but not all books have an isbn provided either :-( It's all a bit of a mess.

WillowMist commented 7 years ago

I think goodreads is still the best resource, despite its inconsistencies. I'm just trying to find a way to match books that have weird naming (for example, Doctor Who: The Time War still imports as Doctor Who, with a subtitle of The Time War, and other books titled Doctor Who: don't appear to get imported). This way if your metadata has a goodreads ID that isn't in the database already, LL can pull it in anyway.

So, it looks like the bookid field in the database is the goodreads ID, if you use GoodReads as your API... It looks like an import_book function could be added to gr.py that just pulls data by book ID, and adds it to the database. Now, I don't know if you'd need to verify the author exists in the DB first....

philborman commented 7 years ago

Inline quoting below, bad practice, I know :-)

On 22/12/16 14:26, DarkSir23 wrote:

> Followup: I'm guessing that currently LibraryScan doesn't pull from GoodReads except when importing the author, then book matching itself is done against the internal database, not going out to goodreads. Which makes sense... So... Is there a field that the GoodReads ID could be stored in when the author is imported? Then:

Yes, that's correct. But see earlier reply: we don't store all books when importing the author, only the ones that match prefs. Goodreads ID is the bookID, unless you imported from googlebooks, in which case it's the googlebooks ID.

> When matching a book, if there's a goodreads ID in metadata, match against that. If it doesn't find it, try pulling the book from GoodReads, adding it to the appropriate author, and linking the book. If that doesn't work, or the GoodReads ID doesn't exist, proceed as normal, start title matching.

Yes, that's do-able. If we don't get a match in our database, try goodreads again ignoring language or other prefs. That's how the manual import works if you add a book from a search result.

WillowMist commented 7 years ago

Bad practice, but I'm still a fan of inline quoting :)

So the Import Book is already there, obviously, on manual search.

I'm happy to try to help, if there's much coding involved. Like I said, I'm still making sense of the api results in Python, but it's starting to get clearer. I'm just thinking this will make Calibre integration a lot smoother :) Hope I'm not asking too much.

philborman commented 7 years ago

On 22/12/16 14:36, DarkSir23 wrote:

> I think goodreads is still the best resource, despite its inconsistencies.

I agree...

> I'm just trying to find a way to match books that either have weird naming (Example, Doctor Who: The Time War still imports as Doctor Who, with a subtitle of The Time War, and other books titled Doctor Who: don't appear to get imported). This way if your metadata has a goodreads ID that isn't in the database already, LL can pull it in anyway.

Yes, but because goodreads doesn't separate out the subtitle we will still end up calling it Doctor Who, with a subtitle of The Time War. We could maybe add an "exclusion list" to the splitter, so we don't split certain titles, ie anything starting with "Doctor Who:"

> So, it looks like the bookid field in the database is the goodreads ID, if you use GoodReads as your API... It looks like an import_book function could be added to gr.py that just pulls data by book ID, and adds it to the database.

Already there for manual book importing, but title will still be split.

> Now, I don't know if you'd need to verify the author exists in the DB first....

Yes you do, but I think we handle that automatically if the author isn't known, I'd have to check.

philborman commented 7 years ago

No, I'm enjoying the challenge!

I did the changes manually on my database using calibre. I had several P.G. Wodehouse books that started with "Plum Punch: ", eg "Plum Punch: Life at Home", so I just altered them in Calibre to remove the colon, eg "Plum Punch - Life at Home", which side-stepped the problem. Only half a dozen books were affected so it was a simple fix.

The only other similar ones in my library were book author included in title, and that's easy to strip out, eg "Tom Clancy: Net Force" where we just strip the title to "Net Force"
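
A splitter that handles both cases discussed here (an author name used as a title prefix, and an exclusion list of prefixes that should not be split) might look like the sketch below. `split_title` and its `no_split_prefixes` parameter are hypothetical names for illustration; the real splitter in LazyLibrarian may work differently.

```python
def split_title(author, title, no_split_prefixes=()):
    """Return (name, subtitle) for a book title.

    no_split_prefixes is a hypothetical exclusion list: a title starting with
    one of these prefixes is kept whole instead of being split at the colon.
    """
    # Strip the author's name when it is used as a title prefix,
    # eg "Tom Clancy: Net Force" becomes "Net Force".
    author_prefix = author + ':'
    if title.lower().startswith(author_prefix.lower()):
        title = title[len(author_prefix):].strip()

    # Titles on the exclusion list are never split.
    for prefix in no_split_prefixes:
        if title.lower().startswith(prefix.lower()):
            return title, ''

    # Default behaviour: text after the first colon is treated as a subtitle.
    name, sep, subtitle = title.partition(':')
    if not sep:
        return title, ''
    return name.strip(), subtitle.strip()
```

With an exclusion list of `('Doctor Who:',)`, "Doctor Who: Endgame" stays a single title, while a title without a colon, such as "Plum Punch - Life at Home", is untouched either way.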

WillowMist commented 7 years ago

I'm not as worried about the title splitting, as long as the book actually imports. I can always go in and clean up titles as time allows. That reminds me, could you add the Edit Book link to the Books page display, as well as the Author individual pages?

philborman commented 7 years ago

Lack of import is probably down to the languages again.

I think there's a reason there isn't an edit option on the books page, maybe not enough info in that function call. I'd have to check.

WillowMist commented 7 years ago

Oh, I just noticed that the book name splitting in find_book needs to be updated to match your recent changes in get_author_books, too. Sorry, didn't think it was big enough to make a separate issue of :)

WillowMist commented 7 years ago

I think I see a path to making it work. Mind if I float it here, for critique?

philborman commented 7 years ago

Go ahead, always open to ideas

WillowMist commented 7 years ago

In librarysync/get_book_info:

# In the OPF section

elif 'identifier' in tag and 'goodreads' in attrib:
    res['gr_id'] = txt  # Possibly validate data first

In librarysync/LibraryScan :

# In metafile = opf_file(r)
if 'gr_id' in res:
    gr_id = res['gr_id']

after check_exist_author, change match = myDB.match('SELECT BookID FROM books where BookIsbn = "%s"' % isbn) to:

if 'gr_id' in res:  
    match = myDB.match('SELECT BookID FROM books where BookID = "%s"' % gr_id)
else:
    match = myDB.match('SELECT BookID FROM books where BookIsbn = "%s"' % isbn)

Then below logger.debug('Unable to find bookid %s in database' % bookid):

if 'gr_id' in res:
    # I haven't looked, but call whatever procedure webServ.py uses for 'addBook' with the bookid.
    pass

WillowMist commented 7 years ago

I see a couple of typos in there. Mostly in my usage of 'gr_id'. Made them all the same. Without trying it, I'm not sure if "if gr_id" will choke on an undefined variable, so I just made the tests against res.
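
For reference, the OPF step can be tried standalone with the standard library. This is a sketch under assumptions: the `opf:scheme="GOODREADS"` attribute in the sample is what calibre is assumed to write, the ID is assumed to be numeric, and `goodreads_id_from_opf` is a hypothetical name rather than anything in librarysync.

```python
import xml.etree.ElementTree as ET

# Namespace-qualified names as ElementTree reports them.
DC_IDENTIFIER = '{http://purl.org/dc/elements/1.1/}identifier'
OPF_SCHEME = '{http://www.idpf.org/2007/opf}scheme'


def goodreads_id_from_opf(opf_text):
    """Return the Goodreads ID found in OPF text, or None if there is none."""
    root = ET.fromstring(opf_text)
    for elem in root.iter(DC_IDENTIFIER):
        scheme = elem.get(OPF_SCHEME, '')
        value = (elem.text or '').strip()
        # Validate before trusting it (assumption: the ID is all digits).
        if scheme.lower() == 'goodreads' and value.isdigit():
            return value
    return None


# Assumed shape of a calibre metadata.opf, trimmed to the relevant parts.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:title>Example</dc:title>
    <dc:identifier opf:scheme="GOODREADS">18630686</dc:identifier>
  </metadata>
</package>"""

res = {}
gr_id = goodreads_id_from_opf(SAMPLE_OPF)
if gr_id:
    res['gr_id'] = gr_id
```

Testing `'gr_id' in res` (or `res.get('gr_id')`) afterwards is safe whether or not the tag was present, which avoids the undefined-variable worry.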

philborman commented 7 years ago

Yes, it would choke if undefined. Book name splitting is used in 3 places, so I moved it into a separate function, and I've added some code based on your ideas above. Unfortunately I can't test it, as my calibre-generated opf files don't contain the 18630686</dc:identifier> tags. How did you get those generated?

WillowMist commented 7 years ago

I'm using the goodreads metadata plugin. I think it comes with it, I don't recall installing it separately, but I did have to activate it. I can apply the changes locally and test, if you'd like to send me the changes.

philborman commented 7 years ago

I added the plugin and activated it, then rescanned the library, but only 40 or so of my books (out of 1500) have a goodreads link in their opf files. Seems fairly random too, not just new additions.

The latest code is in my git repo if you want to try it. https://github.com/philborman/LazyLibrarian.git

WillowMist commented 7 years ago

Hmm, I switched to having goodreads be my only scanner, that might be it. Everything I've scanned has gotten it, that I've checked.

WillowMist commented 7 years ago

It's finding a lot more books, that's for sure! I'm moving my library over to all goodreads metadata; as I do, I'll keep you posted. I'd say it's safe to push, as it should only mess with people if they have the GoodReads tags. If this works, I'd like to add suggestions for calibre users to the wiki.

WillowMist commented 7 years ago

Hmm, it is still having issues with several books that start with Doctor Who: -- Updating location.

Example:

22-Dec-2016 16:51:47 - DEBUG :: SCANAUTHOR : [/mnt/collection/Books/Organized/Main/Terrance Dicks] Now scanning subdirectory /Doctor Who_ Endgame (599)
22-Dec-2016 16:51:47 - DEBUG :: SCANAUTHOR : book meta [9780563538226] [en] [Terrance Dicks] [Doctor Who: Endgame] [epub]
22-Dec-2016 16:51:47 - DEBUG :: SCANAUTHOR : file meta [9780563538226] [eng] [Terrance Dicks] [Doctor Who: Endgame] [240950]
22-Dec-2016 16:51:47 - DEBUG :: SCANAUTHOR : Found Language [eng] ISBN [9780563538226]
22-Dec-2016 16:51:47 - DEBUG :: SCANAUTHOR : Already cached Lang [eng] ISBN [056]
22-Dec-2016 16:51:48 - DEBUG :: SCANAUTHOR : Fuzz match partial [99] [Doctor Who: Endgame] [Doctor Who]
22-Dec-2016 16:51:48 - WARNING :: SCANAUTHOR : Updating book location for Terrance Dicks Doctor Who: Endgame from /mnt/collection/Books/Organized/Main/Terrance Dicks/Doctor Who_ The Seeds of Death (320)/Doctor Who_ The Seeds of Death - Terrance Dicks.mobi to /mnt/collection/Books/Organized/Main/Terrance Dicks/Doctor Who_ Endgame (599)/Doctor Who_ Endgame - Terrance Dicks.epub
22-Dec-2016 16:51:48 - DEBUG :: SCANAUTHOR : Terrance Dicks Doctor Who: Endgame matched BookID 678168, [Terrance Dicks][Doctor Who]
22-Dec-2016 16:51:48 - DEBUG :: SCANAUTHOR : [/Doctor Who_ Endgame (599)] already scanned
22-Dec-2016 16:51:48 - DEBUG :: SCANAUTHOR : [/Doctor Who_ Endgame (599)] already scanned

It's getting the bookid: 240950, but then it's saying it matched 678168, so about 30 books are matching to that, and each one is replacing the last's location.

philborman commented 7 years ago

Matching a subset, they all match a book in your database with author Terrance Dicks and book title Doctor Who (after splitting at the colon)

Doctor Who Endgame matched 100% on Doctor Who, then loses a point for an extra word, ending up at a 99% match.

Not easy to get around that because mostly the part after the colon is a subtitle. Maybe we can add an exclusion list to lazylibrarian as I suggested before, or maybe just hand edit the lazylibrarian or calibre database to remove/replace the colon on Doctor Who books ( Doctor Who - Endgame will only be a 66% match for Doctor Who then)
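
The collision can be reproduced in miniature. This sketch uses difflib from the standard library, which is an assumption for illustration and not the fuzzy matcher LazyLibrarian uses, so the exact percentages will differ from the log; `similarity` and `split_at_colon` are hypothetical helpers.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Whole-string similarity as a rounded percentage (difflib-based)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def split_at_colon(title):
    """Mimic the title splitting described above: keep the part before ':'."""
    return title.partition(':')[0].strip()


# Every "Doctor Who: ..." title collapses to the same name after splitting,
# so each one is a perfect match for a database entry titled "Doctor Who".
collapsed = {split_at_colon(t) for t in ('Doctor Who: Endgame',
                                         'Doctor Who: The Seeds of Death')}

# With the colon replaced by a dash nothing is split off, and the whole
# string is compared instead, giving a much lower score.
dash_score = similarity('Doctor Who - Endgame', 'Doctor Who')
```

Here `collapsed` holds the single name "Doctor Who", while the dashed title scores roughly two thirds against it, in line with the figure quoted above.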

WillowMist commented 7 years ago

If it's matching on the BookID, should it be name matching at all?

WillowMist commented 7 years ago

Ok, I see what's happening... If the ID doesn't show up in the DB, it tries to match by name. I think it should try to add the book by the ID, if it exists, before trying to fuzzy match.

WillowMist commented 7 years ago

Ok, so if you change line 553 to if not match and not gr_id and add a definition for bookid around 370 or so, it will skip fuzzy matching when a gr_id exists, and pull the actual book from GoodReads.

WillowMist commented 7 years ago

I'm happy to report that with the above change, Calibre integration is going pretty smooth. If you have GoodReads as your only meta processor, it matches the primary author pretty well, adds a goodreads tag, pulls in books that have the same name as another (due to the : in the title). It can create duplicates in books, if the goodreads id points to a different edition of a book than the primary that LL imports, but only on authors that have already been imported before this change, I believe. The duplicates are easily removed, though.

philborman commented 7 years ago

Excellent news. I've just added the new code, but put the goodreads import higher up so that if it fails for any reason we can still try a fuzzy name match. Just running a few more tests before I post it to my git.

We try to disallow duplicates in the scanning if the bookid is already in the database, or if the author/title is already in the database, though it does need to be an exact match, so that we don't reject books in a series with only minor changes in name.

WillowMist commented 7 years ago

I've only seen it happen once or twice. Another way to avoid mismatching might be to match on subtitle as well as author and title.

WillowMist commented 7 years ago

This looks to be working pretty smoothly. Closing the ticket (just in time to open a new one)