bookwyrm-social / bookwyrm

Social reading and reviewing, decentralized with ActivityPub
http://joinbookwyrm.com/

Dealing with Duplicate Authors #1119

Open TomatDividedBy0 opened 3 years ago

TomatDividedBy0 commented 3 years ago

Currently an author can be registered separately as Soren Kierkegaard, Søren Kierkegaard, and Sören Kierkegaard, all of which are treated separately and will have books split among them.

Giving users the ability to mark Soren Kierkegaard and Sören Kierkegaard as "aliases" of Søren Kierkegaard would make the cataloguing a lot cleaner.

kopischke commented 3 years ago

With apologies for chiming in on this issue unprompted, I’d like to make the observation that BookWyrm is already halfway there: it is possible to add aliases to an author – they just aren’t hooked up to anything (example – there are author entries for the aliases on that page, but no way to tell).

What is missing is a canonical author representation that integrates all aliases, so that any version of the name points to the same author page. I have no idea how much that would complicate the model, so I will refrain from judging how hard this would be to implement. I will add that there is already a UX workflow for confirming this kind of linking (the “is this a known author” dialog BookWyrm presents when editing a book’s author), so the change in the model is essentially all that is needed AFAICS.

hughrun commented 2 years ago

Ideally we'd hook into the ISNI database to dedupe authors, but this is one of the hard problems of bibliographic data science so I'm not sure how feasible that is for Bookwyrm's workflow.

mouse-reeve commented 2 years ago

If there's a viable way to legitimately access the ISNI data, I think it could definitely be incorporated (and super valuable). I couldn't find any information on whether they sanction programmatic access or offer API keys, though, and my experience with things in the general OCLC sphere is that they aren't very accessible to people outside of institutions.

hughrun commented 2 years ago

I'm happy to play around with this and report back!

The docs are pretty obscure but as best I can tell we just use an SRU request to a completely open API. Unfortunately it returns XML with an XSLT stylesheet but, well, that's OCLC and the library metadata world for you. For example:

http://isni.oclc.org/sru/?query=pica.nw+%3D+%22Soren+Kierkegaard%22&operation=searchRetrieve&recordSchema=isni-b

There also appears to be a mirror at isni.oclc.nl
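As a rough sketch of how that request could be built from Python: the endpoint, the `pica.nw` index, and the `recordSchema=isni-b` parameter are copied from the example URL above, while the function name is made up for illustration. Fetching and parsing the XML response is deliberately left out.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL in this comment.
ISNI_SRU_ENDPOINT = "http://isni.oclc.org/sru/"


def build_isni_search_url(name: str) -> str:
    """Return an SRU searchRetrieve URL that looks up `name` in ISNI.

    Hypothetical helper: only the query parameters come from the thread.
    """
    params = {
        "query": f'pica.nw = "{name}"',
        "operation": "searchRetrieve",
        "recordSchema": "isni-b",
    }
    # urlencode percent-encodes the quotes and "=" and turns spaces into "+",
    # and encodes non-ASCII characters such as "ø" as UTF-8.
    return f"{ISNI_SRU_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_isni_search_url("Soren Kierkegaard"))
```

Called with "Soren Kierkegaard", this reproduces the example URL above.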

hughrun commented 2 years ago

Ok as you can see from the draft PR, I have a proof of concept for this, though at present it doesn't do anything particularly useful other than display data in the book editing UI:

[Screenshot: søren]

We can search for authors in the ISNI database with a free GET request, no API keys required. Then we can display their brief description/bio with a link to their ISNI page. If readers can select the correct author from this list, that will go a long way to reducing duplications, and as @kopischke noted we're already most of the way there: author records already have an isni field and an aliases field, and we can fill or enrich those values from ISNI.

hughrun commented 2 years ago

Don't worry about the weird encoding, I worked it out: utf8

hughrun commented 2 years ago

#1581 should reduce the frequency of this problem but doesn't actually provide a way to merge already-existing records. I'll have more of a think about that: it's probably something we'd want admins to do rather than just anyone, but it needs to be easy (like a yes/no checkbox), and it should probably use some kind of scheduled or run-on-demand background task.

kopischke commented 2 years ago

#1581 should reduce the frequency of this problem but doesn't actually provide a way to merge already-existing records.

Just spitballing here, but one thing we might be able to leverage is the existing alias system, i.e. wherever aliases overlap author names, queue the entries for possible merging.

it's probably something we'd want admins to do rather than just anyone

The question of who should be able to do this is an interesting one that I feel might go beyond this specific issue. As of now, due to its small size and devoted community, spam, vandalism and ill-intentioned manipulation of data are not an issue on BookWyrm, but if Mastodon (or Goodreads, for that matter) is any indication, that will change once it gains more traction.

hughrun commented 2 years ago

@kopischke yep, the existing name-or-alias query is pretty good. The problem is that names are not unique! So without a truly unique identifier you really need a human to eyeball each potential match at some point. My comment that you quote was probably a bit unclear: what I mean is that there is no way to merge records automatically in a guaranteed-to-be-correct way.

We definitely need another piece of functionality to manually merge them on the basis of some auto-generated helpful hints regarding potential matches - but I don't think the place for that is where I was working this time (the "edit book" workflow).

I realise this may look like a backwards way to come at the problem but I figured it's easier to clean up the old mess if you've at least partially stemmed the flow of new mess coming in.

kopischke commented 2 years ago

The problem is that names are not unique! So without a truly unique identifier you really need a human to eyeball each potential match at some point.

@hughrun totally agree, we can’t and shouldn’t do that automatically. My point was that once we have a system in place to attach an ISNI ID to a BookWyrm Author entity, which this PR provides, we’ll have a starting point for manual merging. By identifying Authors not having an ISNI ID attached, but having an alias matching one or more of those who do, we should get a pretty good base for identifying merge candidates. And yes, that queue should absolutely be reviewed manually – there’s simply too much context to mind (and possibly research for context to do) for anything else.
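To make the idea concrete, here is a minimal sketch of building such a review queue. The `Author` class is a hypothetical stand-in, not BookWyrm's actual model (the thread only confirms that author records have an `isni` field and an `aliases` field), and the matching is a plain case-insensitive comparison. Nothing is merged; the output is only a list of pairs for a human to check.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Author:
    # Hypothetical stand-in for an author record; field names are assumed.
    id: int
    name: str
    aliases: list[str] = field(default_factory=list)
    isni: Optional[str] = None


def known_names(author: Author) -> set[str]:
    """All names an author is known by, case-folded for comparison."""
    return {n.casefold().strip() for n in [author.name, *author.aliases]}


def merge_candidates(authors: list[Author]) -> list[tuple[Author, Author]]:
    """Pair each author lacking an ISNI with every ISNI-identified author
    that shares a name or alias with it. The result is a queue for manual
    review, not a list of merges to apply."""
    identified = [a for a in authors if a.isni]
    unidentified = [a for a in authors if not a.isni]
    queue = []
    for candidate in unidentified:
        for target in identified:
            if known_names(candidate) & known_names(target):
                queue.append((candidate, target))
    return queue
```

Note that exact matching will not pair "Søren" with "Soren" unless one of them is recorded as an alias of the other, which is precisely where the alias field earns its keep.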

I realise this may look like a backwards way to come at the problem but I figured it's easier to clean up the old mess if you've at least partially stemmed the flow of new mess coming in.

Again, couldn’t agree more, and I think it actually is the right way around. I was just trying to get the ball rolling for the step after that, which admittedly might be a bit premature.

TomatDividedBy0 commented 2 years ago

#1581 should reduce the frequency of this problem but doesn't actually provide a way to merge already-existing records.

Just spitballing here, but one thing we might be able to leverage is the existing alias system, i.e. wherever aliases overlap author names, queue the entries for possible merging.

it's probably something we'd want admins to do rather than just anyone

The question of who should be able to do this is an interesting one that I feel might go beyond this specific issue. As of now, due to its small size and devoted community, spam, vandalism and ill-intentioned manipulation of data are not an issue on BookWyrm, but if Mastodon (or Goodreads, for that matter) is any indication, that will change once it gains more traction.

If trust is an issue, there are ways to manage that while still getting the benefits of crowdsourcing the archiving. Some long-term ideas I can spitball:

This does make me wonder, though: how does the federated model currently handle authors/books, given that you have a common pool of data across multiple instances?

jfinkhaeuser commented 2 years ago

I'm not sure if this has been considered, but I also see duplicate authors with the exact same spelling which, based on the works associated with them, are obviously meant to be the same person. It'd be nice to be able to merge them.

But let's not do this automatically! There are at least two distinct authors called "Rick Wayne", for example! These need to be kept separate. It'd be similarly good to have an option to, essentially, say "this book is actually by a different author of the same name" - that's possible by editing the book, removing the author, re-adding the author, and then choosing "this is a new author". A bit cumbersome, but at least it already works.

mouse-reeve commented 2 years ago

I do plan to add some automatic merging, but it will never happen based on author name -- entries would only be merged if they shared a much more reliable unique identifier like the same ISNI or wikipedia entry. And I agree that automatic merging doesn't fully address the problem that manual merging would.
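A minimal sketch of that rule, assuming authors are available as simple `(id, name, isni)` tuples (a made-up shape for illustration, not BookWyrm's schema): entries are grouped only when they carry the same ISNI, and the name is never consulted.

```python
from collections import defaultdict
from typing import Iterable, Optional


def groups_sharing_isni(
    authors: Iterable[tuple[int, str, Optional[str]]]
) -> list[list[int]]:
    """Return lists of author ids that share one ISNI.

    Authors without an ISNI are never grouped, however similar their
    names are. Spaces are stripped so that an ISNI written in four-digit
    blocks compares equal to the same ISNI written without spaces.
    """
    by_isni = defaultdict(list)
    for author_id, _name, isni in authors:
        if isni:
            by_isni[isni.replace(" ", "")].append(author_id)
    return [ids for ids in by_isni.values() if len(ids) > 1]
```

Two "Rick Wayne" entries with different (or missing) ISNIs stay separate under this rule, while differently spelled entries carrying the same ISNI end up in one group.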

jfinkhaeuser commented 2 years ago

Another related suggestion: there is, for example, Andre Norton for whom a duplicate entry exists on OpenLibrary. This was imported into bookwyrm, so that some works are attributed to "Andre Norton (duplicate)" (I edited those, there weren't too many).

In addition to the aforementioned aliases, and in light of such issues, it might be good to treat authors a bit like the work/edition split. That is, have a canonical author profile to which other profiles can be linked as duplicates.

When editing an author profile, I imagine there are two options, both of which have a use:

Furthermore:

The rationale here is that it'd be perfectly possible to keep distinct spellings as distinct profiles, if it's a language thing that keeps producing these things - but if it's just users being inattentive or import sources having issues, you can converge on a saner data set over time.
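One way to picture the canonical-profile idea is a simple "duplicate of" link that is followed until a profile with no such link is reached. This is only a sketch under that assumption: the `duplicate_of` mapping and the function name are hypothetical and do not describe anything that exists in BookWyrm.

```python
def resolve_canonical(author_id: int, duplicate_of: dict[int, int]) -> int:
    """Follow duplicate links from `author_id` to its canonical profile.

    `duplicate_of` maps a duplicate profile's id to the id of the profile
    it was linked to. A profile that is not a key in the mapping is
    treated as canonical. Raises ValueError if the links form a cycle.
    """
    seen = set()
    while author_id in duplicate_of:
        if author_id in seen:
            raise ValueError(f"cycle in duplicate links at author {author_id}")
        seen.add(author_id)
        author_id = duplicate_of[author_id]
    return author_id
```

With `duplicate_of = {2: 1, 3: 2}`, profiles 2 and 3 both resolve to profile 1, while profile 1 resolves to itself. Distinct spellings kept as distinct profiles would simply have no entry in the mapping.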

mxamber commented 1 year ago

Any progress here? There are two Annika Brockschmidts and half a dozen Mark Brandis on bookwyrm.social (not to mention garbage imports from OpenLibrary and Inventaire) and importing my collection is currently being a major pain in the arse, because even though all books are attributed correctly, the empty author profiles with 0 books won't go away.

skobkin commented 1 year ago

I regularly have the same problem with Japanese light novels when importing them to the instance I have an account on. After import, new books are bound to the wrong author, and I need to delete and re-add the author for each book to keep "series" working. It could also be a good idea to allow selecting authors on import, like when adding new books.

arkhi commented 1 year ago

I regularly have the same problem with Japanese light novels when importing them to the instance I have an account on. After import, new books are bound to the wrong author, and I need to delete and re-add the author for each book to keep "series" working. It could also be a good idea to allow selecting authors on import, like when adding new books.

I have been doing similar work on bookwyrm.social. Every time a book is imported into BookWyrm, it creates a new author, regardless of whether an author with the same name already exists.

Would it be possible to hook the existing-author check (the one used when adding or editing a book) into the import workflow?

joric3 commented 3 months ago

Dealing with duplicates is still cumbersome. As a short-term improvement, it would be helpful if the choices to link an author to existing entries (or declare it's a new person) listed how much additional information (aliases, external IDs...) each existing entry has, or how many titles it is already linked to. When entering new books, the UI currently doesn't help me decide which of the existing entries (e.g. four which are all for the same author) is the best-curated one.

A long-term improvement would definitely be to find a way to curate author entries across instances by linking them to ISNI IDs or other identifiers (maybe Wikidata and/or VIAF). Person entries without external IDs should be a "last resort".
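As an illustration of the short-term idea, a sketch of how each existing entry could be summarised and the candidates ordered. The `AuthorEntry` class, its fields, and the ordering (external IDs first, then aliases, then linked books) are all assumptions made for this example, not BookWyrm's model or UI.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorEntry:
    # Hypothetical shape for an existing author entry shown in the dialog.
    name: str
    aliases: list[str] = field(default_factory=list)
    external_ids: dict[str, str] = field(default_factory=dict)
    book_count: int = 0


def curation_summary(entry: AuthorEntry) -> str:
    """One-line hint showing how much data an entry already carries."""
    return (
        f"{entry.name} ({len(entry.external_ids)} external IDs, "
        f"{len(entry.aliases)} aliases, {entry.book_count} books)"
    )


def best_curated_first(entries: list[AuthorEntry]) -> list[AuthorEntry]:
    """Order entries so the most thoroughly curated one is listed first."""
    return sorted(
        entries,
        key=lambda e: (len(e.external_ids), len(e.aliases), e.book_count),
        reverse=True,
    )
```

Whether external IDs should outrank the number of linked books is a judgment call; the point is only that the dialog would show enough to choose between otherwise identical names.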

naught101 commented 2 months ago

this is one of the hard problems of bibliographic data science

Perhaps it would be worth letting normal users suggest merges, which could then be reviewed by moderators? This would have a higher chance of being right than anything automated, I think.

Also, MusicBrainz might be a good example to look at for how to manage duplicate-sounding authors. It uses a de-duplication system that adds extra metadata to an artist to allow disambiguation. The disambiguation text is usually free-form, but often contains information on the artist's genre, dates, or place of origin. So BookWyrm might have something like (examples from here):

Texts like this are what pop up in the autocomplete when you're adding a new recording/work on MusicBrainz.