bookwyrm-social / bookwyrm

Social reading and reviewing, decentralized with ActivityPub
http://joinbookwyrm.com/

Duplicate book after manual addition #3019

Open prolibre opened 1 year ago

prolibre commented 1 year ago

Describe the bug

I think this is a fairly significant bug. When I add a book manually, I've noticed that it sometimes ends up duplicated on some instances, but not on all of them, and not always.

To Reproduce

  1. I've created a new book on my instance. A book that does not yet exist on external sources. https://bw.heraut.eu/book/26362/s/chroniques-poetiques-dun-voyage-a-montreal

  2. Two days later, I looked at bookwyrm.social and saw that the book is duplicated there (even though it's exactly the same book): https://bookwyrm.social/book/1427616/editions

  3. I looked on another instance and saw a single copy: https://bouquins.zbeul.fr/book/29947/s/chroniques-poetiques-dun-voyage-a-montreal

I've noticed this bug several times and it's a real nuisance. For example, I have two accounts on two instances that follow each other, and I sometimes see the book duplicated on my own timeline: for what should be a single book, I get two posts that lead to two different books.

I'm using a translator and I hope my message is understandable.

hughrun commented 1 year ago

We will need to do more testing but this sounds like it is a consequence of the federated data model. In this case, what I think may be happening is that bookwyrm.social is federated with more data sources than either bw.heraut.eu or bouquins.zbeul.fr, hence the duplication can be seen there but the match wasn't picked up by bw.heraut.eu when you added the book.

dato commented 1 year ago

I hadn't run into this when the issue was opened, but I saw something yesterday that makes me think it's not about data sources, but true duplication:

A book that I added manually on my single-user instance appears twice on a remote instance where a follower of mine marked it as "want to read" (and added it to a list). The two editions on the remote instance:

Link to my book:

Editions in the remote instance (identical, consecutive IDs):

(BookWyrm shows just one, when I did a search just now: https://bookwyrm.social/book/1411467/editions)

> We will need to do more testing

Sounds like this for sure...

prolibre commented 1 year ago

@dato Yes, that's exactly it. The duplicated books have very close IDs (21001 and 21002, for example). But what I don't understand is that it doesn't happen for every creation. I think there's something really wrong somewhere.

prolibre commented 1 year ago

Looking at Flower, I came across a consequence of these duplications: errors when updating books. When there is a duplicate, the update on one copy succeeds and the update on the duplicate fails.

bookwyrm.activitypub.base_activity.set_related_field FAILURE ('Edition', 'Work', 'parent_work', 'https://bw.heraut.eu/book/27656', 'https://bookwyrm.social/book/1431738')

bookwyrm.activitypub.base_activity.set_related_field SUCCESS ('Edition', 'Work', 'parent_work', 'https://bw.heraut.eu/book/27657', 'https://bookwyrm.social/book/1431738')

Here is the traceback for the error in question:

Traceback (most recent call last):
  File "(...)/bookwyrm/venv/lib/python3.11/site-packages/celery/app/trace.py", line 451, in trace_task
    R = retval = fun(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^
  File "(...)/bookwyrm/venv/lib/python3.11/site-packages/celery/app/trace.py", line 734, in __protected_call__
    return self.run(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "(...)/bookwyrm/bookwyrm/activitypub/base_activity.py", line 276, in set_related_field
    raise ValueError(f"Invalid related remote id: {related_remote_id}")
ValueError: Invalid related remote id: https://bw.heraut.eu/book/27656
dato commented 12 months ago

I guess there can be multiple causes for this (and multiple codepaths for sure), but at least on my instance I have several examples of duplication caused by a shelving event with a comment (see timestamps below).

I think that actions such as "Finish with comment" come via AP as a single status. But it seems as if... the storing in the database were happening independently?

These are the timestamps of such an event ("Finish with comment") that arrived on my system, followed in quick succession by a ReviewRating (I'm not sure whether the behavior triggers without this second event):

16:49:51.468302 - remote Comment object (published_date)
16:49:52.345812 - remote ShelfBook object (shelved_date)
16:49:52.340294 - local Shelf object created (`read-66`)
16:49:53.023008 - first Book created (id=20324)
16:49:53.753374 - local Comment object (references book 20324)
16:49:54.039233 - second Book created (id=20325)
16:49:54.649446 - remote ReviewRating object (published_date)
16:49:55.071635 - local ReviewRating object (references book 20324)
16:49:55.769966 - local ShelfBook object created (references book 20325)

I was debugging missing images on my feed when I came across this.

N.B.: The database already contained a Work for these editions, with an edition in a different language.

prolibre commented 11 months ago

I think it's a pretty big bug and I'm sorry I can't help solve it. Week after week the duplications pile up. Unfortunately my skills are in PHP and SQL, so I'm unable to help with this project. I regret it :-(

dato commented 11 months ago

I spent a couple of hours on this and... to the best of my knowledge, this is due to the deserialization of a Work's editions in parallel with an HTTP request. Consider:

  1. an incoming Add of a ShelfItem, for an Edition E that doesn't exist locally
  2. views.inbox.sometimes_async_activity_task() will perform the Add synchronously
  3. if a work for E doesn't exist already, it will be created from its remote URL
  4. the editions of the remote Work object will be queued for update-or-creation in Celery
    • this happens at ActivityObject.to_model because Work.deserialize_reverse_fields includes editions
      • these set_related_field tasks are enqueued while E is not yet created
  5. duplication occurs if a worker picks up the task for E before its (ongoing) creation is complete

If this diagnosis is correct… I'm not sure what the best fix would be: fixing the race condition itself, and/or reconsidering whether it makes sense to deserialize all editions at all.

This code works beautifully for importing complete objects from remotes but—speaking as an admin—works on big instances can have tens of editions, and it seems weird (and wasteful) that a Note or ShelfItem for a single one of them brings all of them to my server, particularly for small instances.

(For now I've dropped "editions" from Work.deserialize_reverse_fields on my instance, to see if the duplication stops.)
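For reference, a minimal sketch of that local change, assuming the attribute is declared on the Work model in bookwyrm/models/book.py (double-check against your version):

    # sketch of the local workaround in bookwyrm/models/book.py
    class Work(Book):
        # upstream lists "editions" here, which is what makes
        # ActivityObject.to_model enqueue one set_related_field
        # Celery task per edition of an imported Work:
        #   deserialize_reverse_fields = [("editions", "editions")]
        deserialize_reverse_fields = []  # workaround: don't pull in all editions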

prolibre commented 11 months ago

I think the explanation lies elsewhere :-( Sorry, I hope I'll make myself understood, @dato.

Looking at another example of duplication, I wonder if there isn't actually duplication on instance A (where the book is created), right from the start. Then, during exchanges (image below) with instance B, two books are created on B.

In this example, I follow Crapounifon on instance A (bookwyrm.social), so I receive their updates. Here, an update causes the creation of the book ("Qu'est ce qu'une nation") on my instance (B). And that's where the problems start.

[screenshot of the exchange between the two instances]

But instance A then links the two books into a single one... instance B doesn't.

If I look at this book on instance A (https://bookwyrm.social/book/1460971/s/quest-ce-quune-nation), I see that another ID (ID minus 1) exists in the database: https://bookwyrm.social/book/1460970/s/quest-ce-quune-nation ... and that ID (1460970) redirects to 1460971.

On the other hand, on B there are two distinct books: https://bw.heraut.eu/book/33664/editions

Did I make myself understood?

dato commented 11 months ago

Hi @prolibre! Yes, you were very clear. Thank you for taking the time to write it down. :)

> I think the explanation lies elsewhere

I've read everything and checked all the URLs, and I'll try to explain why what you observe is exactly what I described in my previous post.

A bit of BookWyrm internal terminology…


"Books" in BookWyrm are internally stored as two kinds of objects: Works, and Editions.

Edition is the name for what we normally think of as a book, with its own ISBN, language, format, year of publication, etc.

A Work is a more "abstract" entity that just serves to "group" all Edition objects that refer to the same... well: Work.

Because of this, any given book always has at least two IDs in a BookWyrm instance: one for the Edition, one for its associated Work.
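To make the "at least two IDs" point concrete, here is a simplified sketch of the relationship (not the real definitions in bookwyrm/models/book.py, which carry many more fields):

    # simplified sketch of BookWyrm's book data model (illustrative only)
    from django.db import models

    class Book(models.Model):
        # shared fields; origin_id records where a copied record came from
        title = models.TextField()
        origin_id = models.CharField(max_length=255, null=True)

    class Work(Book):
        """groups all the editions of the same work"""

    class Edition(Book):
        """a concrete published book: ISBN, language, format, ..."""
        parent_work = models.ForeignKey(
            Work, on_delete=models.PROTECT, null=True, related_name="editions"
        )

Because Work and Edition both inherit from Book, each gets its own row (and therefore its own ID) in bookwyrm_book; that is also why the bookwyrm_edition table joins back to it via book_ptr_id in the queries further down this thread.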

As per the terms above, there is a single Edition of this work in bookwyrm.social, with ID 1460971. On the other hand, ID 1460970 is not an "Edition" object but a "Work", which is just there to hold any future additional editions. See: https://bookwyrm.social/book/1460970/editions.

This work no. 1460970 becomes Work 33664 in your system. But remote edition no. 1460971 becomes Editions 33665 and 33666 on your instance, I'm pretty sure by the process I described yesterday.

> But instance A then links the two books into a single one... instance B doesn't.

This is the point that is cleared up by the terminology above: they are linked in the same way on both instances.

On bookwyrm.social, the one Work "has" the one Edition (IDs above). On yours, one Work "has" two Editions (IDs above). The way they are linked is the same. (The redirection you observe happens on your instance too, if you visit the Work as if it were an edition: https://bw.heraut.eu/book/33664/s/quest-ce-quune-nation redirects to 33665.)

At the database level, all this can be observed in the origin_id column: if more than one row has the same origin_id, this is the race condition/bug we're talking about. (This field doesn't get exposed in JSON format, otherwise I would point to it.)

prolibre commented 11 months ago

@dato Merci beaucoup! Thank you very much for these explanations. Little by little I'm making progress in my understanding of BookWyrm. Your explanations are very good.

prolibre commented 11 months ago

> This code works beautifully for importing complete objects from remotes but—speaking as an admin—works on big instances can have tens of editions, and it seems weird (and wasteful) that a Note or ShelfItem for a single one of them brings all of them to my server, particularly for small instances.

> (For now I've dropped "editions" from Work.deserialize_reverse_fields on my instance, to see if the duplication stops.)

@dato Could the action that you've blocked be moved to the scheduler instead (a task launched from time to time), so that it's carried out later in the day rather than immediately after creation? A sort of background task. But maybe I'm talking nonsense.

nesges commented 1 month ago

I'm not quite sure whether our issue is exactly the same, but we see a lot of duplicate books and authors on our BookWyrm installation. We have observed that sometimes, after a new book or author has been created, an identical entry is created “some time later” (within minutes). In most cases the IDs of the duplicates directly follow the original. We are now wondering how we can systematically get to the bottom of this, or whether the problem is known and may already have a workaround.

prolibre commented 1 month ago

@nesges I'm pretty desperate about this bug :-) I think there are dozens (hundreds?) of duplicates, and I've noticed this on several instances. When I create a book on my instance, it gets duplicated on instances that are linked to mine (and vice versa). I've got plenty of examples, unfortunately.

nesges commented 3 weeks ago

Found and merged a lot of duplicate authors: I did about 2,500 manual merges across roughly 11,700 authors. Here's how to find duplicate authors, in case anyone is interested:

-- find dupes by exact fullname
select id, name, lower(name) from bookwyrm_author where lower(name) in 
    (select lower(name) from bookwyrm_author group by lower(name) having count(lower(name)) > 1)
    order by lower(name) asc, id asc;

-- find dupes by fuzzy fullname
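-- note: metaphone() requires the PostgreSQL fuzzystrmatch extension (CREATE EXTENSION fuzzystrmatch;)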
select id, name, metaphone(name, 20) from bookwyrm_author where metaphone(name,20) in 
    (select metaphone(name,20) from bookwyrm_author group by metaphone(name,20) having count(metaphone(name,20)) > 1)
    order by metaphone(name,20) asc, name asc, id asc;

-- exact last name: replace metaphone(name) with REGEXP_REPLACE(name, '^.* ', '') etc.

You then have to check all authors with the same lowercased name/metaphone and merge them as described in https://docs.joinbookwyrm.com/management-commands.html.
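For example, assuming the --canonical/--other flags from those docs apply to merge_authors on your version (worth double-checking before running):

    venv/bin/python3 manage.py merge_authors --canonical 123 --other 456

where 123 is the author record to keep and 456 is the duplicate to fold into it.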

I haven't figured out a proper select for duplicate books yet. Ideas are very welcome!

hughrun commented 2 weeks ago

@dato I'm keen to see if we can fix this.

> (For now I've dropped "editions" from Work.deserialize_reverse_fields on my instance, to see if the duplication stops.)

Do you have any further data on this?

hughrun commented 2 weeks ago

Ok I think I understand the explanation from @dato now:

  1. Actions on an Edition that doesn't already exist on an instance will attempt to import that edition from the ActivityPub activity (e.g. an external user adds an edition to their "To Read" shelf).
  2. Importing an Edition will run to_model on the Edition
  3. to_model wants to import the parent Work before it will save the Edition because all Editions must have a parent Work
  4. Importing the Work will run to_model on the Work synchronously
  5. to_model on a Work triggers the import of all child Editions, but as a series of Celery jobs (i.e. asynchronously). This recursively includes the Edition we were importing in the first place
  6. If the Celery job runs fast enough, it will save a copy of the Edition before the original thread finishes: the job's find_existing won't pick up the original (because it hasn't been saved yet), and find_existing has already run on the original thread, so it won't stop creation there either.
  7. Now we have two copies of the same Edition

@dato @mouse-reeve am I understanding this correctly?

Re this:

> works on big instances can have tens of editions, and it seems weird (and wasteful) that a Note or ShelfItem for a single one of them brings all of them to my server, particularly for small instances

As a Mastodon instance admin I'm very sympathetic to this view (with default settings, Mastodon seems oblivious to the concept of "object storage bills"). On the other hand, if we simply turned off automatic import of all Editions when importing a Work, we would make the duplicates problem worse: users wouldn't be able to see any of the existing editions out there in the BookWyrm federated database, and would create many new "local" editions instead.

One possibility might be to adjust the workflow so that instead of pulling in all known editions at the point of adding a Work, we only add Editions as needed. This would require something like a "find more editions" button on Works, and forcing users to search for existing editions (much like searching for existing Works at the moment) before they are allowed to add a new Edition. Not sure if that makes sense? An advantage would be that each instance searches for known Editions at the point a user wants to find one, rather than only the first time someone searched for the Work, which may have been some time ago, before newer Editions were published and added on external instances.

hughrun commented 2 weeks ago

Ok, I spent a fair chunk of time looking at this today. In a lot of situations we're potentially importing the same Edition multiple times more or less simultaneously, which I think is the culprit here. Consider an example where an external user followed by local users adds an edition to a shelf, and we have not seen this book before. We receive an Add activity for the edition being added to the shelf, and we also receive a Create activity for a Note (and possibly a third one for a Comment) which also has the book in its inReplyToBook field:

THREAD 1 - Add an Edition to a Shelf

  1. import Edition: to_model
  2. create parent Work and save() the Work
  3. Add all the editions asynchronously using set_related_field
  4. save() the Edition - should be attached to the work

We now have at least one new Edition and related new Work

THREAD 2 - Create a Comment with an inReplyToBook

  1. Import Edition: to_model
  2. Thread 1 hasn't finished saving the Edition yet, so resolve_remote_id doesn't find it and we go ahead and create a duplicate copy
  3. Thread 1 finishes creating the Work but has not yet finished creating the Edition
  4. to_model is creating a new Edition, so it needs a parent_work.
  5. It checks for an existing Work using find_existing and finds it!
  6. It adds or updates all the editions using set_related_field

We now have all the editions from the external Work, plus an extra copy of the edition we wanted in the first place, attached to the same Work.
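To make the timing window concrete, here is a minimal, self-contained sketch of the check-then-create race in plain Python (illustrative names, not bookwyrm code):

    import threading
    import time

    rows = []  # stands in for the bookwyrm_book table

    def find_existing(origin_id):
        # deduplication check: do we already have a row for this remote book?
        return next((r for r in rows if r["origin_id"] == origin_id), None)

    def import_edition(origin_id):
        if find_existing(origin_id) is None:       # both threads pass this check
            time.sleep(0.1)                        # to_model still building the object
            rows.append({"origin_id": origin_id})  # "save" without re-checking

    url = "https://example.social/book/1"
    threads = [threading.Thread(target=import_edition, args=(url,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(len(rows))  # 2: the duplicate rows this issue is about

A database-level uniqueness guarantee (e.g. a unique constraint covering origin_id, with get_or_create-style saves) is one common way to close this kind of window, since application-level checks can always interleave like this.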

I'm not really sure what the correct solution is.

prolibre commented 2 weeks ago

@hughrun Looking in the database I see that both my duplicates have the same “origin_id”. Isn't it possible to use this to avoid duplicates? Just a thought.

hughrun commented 2 weeks ago

Yep, that's how it's supposed to work, in ActivitypubMixin:

https://github.com/bookwyrm-social/bookwyrm/blob/d16faac6dae33e8155877ae247862158957e2648/bookwyrm/models/activitypub_mixin.py#L94C1-L114C54
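Paraphrasing the idea in those lines (a hedged sketch, not the actual code; the real find_existing compares more deduplication fields than this):

    # sketch: dedupe an incoming ActivityPub object against known remote ids
    from django.db.models import Q

    def find_existing_by_remote_id(model, remote_id):
        # a row "claims" a remote id either as its remote_id (a copy of a
        # remote object) or as its origin_id (a local record that remembers
        # where it was copied from)
        return model.objects.filter(
            Q(remote_id=remote_id) | Q(origin_id=remote_id)
        ).first()

The race described above slips past this because both code paths run the lookup before either has committed a row.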

prolibre commented 1 week ago

@hughrun

I'm thinking of a way to clean up the database by removing duplicates. I tried this query to find duplicate books:

SELECT
        MIN(bb.id) AS min_id,
        MAX(bb.id) AS max_id,
        bb.origin_id,
        COUNT(*) AS duplicate_count,
        ROW_NUMBER() OVER() AS row_number
    FROM
        public.bookwyrm_book AS bb
    JOIN
        public.bookwyrm_edition AS be ON bb.id = be.book_ptr_id
    WHERE
        bb.origin_id IS NOT NULL
    GROUP BY
        bb.origin_id
    HAVING
        COUNT(*) > 1

I still get 3,592 rows back (with at most 3 duplicates of a book, but in the majority of cases a single duplicate, i.e. two copies per book). You can try it on your instances.

Have a good day.

EDIT: I've added a constraint (the id must exist in the editions table). I'm fumbling because I don't really understand the links between the book table (bookwyrm_book) and the edition table (bookwyrm_edition).

nesges commented 1 week ago

@prolibre Matching solely on origin_id is flawed, because dupes may have been altered by users. Compare https://buecher.pnpde.social/book/67290/s/greenwild and https://buecher.pnpde.social/book/67291/s/greenwild: they were found by your statement but even have different ISBNs.

I'm in the process of manually checking dupes found with a similar statement:

select b.id, b.title as name, b.title as normalized 
    from bookwyrm_book b
    where origin_id in (
        select origin_id from bookwyrm_book 
        group by origin_id having count(origin_id)>1
    )
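    -- skip books that already had a duplicate merged into them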
    and id not in (select merged_into_id from bookwyrm_mergedbook)

It's painful, but it's absolutely necessary to look at and compare all the info of a book/edition. I guess you could automate merging of editions that are identical in every regard, though.

prolibre commented 1 week ago

Hello @nesges. I use my query to find duplicates, then feed the results to venv/bin/python3 manage.py merge_editions. I've run a few tests and it seems to work fine.

prolibre commented 1 week ago

So, for the record, I've created a little script that uses the query I mentioned above to merge the duplicates (with merge_editions). I went from about 90,000 books to about 70,000 books in the database. There are still some duplicates (duplicate "editions"), such as https://bw.heraut.eu/book/46959 and https://bw.heraut.eu/book/46960, but I don't really know how to automate their merging because I don't fully understand the data structure. A sketch of my script is below.
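In case it helps anyone, here is a sketch of what such a script can look like. It assumes the pair query from my earlier comment and the --canonical/--other flags from the management-commands docs; verify both against your install before running anything:

    # sketch: merge pairs of editions that share an origin_id
    import subprocess
    import psycopg2  # adjust connection parameters for your setup

    PAIRS_QUERY = """
        SELECT MIN(bb.id) AS keep_id, MAX(bb.id) AS dupe_id
        FROM public.bookwyrm_book AS bb
        JOIN public.bookwyrm_edition AS be ON bb.id = be.book_ptr_id
        WHERE bb.origin_id IS NOT NULL
        GROUP BY bb.origin_id
        HAVING COUNT(*) = 2  -- pairs only; rerun for groups of three
    """

    with psycopg2.connect(dbname="bookwyrm") as conn:
        with conn.cursor() as cur:
            cur.execute(PAIRS_QUERY)
            pairs = cur.fetchall()

    for keep_id, dupe_id in pairs:
        subprocess.run(
            ["venv/bin/python3", "manage.py", "merge_editions",
             "--canonical", str(keep_id), "--other", str(dupe_id)],
            check=True,
        )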

Good day to all