bookwyrm-social / bookwyrm

Social reading and reviewing, decentralized with ActivityPub
http://joinbookwyrm.com/

Duplicate book after manual addition #3019

Open prolibre opened 11 months ago

prolibre commented 11 months ago

Describe the bug

I think this is a fairly significant bug. When I add a book manually, it sometimes ends up duplicated on some instances, but not on all of them and not every time.

To Reproduce

  1. I created a new book on my instance, one that does not yet exist in external sources: https://bw.heraut.eu/book/26362/s/chroniques-poetiques-dun-voyage-a-montreal

  2. Two days later I looked at bookwyrm.social and saw that the book is duplicated there (even though it is exactly the same book): https://bookwyrm.social/book/1427616/editions

  3. I looked at another instance and saw a single copy: https://bouquins.zbeul.fr/book/29947/s/chroniques-poetiques-dun-voyage-a-montreal

I've noticed this bug several times and it is a real nuisance. For example, I have two accounts, on two instances, that follow each other, and I sometimes see the book duplicated on my own timeline: for a book that should be the same, I get two posts that lead to two different books.

I'm using a translator and I hope my message is understandable.

hughrun commented 11 months ago

We will need to do more testing but this sounds like it is a consequence of the federated data model. In this case, what I think may be happening is that bookwyrm.social is federated with more data sources than either bw.heraut.eu or bouquins.zbeul.fr, hence the duplication can be seen there but the match wasn't picked up by bw.heraut.eu when you added the book.

dato commented 11 months ago

I hadn't suffered this when the issue was opened, but I saw something yesterday that makes me think it's not about sources, but true duplication:

A book that I added manually in my single-user instance appears twice in a remote instance where a follower of mine marked it as "want to read" (and added it to a list). The two editions in the remote instance:

Link to my book:

Editions in the remote instance (identical, consecutive IDs):

(BookWyrm shows just one, when I did a search just now: https://bookwyrm.social/book/1411467/editions)

> We will need to do more testing

Sounds like this for sure...

prolibre commented 11 months ago

@dato Yes, that's exactly it. We have duplicated books with consecutive IDs (21001 and 21002, for example). What I don't understand is why it doesn't happen for every creation. I think there's something really wrong somewhere.

prolibre commented 11 months ago

Looking at Flower, I came across a consequence of these duplications: errors when updating books. When there is a duplicate, I get a successful update on one copy and an update error on the other.

bookwyrm.activitypub.base_activity.set_related_field FAILURE ('Edition', 'Work', 'parent_work', 'https://bw.heraut.eu/book/27656', 'https://bookwyrm.social/book/1431738')

bookwyrm.activitypub.base_activity.set_related_field SUCCESS ('Edition', 'Work', 'parent_work', 'https://bw.heraut.eu/book/27657', 'https://bookwyrm.social/book/1431738')

Here is the traceback for the error in question:

Traceback (most recent call last):
  File "(...)/bookwyrm/venv/lib/python3.11/site-packages/celery/app/trace.py", line 451, in trace_task
    R = retval = fun(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^
  File "(...)/bookwyrm/venv/lib/python3.11/site-packages/celery/app/trace.py", line 734, in __protected_call__
    return self.run(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "(...)/bookwyrm/bookwyrm/activitypub/base_activity.py", line 276, in set_related_field
    raise ValueError(f"Invalid related remote id: {related_remote_id}")
ValueError: Invalid related remote id: https://bw.heraut.eu/book/27656

dato commented 10 months ago

I guess there can be multiple causes for this (and multiple codepaths for sure), but at least on my instance I have several examples of duplication caused by a shelving event with a comment (see timestamps below).

I think that actions such as "Finish with comment" come via AP as a single status. But it seems as if... the storing in the database were happening independently?

These are the timestamps of one such event ("Finish with comment") that arrived on my system, followed in quick succession by a ReviewRating (I'm not sure whether the behavior triggers without this second event):

16:49:51.468302 - remote Comment object (published_date)
16:49:52.345812 - remote ShelfBook object (shelved_date)
16:49:52.340294 - local Shelf object created (`read-66`)
16:49:53.023008 - first Book created (id=20324)
16:49:53.753374 - local Comment object (references book 20324)
16:49:54.039233 - second Book created (id=20325)
16:49:54.649446 - remote ReviewRating object (published_date)
16:49:55.071635 - local ReviewRating object (references book 20324)
16:49:55.769966 - local ShelfBook object created (references book 20325)

I was debugging missing images on my feed when I came across this.

N.B.: The database already contained a Work for these editions, with an edition in a different language.

prolibre commented 10 months ago

I think it's a pretty big bug and I'm sorry I can't help solve it. Week after week the duplications pile up. Unfortunately I'm pretty good at php and sql, but here I'm unable to help with the project. I regret it :-(

dato commented 10 months ago

I spent a couple hours on this and... to the best of my knowledge, this is due to the deserialization of a Work's editions in parallel with an HTTP request. Consider:

  1. an incoming Add of a ShelfItem, for an Edition E that doesn't exist locally
  2. views.inbox.sometimes_async_activity_task() will perform the Add synchronously
  3. if a work for E doesn't exist already, it will be created from its remote URL
  4. the editions of the remote Work object will be queued for update-or-creation in Celery
    • this happens at ActivityObject.to_model because Work.deserialize_reverse_fields includes editions
      • these set_related_field tasks are enqueued while E is not yet created
  5. duplication occurs if a worker picks up the task for E before its (ongoing) creation is complete
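
The steps above describe a check-then-create race. The following is a minimal sketch of that pattern only, not BookWyrm code: `create_edition`, `run_interleaved`, `simulate`, the in-memory `db` list and the example URL are all hypothetical names chosen for illustration. Generators stand in for the HTTP request and the Celery worker so that the interleaving is deterministic.

```python
from itertools import count

ORIGIN = "https://remote.example/book/1"  # hypothetical remote id


def create_edition(db, ids, origin_id):
    """Check-then-create, written as a generator so the gap between the
    lookup and the insert can be interleaved with another worker."""
    exists = any(row["origin_id"] == origin_id for row in db)
    yield  # the other worker may run here, before our insert
    if not exists:
        db.append({"id": next(ids), "origin_id": origin_id})


def run_interleaved(*workers):
    """Advance each worker one step at a time, round-robin."""
    pending = list(workers)
    while pending:
        for worker in list(pending):
            try:
                next(worker)
            except StopIteration:
                pending.remove(worker)


def simulate(interleave):
    """Run two creations of the same remote edition and return the rows."""
    db, ids = [], count(20324)  # starting id is illustrative
    workers = [create_edition(db, ids, ORIGIN) for _ in range(2)]
    if interleave:
        run_interleaved(*workers)
    else:
        for worker in workers:
            for _ in worker:  # run each worker to completion in turn
                pass
    return db


raced = simulate(interleave=True)
serial = simulate(interleave=False)
print(len(raced), len(serial))
```

When both workers perform their lookup before either has inserted, each concludes that the edition is missing and both create it, which yields two rows with consecutive IDs. Run one after the other, the second worker finds the first row and creates nothing.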

If this diagnosis is correct… I'm not sure what the best fix would be: whether in fixing the race condition itself, or also in considering whether it makes sense to deserialize all editions.
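
For the "fix the race condition itself" option, one common general technique is to let the database enforce uniqueness and make creation idempotent. The sketch below uses SQLite from the Python standard library purely as an illustration. The `book` table, its `UNIQUE` constraint on `origin_id`, and the `get_or_create` helper are assumptions made for this example; whether such a constraint is compatible with BookWyrm's actual schema has not been checked here.

```python
import sqlite3

ORIGIN = "https://remote.example/book/1"  # hypothetical remote id

conn = sqlite3.connect(":memory:")
# Assumption: a unique constraint on origin_id (not verified against BookWyrm).
conn.execute(
    "CREATE TABLE book ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "origin_id TEXT UNIQUE)"
)


def get_or_create(conn, origin_id):
    """Insert the row if it is missing, then return its local id.

    The INSERT is a no-op when a row with this origin_id already exists,
    so two callers racing on the same origin_id cannot create two rows.
    """
    conn.execute(
        "INSERT OR IGNORE INTO book (origin_id) VALUES (?)", (origin_id,)
    )
    row = conn.execute(
        "SELECT id FROM book WHERE origin_id = ?", (origin_id,)
    ).fetchone()
    return row[0]


first = get_or_create(conn, ORIGIN)
second = get_or_create(conn, ORIGIN)
total = conn.execute("SELECT COUNT(*) FROM book").fetchone()[0]
print(first == second, total)
```

The design choice here is to move the "does it exist?" decision from application code into the database, where the constraint is checked atomically with the insert.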

This code works beautifully for importing complete objects from remotes but, speaking as an admin, works in big instances can have tens of editions, and it seems weird (and wasteful) that a Note or ShelfItem for a single one of them will bring all of them to my server, particularly for small instances.

(For now I've dropped "editions" from Work.deserialize_reverse_fields in my instance, to see if the duplication stops.)

prolibre commented 10 months ago

I think the explanation lies elsewhere :-( sorry, I hope I'll make myself understood @dato

Looking at another example of duplication, I wonder if there isn't actually duplication on instance A (where the book is created), right from the start. Then, during exchanges (image below) with instance B, two books are created on B.

In this example, I follow Crapounifon (bookwyrm.social), who is on instance A, so I receive their updates. Here the update causes the creation of the book ("Qu'est ce qu'une nation") on my instance (B). And that's where the problems start.

[Image: SharedScreenshot]

But instance A then links the two books into a single one... instance B doesn't.

If I look at this book on instance A (https://bookwyrm.social/book/1460971/s/quest-ce-quune-nation), I see that another ID, one lower, exists in the database: https://bookwyrm.social/book/1460970/s/quest-ce-quune-nation. That ID (1460970) redirects to 1460971.

On the other hand, on B there are two distinct books: https://bw.heraut.eu/book/33664/editions

Did I make myself understood?

dato commented 10 months ago

Hi @prolibre! Yes, you were very clear. Thank you for taking the time to write it down. :)

> I think the explanation lies elsewhere

I've read everything, and I've checked all URLs, and I'll try to explain how what you observe is exactly what I described in my previous post.

A bit of BookWyrm internal terminology…

"Books" in BookWyrm are internally stored as two kinds of objects: Works, and Editions.

Edition is the name for what we normally conceive as book, with its own ISBN, language, format, year of publication, etc.

A Work is a more "abstract" entity that just serves to "group" all Edition objects that refer to the same... well: Work.

Because of this, any given book always has at least two IDs in a BookWyrm instance: one for the Edition, one for its associated Work.

As per the terms above, there is a single Edition of this work in bookwyrm.social, with ID 1460971. On the other hand, ID 1460970 is not an "Edition" object but a "Work", which is just there to hold any future additional editions. See: https://bookwyrm.social/book/1460970/editions.
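
The Work/Edition relationship described above can be sketched with two small classes. This is an illustration under simplifying assumptions, not BookWyrm's actual models: the class layout, the `parent_work_id` field and the `editions_of` helper are hypothetical, and only the two IDs come from this thread.

```python
from dataclasses import dataclass


@dataclass
class Work:
    """Abstract grouping object: holds no ISBN, language or format."""
    id: int
    title: str


@dataclass
class Edition:
    """The concrete book; points back to the Work that groups it."""
    id: int
    parent_work_id: int
    title: str


def editions_of(work, editions):
    """Return every edition whose parent is the given work."""
    return [e for e in editions if e.parent_work_id == work.id]


# IDs as they appear on bookwyrm.social in this thread.
work = Work(id=1460970, title="Qu'est ce qu'une nation")
all_editions = [
    Edition(id=1460971, parent_work_id=1460970, title="Qu'est ce qu'une nation"),
]

print([e.id for e in editions_of(work, all_editions)])
```

One book therefore occupies two IDs, one for the Work and one for its single Edition, without anything being duplicated.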

This work no. 1460970 becomes Work 33664 in your system. But remote edition no. 1460971 becomes Editions 33665 and 33666 on your instance, I'm pretty sure by the process I described yesterday.

> But instance A then links the two books into a single one... instance B doesn't.

This is the point that is cleared by the terminology above: they are linked in the same way in both instances.

On bookwyrm.social, the one Work "has" the one edition (IDs above). On yours, one Work "has" two editions (IDs above). The way they are linked is the same. (The redirection you observe happens on your instance too if you try to visit the Work as if it were an edition: https://bw.heraut.eu/book/33664/s/quest-ce-quune-nation redirects to 33665.)

At the database level, all this can be observed in the origin_id column: if more than one row has the same origin_id, this is the race condition/bug we're talking about. (This field doesn't get exposed in JSON format, otherwise I would point to it.)
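
The origin_id check can be expressed as a grouped query. The sketch below runs it against an in-memory SQLite table so that it is self-contained. The table name `book`, the two-column layout and the exact form of the stored origin_id values are assumptions for illustration; on a real instance the table and column names should be checked against the actual schema first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table: only the two columns needed for the check.
conn.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, origin_id TEXT)")
conn.executemany(
    "INSERT INTO book (id, origin_id) VALUES (?, ?)",
    [
        # Local IDs from this thread; the origin_id values are illustrative.
        (33664, "https://bookwyrm.social/book/1460970"),  # the Work
        (33665, "https://bookwyrm.social/book/1460971"),  # first Edition
        (33666, "https://bookwyrm.social/book/1460971"),  # duplicate Edition
    ],
)

# Any origin_id shared by more than one local row indicates a duplicate.
duplicates = conn.execute(
    """
    SELECT origin_id, COUNT(*) AS copies, GROUP_CONCAT(id) AS local_ids
    FROM book
    WHERE origin_id IS NOT NULL
    GROUP BY origin_id
    HAVING COUNT(*) > 1
    """
).fetchall()

for origin_id, copies, local_ids in duplicates:
    print(origin_id, copies, local_ids)
```

Only the edition's origin_id is reported, because the Work row has a different origin_id and appears once.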

prolibre commented 10 months ago

@dato merci beaucoup! Thank you very much for these explanations. Little by little I'm making progress in my understanding of BookWyrm. Your explanations are very good.

prolibre commented 10 months ago

> This code works beautifully for importing complete objects from remotes but, speaking as an admin, works in big instances can have tens of editions, and it seems weird (and wasteful) that a Note or ShelfItem for a single one of them will bring all to my server, particularly for small instances.
>
> (For now I've dropped "editions" from Work.deserialize_reverse_fields in my instance, to see if the duplication stops.)

@dato As for the action you've blocked: would it be possible to move it to a scheduler (a task launched from time to time), so that it runs at some point during the day rather than immediately after creation? A sort of background task. But maybe I'm talking nonsense.