prolibre opened this issue 1 year ago
We will need to do more testing, but this sounds like it is a consequence of the federated data model. In this case, what I think may be happening is that bookwyrm.social is federated with more data sources than either bw.heraut.eu or bouquins.zbeul.fr, hence the duplication can be seen there, but the match wasn't picked up by bw.heraut.eu when you added the book.
I hadn't run into this myself when the issue was opened, but I saw something yesterday that makes me think it's not about sources, but true duplication:
A book that I added manually in my single-user instance, appears twice in a remote instance where a follower of mine marked it as "want to read" (and added it to a list). The two editions in the remote instance:
Link to my book:
Editions in the remote instance (identical, consecutive IDs):
(BookWyrm shows just one, when I did a search just now: https://bookwyrm.social/book/1411467/editions)
We will need to do more testing
Sounds like this for sure...
@dato Yes, that's exactly it. We have very similar duplicated book ids (21001 and 21002, for example). But what I don't understand is that it doesn't happen for every creation. I think there's something really wrong somewhere.
Looking at Flower I came across a consequence of these duplications: errors when updating books. When there is a duplicate, I get an OK update and an update error on the duplicate.
bookwyrm.activitypub.base_activity.set_related_field FAILURE ('Edition', 'Work', 'parent_work', 'https://bw.heraut.eu/book/27656', 'https://bookwyrm.social/book/1431738')
bookwyrm.activitypub.base_activity.set_related_field SUCCESS ('Edition', 'Work', 'parent_work', 'https://bw.heraut.eu/book/27657', 'https://bookwyrm.social/book/1431738')
Here is the traceback for the error in question:
Traceback (most recent call last):
File "(...)/bookwyrm/venv/lib/python3.11/site-packages/celery/app/trace.py", line 451, in trace_task
R = retval = fun(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^
File "(...)/bookwyrm/venv/lib/python3.11/site-packages/celery/app/trace.py", line 734, in __protected_call__
return self.run(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
^^^^^^^^^^^^^^^^^^^
File "(...)/bookwyrm/bookwyrm/activitypub/base_activity.py", line 276, in set_related_field
raise ValueError(f"Invalid related remote id: {related_remote_id}")
ValueError: Invalid related remote id: https://bw.heraut.eu/book/27656
I guess there can be multiple causes for this (and multiple codepaths for sure), but at least on my instance I have several examples of duplication caused by a shelving event with a comment (see timestamps below).
I think that actions such as "Finish with comment" come via AP as a single status. But it seems as if... the storing in the database were happening independently?
These are the timestamps of such an event ("Finish with comment") that arrived on my system, followed in quick succession by a ReviewRating (I'm not sure whether the behavior triggers without this second event):
16:49:51.468302 - remote Comment object (published_date)
16:49:52.345812 - remote ShelfBook object (shelved_date)
16:49:52.340294 - local Shelf object created (`read-66`)
16:49:53.023008 - first Book created (id=20324)
16:49:53.753374 - local Comment object (references book 20324)
16:49:54.039233 - second Book created (id=20325)
16:49:54.649446 - remote ReviewRating object (published_date)
16:49:55.071635 - local ReviewRating object (references book 20324)
16:49:55.769966 - local ShelfBook object created (references book 20325)
I was debugging missing images on my feed when I came across this.
N.B.: The database already contained a Work for these editions, with an edition in a different language.
I think it's a pretty big bug, and I'm sorry I can't help solve it. Week after week the duplications pile up. Unfortunately my skills are in PHP and SQL, so on this project I'm unable to help. I regret it :-(
I spent a couple hours on this and... to the best of my knowledge, this is due to the deserialization of a Work's editions in parallel with an HTTP request. Consider:
1. An incoming status references an edition `E` that doesn't exist locally.
2. Since `E` doesn't exist already, it will be created from its remote URL.
3. As part of that, the `editions` of the remote Work object will be queued for update-or-creation in Celery.
4. One of those `editions` is `E`.
5. When its Celery job runs, `E` is not yet created, so a second `E` gets created before its (ongoing) creation is complete.

If this diagnosis is correct… I'm not sure what the best fix would be: whether in fixing the race condition itself, or also in considering whether it makes sense to deserialize all editions.
This code works beautifully for importing complete objects from remotes but—speaking as an admin—Works on big instances can have tens of editions, and it seems weird (and wasteful) that a Note or ShelfItem for a single one of them will bring them all to my server, particularly for small instances.
(For now I've dropped "editions" from `Work.deserialize_reverse_fields` in my instance, to see if the duplication stops.)
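To make the timing concrete, here is a minimal, self-contained sketch of that kind of race (plain Python threads stand in for the request thread and the Celery worker; none of these names are the real BookWyrm functions, and the URL is made up):

```python
# A toy model of the suspected race: two workers both run "look up by remote id,
# create if missing" for the same edition, and the second lookup happens before
# the first save() finishes.
import threading
import time

db = []  # stand-in for the bookwyrm_book table

def find_existing(remote_id):
    """return an existing row with this origin_id, if any"""
    return next((row for row in db if row["origin_id"] == remote_id), None)

def create_edition(remote_id, work_seconds):
    """if the edition is missing, spend some time building it, then save it"""
    if find_existing(remote_id):
        return
    time.sleep(work_seconds)  # deserializing fields, fetching covers, ...
    db.append({"id": len(db) + 1, "origin_id": remote_id})  # save()

remote_id = "https://example.social/book/1234"  # hypothetical remote edition
# the request handling the incoming status, and the Celery job queued while
# deserializing the parent Work's `editions` list, both try to create it
request_thread = threading.Thread(target=create_edition, args=(remote_id, 0.2))
celery_job = threading.Thread(target=create_edition, args=(remote_id, 0.1))
request_thread.start(); celery_job.start()
request_thread.join(); celery_job.join()
print(db)  # two rows with the same origin_id: the duplicated edition
```

Both lookups return nothing because neither creation has been saved yet, which is exactly the window the timestamps above fall into.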
I think the explanation lies elsewhere :-( Sorry, I hope I can make myself understood, @dato.
Looking at another example of duplication, I wonder if there isn't actually duplication on instance A (where the book is created), right from the start. Then, during exchanges (image below) with instance B, two books are created on B.
In this example I follow Crapounifon on bookwyrm.social (instance A), so I receive their updates. Here an update causes the creation of the book ("Qu'est ce qu'une nation") on my instance (B). And that's where the problems start.
But instance A then links the two books into a single one... instance B doesn't.
If I look at this book on instance A: https://bookwyrm.social/book/1460971/s/quest-ce-quune-nation I see that another id (ID-1) exists in the database: https://bookwyrm.social/book/1460970/s/quest-ce-quune-nation ... an ID (1460970) which redirects to 1460971.
On the other hand, on B there are two distinct books: https://bw.heraut.eu/book/33664/editions
Did I make myself clear?
Hi @prolibre! Yes, you were very clear. Thank you for taking the time to write it down. :)
I think the explanation lies elsewhere
I've read everything, and I've checked all URLs, and I'll try to explain how what you observe is exactly what I described in my previous post.
"Books" in BookWyrm are internally stored as two kinds of objects: Works, and Editions.
Edition is the name for what we normally conceive of as a book, with its own ISBN, language, format, year of publication, etc.
A Work is a more "abstract" entity that just serves to "group" all Edition objects that refer to the same... well: Work.
Because of this, any given book always has at least two IDs in a BookWyrm instance: one for the Edition, one for its associated Work.
As per the terms above, there is a single Edition of this work in bookwyrm.social, with ID 1460971. On the other hand, ID 1460970 is not an "Edition" object but a "Work", which is just there to hold any future additional editions. See: https://bookwyrm.social/book/1460970/editions.
This work no. 1460970 becomes Work 33664 in your system. But remote edition no. 1460971 becomes Editions 33665 and 33666 on your instance, I'm pretty sure by the process I described yesterday.
But instance A then links the two books into a single one... instance B doesn't.
This is the point that is cleared up by the terminology above: they are linked in the same way in both instances.
On bookwyrm.social, the one Work "has" the one edition (IDs above). On yours, one Work "has" two editions (IDs above). The way they are linked is the same. (The redirection you observe happens on your instance too, if you try to visit the Work as if it were an edition: https://bw.heraut.eu/book/33664/s/quest-ce-quune-nation redirects to 33665.)
At the database level, all this can be observed in the origin_id column: if more than one row has the same origin_id, this is the race condition/bug we're talking about. (This field doesn't get exposed in JSON format, otherwise I would point to it.)
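If you want to check this on your own instance, something along these lines should list the affected rows (a rough sketch run from `manage.py shell`, assuming the Edition model and the origin_id field described above):

```python
# Rough sketch: list origin_ids that appear on more than one local edition,
# which is the symptom of the duplication described above.
from django.db.models import Count
from bookwyrm.models import Edition

dupes = (
    Edition.objects.exclude(origin_id__isnull=True)
    .values("origin_id")
    .annotate(copies=Count("pk"))
    .filter(copies__gt=1)
    .order_by("-copies")
)
for row in dupes:
    print(row["copies"], row["origin_id"])
```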
@dato merci beaucoup ! Thank you very much for these explanations. Little by little I'm making progress in my understanding of bookwyrm. Your explanations are very good.
This code works beautifully for importing complete objects from remotes but—speaking as an admin—Works on big instances can have tens of editions, and it seems weird (and wasteful) that a Note or ShelfItem for a single one of them will bring them all to my server, particularly for small instances.
(For now I've dropped "editions" from `Work.deserialize_reverse_fields` in my instance, to see if the duplication stops.)
@dato And this action that you've disabled, would it be possible to move it to a scheduler (a task launched from time to time), so that it's carried out later in the day rather than immediately after creation? A sort of background task. But maybe I'm talking nonsense.
I'm not quite sure if our issue is exactly the same, but we see a lot of duplicate books and authors on our installation of bookwyrm. We believe we have observed that sometimes after a new book or new author has been created, an identical entry is created “some time later” (within minutes). Also in most cases, the IDs of the duplicates directly follow the original. We are now wondering how we can systematically get to the bottom of this, or whether the problem is known and may already have a workaround.
@nesges So I'm pretty desperate about this bug :-) I think there are dozens (hundreds?) of duplicates and I've noticed this on several instances. When I create a book on my instance I notice that it's duplicated on instances that are linked to mine (and vice versa). I've got plenty of examples, unfortunately.
Found and merged a lot of duplicate authors: I did about 2500 manual merges on about 11,700 authors. Here's how to find dupe authors, in case anyone is interested:
-- find dupes by exact fullname
select id, name, lower(name) from bookwyrm_author where lower(name) in
(select lower(name) from bookwyrm_author group by lower(name) having count(lower(name)) > 1)
order by lower(name) asc, id asc;
-- find dupes by fuzzy fullname (metaphone() is provided by the fuzzystrmatch extension)
select id, name, metaphone(name, 20) from bookwyrm_author where metaphone(name,20) in
(select metaphone(name,20) from bookwyrm_author group by metaphone(name,20) having count(metaphone(name,20)) > 1)
order by metaphone(name,20) asc, name asc, id asc;
-- exact last name: replace metaphone(name) with REGEXP_REPLACE(name, '^.* ', '') etc.
You then have to check all authors with the same lowercased name/metaphone and merge them as described in https://docs.joinbookwyrm.com/management-commands.html.
I haven't figured out a proper select for duplicate books yet. Ideas are very welcome!
@dato I'm keen to see if we can fix this.
(For now I've dropped "editions" from `Work.deserialize_reverse_fields` in my instance, to see if the duplication stops.)
Do you have any further data on this?
Ok I think I understand the explanation from @dato now:
1. An `Edition` that doesn't already exist in an instance will attempt to import that edition from the ActivityPub activity (e.g. an external user adds an edition to their To Read shelf).
2. This calls `to_model` on the `Edition`.
3. `to_model` wants to import the parent Work before it will save the Edition, because all Editions must have a parent Work.
4. So it calls `to_model` on the Work synchronously.
5. `to_model` on a Work triggers the import of all child Editions, but as a series of Celery jobs (i.e. asynchronously). This recursively includes the Edition we were importing in the first place.
6. `find_existing` won't pick it up (because it doesn't exist yet), but it has already run on the original thread, so it won't stop creation there either.

@dato @mouse-reeve am I understanding this correctly?
Re this:
Works on big instances can have tens of editions, and it seems weird (and wasteful) that a Note or ShelfItem for a single one of them will bring them all to my server, particularly for small instances.
As a Mastodon instance admin I'm very sympathetic to this view (using default settings, Mastodon seems oblivious to the concept of "object storage bills"). On the other hand, if we simply turned off automatic import of all Editions when importing a Work, we would make the problem of duplicates worse: users won't be able to see any of the existing editions out there in the BookWyrm federated database, and will create many new "local" editions instead.
One possibility might be to adjust the workflow of adding Editions such that instead of pulling all known editions in at the point of adding a Work, we only add Editions as needed. This would require something like a "find more editions" button on Works, and forcing users to search for existing editions (much like searching for existing Works at the moment) before they are allowed to add a new Edition. Not sure if that makes sense? An advantage of this would be that each instance searches for known Editions at the point of a user wanting to find an Edition rather than merely the first time someone searched for a Work, which may have been some time ago before newer Editions were published and added on external instances.
Ok, I spent a fair chunk of time looking at this today. In a lot of situations we're potentially importing the same Edition multiple times more or less simultaneously, which I think might be the culprit here. Consider an example where an external user followed by local users adds an edition to a shelf, and we have not seen this book before. We receive an `ADD` Activity for the edition to be added to the shelf, and we also receive a `CREATE` activity for a `Note` (and possibly a third one for a `Comment`) which also has the book in its `inReplyToBook` field:
For the `ADD` activity:

1. `to_model` resolves the parent `Work` and calls save() on the Work.
2. `set_related_field` jobs are queued for the Work's editions.
3. We now have at least one new Edition and related new Work.

For the `Note`/`Comment`, which references the book in `inReplyToBook`:

1. `to_model` calls `resolve_remote_id` for the book.
2. It doesn't find it, so we go ahead and create a duplicate copy.
3. The first activity has saved the `Work` but has not yet finished creating the Edition.
4. `to_model` is creating a new `Edition`, so it needs a `parent_work`.
5. It looks up the `Work` using `find_existing` and finds it!
6. `set_related_field` jobs run for all of the Work's editions. We now have all the editions from the external Work, plus an extra copy of the edition we wanted in the first place, attached to the same Work.
I'm not really sure what the correct solution is.
@hughrun Looking in the database I see that both my duplicates have the same “origin_id”. Isn't it possible to use this to avoid duplicates? Just a thought.
Yep, that's how it's supposed to work, in `ActivityPubMixin`:
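Roughly, the intended behaviour is a lookup along these lines before anything gets created (a paraphrase for illustration, not the actual BookWyrm source; the fields are the remote_id and origin_id discussed above):

```python
# Paraphrased illustration of the dedup lookup (not the real ActivityPubMixin code):
# before creating an object from an incoming ActivityPub document, look for an
# existing row whose remote_id or origin_id already matches the incoming id.
from django.db.models import Q

def find_existing_by_remote_id(model_class, remote_id):
    return model_class.objects.filter(
        Q(remote_id=remote_id) | Q(origin_id=remote_id)
    ).first()
```

The race described above slips past this check because both code paths run it before either of them has saved its row, so both see nothing and both create.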
@hughrun I'm thinking of a way to clean up the database by removing duplicates. I tried this query to find duplicate books:
SELECT
MIN(bb.id) AS min_id,
MAX(bb.id) AS max_id,
bb.origin_id,
COUNT(*) AS duplicate_count,
ROW_NUMBER() OVER() AS row_number
FROM
public.bookwyrm_book AS bb
JOIN
public.bookwyrm_edition AS be ON bb.id = be.book_ptr_id
WHERE
bb.origin_id IS NOT NULL
GROUP BY
bb.origin_id
HAVING
COUNT(*) > 1
I still get 3592 rows back (with a maximum of 3 duplicates, but mostly 1 duplicate, i.e. two copies per book). You can try it on your instances.
Have a good day.
EDIT: I've added a constraint (the id must exist in the editions table). I'm fumbling somewhat, because I don't really understand the links between the book table (bookwyrm_book) and the edition table (bookwyrm_edition).
@prolibre matching solely on origin_id is flawed, because dupes may have been altered by users. Compare https://buecher.pnpde.social/book/67290/s/greenwild and https://buecher.pnpde.social/book/67291/s/greenwild: they were found by your statement but even have different ISBN numbers.
I'm in the process of manually checking dupes found with a similar statement:
select b.id, b.title as name, lower(b.title) as normalized
from bookwyrm_book b
where origin_id in (
select origin_id from bookwyrm_book
group by origin_id having count(origin_id)>1
)
and id not in (select merged_into_id from bookwyrm_mergedbook)
It's painful, but it's absolutely necessary to look at and compare all the info of a book/edition. I guess you could automate merging of editions that are the same in every regard, though.
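For the "same in every regard" case, something along these lines might surface safe candidates (a rough, untested sketch run from `manage.py shell`; the printed id pairs would still deserve a manual look before feeding them to merge_editions):

```python
# Rough sketch: group editions that share an origin_id and only print pairs that
# also agree on title and ISBN, as candidates for the merge_editions command.
from collections import defaultdict
from bookwyrm.models import Edition

by_origin = defaultdict(list)
for edition in Edition.objects.exclude(origin_id__isnull=True).exclude(origin_id=""):
    by_origin[edition.origin_id].append(edition)

for origin_id, editions in by_origin.items():
    if len(editions) < 2:
        continue
    editions.sort(key=lambda e: e.pk)
    keep = editions[0]
    for other in editions[1:]:
        if other.title == keep.title and other.isbn_13 == keep.isbn_13:
            print(keep.pk, other.pk)  # candidate pair to merge, keeping the lower id
```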
Hello @nesges, I use my query to find duplicates, then I feed the results to the command
venv/bin/python3 manage.py merge_editions
I've run a few tests and it seems to work fine.
So for the record, I've created a little script that uses the query I mentioned above to merge the duplicates (with merge_editions). I went from about 90,000 books to about 70,000 books in the database. There are still some duplicates (duplicate "editions"), such as https://bw.heraut.eu/book/46959 and https://bw.heraut.eu/book/46960, but I don't really know how to go about automating their merging because I haven't mastered the data structure.
Good day to all
Describe the bug
I think this is a fairly significant bug. If I add a book manually, I've noticed that sometimes it ends up duplicated on some instances. But not all, and not always.
To Reproduce
I've created a new book on my instance. A book that does not yet exist on external sources. https://bw.heraut.eu/book/26362/s/chroniques-poetiques-dun-voyage-a-montreal
Two days later I look at bookwyrm.social and see that the book has been duplicated (even though it's exactly the same book): https://bookwyrm.social/book/1427616/editions
I look on another instance and see that it's a single copy: https://bouquins.zbeul.fr/book/29947/s/chroniques-poetiques-dun-voyage-a-montreal
I've noticed this bug several times and it seems to be a real nuisance. For example, I have two accounts on two instances that follow each other, and I see that sometimes the book is duplicated on my own timeline: for a book that should be the same, I have two posts that lead to two different books.
I'm using a translator and I hope my message is understandable.