backstage / backstage

Backstage is an open framework for building developer portals
https://backstage.io/
Apache License 2.0
28.02k stars 5.93k forks source link

🐛 Bug Report: stitching process might produce duplicate entities to `search` table #26997

Open zcmander opened 1 week ago

zcmander commented 1 week ago

📜 Description

Recently we enabled paginated catalog view in our instance of Backstage. This changes the view to fetch entities from catalog using queryEntities call instead of battle-tested getEntities that does not support pagination at all.

However, soon after deploying the change to production we notices that the catalog returns duplicate entities in responses. It was weird as those entities does have the same uid in it so we are certain that those are not duplicate in our database but rather somekind of logic error in queries or how data is fetched.

The queryEntities fetches data by joining two tables: refresh_state and search table. I did check that the refresh_state did not contain any duplicate data but the search table does.

You can check out your own database if this is the case by running:

SELECT value, count(value) as count from search WHERE key = 'metadata.uid' GROUP BY value HAVING COUNT(value) > 1 ORDER BY count;

I'm suspecting that the following change is the reason why we stated seeing this unexpected result:

https://github.com/backstage/backstage/commit/53cce8663178091f8ff46c40a48db4e06ebf9159#diff-7841edf78cd947efd000b8b63a59d6bfbe1bb55c20e76fb810b2fedb33211aa2

As it joins search table in a way that it could produce duplicate entities (before the query was constructed in a way it would hide duplicates if there's any).

How this can be fixed??

I'm not too familiar with the catalog codebase but couple of things comes to my mind:

👍 Expected behavior

There should be no duplicates when calling queryEntities.

👎 Actual Behavior with Screenshots

This is a bug in API but visible if paginated catalog is enabled (as it is by default)

👟 Reproduction steps

Not sure how, I suspect that the root cause for the bug is somekind of race-condition and that's why running multiple nodes is required.

📃 Provide the context for the Bug.

🖥️ Your Environment

No response

👀 Have you spent some time to check if this bug has been raised before?

🏢 Have you read the Code of Conduct?

Are you willing to submit PR?

No, but I'm happy to collaborate on a PR with someone else

zcmander commented 1 week ago

area:catalog would be better label for this.

zcmander commented 1 week ago

Initial fix done (thx @drodil), maybe the best to leave this open as it's only improvement to current situation and does not fully fix the issue as it needs more time and effort to do it properly at database level.