SSHOC / sshoc-marketplace-backend

Code for the backend
Apache License 2.0
2 stars 0 forks source link

Merge of items gives error "Source item id is required" #169

Closed dpancic closed 1 year ago

dpancic commented 1 year ago

In GitLab by @KlausIllmayer on Oct 24, 2022, 17:07

When merging items, we got for some items the error Source item id is required because source was matched with '{Source}' by Accessible at Url.

Looking for the reason, I found out, that when merging an item the algorithm is looking for possible existing sources based on the host of the first accessibleAt-value. If it finds such a value, it will raise the error. In our example, we had a GitHub-URL as accessibleAt-value. As there already exists a source also having a GitHub-URL as source-URL, the mentioned error raises. I also tried it out with a Zotero-URL for accessibleAt, as we also have a source-URL referencing Zotero. The same error occurs. The problem raises, because the algorithm is only comparing the host-part of the URLs.

Have a look at https://gitlab.gwdg.de/sshoc/sshoc-marketplace-backend/-/blob/develop/src/main/java/eu/sshopencloud/marketplace/validators/items/ItemFactory.java#L86 where a source is looked for, based on the accessibleAtUri (and it takes only the first as you can see at https://gitlab.gwdg.de/sshoc/sshoc-marketplace-backend/-/blob/develop/src/main/java/eu/sshopencloud/marketplace/validators/items/ItemFactory.java#L83) - as the merged item does not contain a source (see discussion in https://gitlab.gwdg.de/sshoc/sshoc-marketplace-backend/-/issues/97) it still looks for the accessibleAt. Based on https://gitlab.gwdg.de/sshoc/sshoc-marketplace-backend/-/blob/develop/src/main/java/eu/sshopencloud/marketplace/validators/sources/SourceFactory.java#L86 it only looks for the hosts. As we have one source with host https://github.com (the source url itself is a full github-URL: https://github.com/clarin-eric/resource-families-html-generator/tree/master/resource_families) we run into the incident, that every merge that has a github-URL on first position of accessibleAt is not possible and raises the mentioned error.

If you like, you can reproduce it, by creating two items, having a github-URL as accessibleAt and trying to merge them. Even worse, when trying it out now this way, I discovered that even creating an item that has only a github-URL in accessibleAt already raises the error. This means, that actually it is not possible to create items that have an accessibleAt-url that shares the same host as any of the source-urls (you can try it with https://www.zotero.org/... or with https://www.cessda.eu/... or one of the other 15 source-urls we currently have). Based on this, I would say that it is major bug that we see here.

Notify @laureD19 @cesareconcordia @egray523, @olanowak

dpancic commented 1 year ago

In GitLab by @KlausIllmayer on Oct 24, 2022, 17:13

Not sure, how to solve it, but only looking for host seems to be unreliable.

dpancic commented 1 year ago

In GitLab by @vronk on Oct 26, 2022, 15:20

I have to admit I don't understand why the algorithm is actually matching Source with accessibleAt-URI? How are these connected?

dpancic commented 1 year ago

In GitLab by @tparkola on Oct 26, 2022, 19:06

changed the severity to High - S2

dpancic commented 1 year ago

In GitLab by @tparkola on Oct 26, 2022, 19:18

Indeed, there is such behaviour. But, as I see it in the code, it does not base on accessinleAt only. And I guess it was intentional. To explain: the system explicitly tries to identify source by id or (if not present) by accessibleAt parameter. The connection between source and URI - well - if you assume that one domain is connected to only one source, then you have the connection :) We know its not the case here, so... it can be changed/fixed (i.e. identification of the source by accessibleAt can be simply removed (and only by source id left)). But the question is - do you see any threats, especially in the ingestion phase? Situations in which you somehow base on this bevarious? To say it in different way: when creating or merging an item, can we assume that if the item is coming from a specific source then the request will always hold this information together with source item id?

dpancic commented 1 year ago

In GitLab by @KlausIllmayer on Oct 27, 2022, 14:06

I also remember, that it was intentional. To prevent items to be added that are already there (or that were intentional deleted). But only looking for host shows up as a too generic shortcut. I think it is still valid to check the combination of source.url and item.sourceItemId but that could become a complex query. As a trade-off we could agree on looking if the item.accessibleAt starts with one of the many source.url. This should solve the issue in my opinion and should not interfere with current behaviour, as we weren't aware about the algorithm only looking for the host.

dpancic commented 1 year ago

In GitLab by @KlausIllmayer on Oct 27, 2022, 14:30

adding @olanowak to the discussion

dpancic commented 1 year ago

In GitLab by @vronk on Oct 28, 2022, 15:49

I may be overlooking something, but in my understanding during ingest there already is a lookup mechanism based on item.sourceItemId. And in when creating an item accessibleAt could be compared to existing accessibleAt to avoid lookups, but I don'T see added value of mixing those two. And I don't at all understand why this check happens during merge?

So in my naive opinion the check should be simply removed.

dpancic commented 1 year ago

In GitLab by @KlausIllmayer on Oct 28, 2022, 15:56

It is part of the default initializing of items (ItemFactory.initializeItem()), which is independent from the merge algorithm. Merge is using this method when intializing an item. Someone would need to adapt the ItemFactory.initializeItem() to add a check for either activating it or not, depending on the context. Currently it is always checked it an item is initialized (be it create an item, merge an item, whatever you else do with an item).

dpancic commented 1 year ago

In GitLab by @tparkola on Oct 31, 2022, 12:52

Just to clarify - this is not a check. It is a way of finding a source for a given item. So what we discuss is not a check / validation / verification. It is a way of looking for a source of an item only when there is no information provided about the source of an item. Next, indeed, this is used not only in merge. Hovewer, my feeling is that we can remove the check completely regardless of context. It does not matter if the item is created, merged or updated. Each time the source should be provided, and if not, then it means the source is unknown. However, it also means that the "client" of the API needs to be aware of that - this is why I asked in my first comment whether you see any threats, do any of you recall this behaviour is used in ingestion. From the comments it seems not the case. To put it another way: when would we need such a feature, i.e. when the source of an item is not provided - lookup the source by the host of the first accessibleAt url... if we see no reason then let's remove this.

dpancic commented 1 year ago

In GitLab by @vronk on Nov 18, 2022, 15:34

Yes, let's remove it.

dpancic commented 1 year ago

In GitLab by @vronk on Nov 18, 2022, 15:37

marked this issue as related to #172

dpancic commented 1 year ago

In GitLab by @KlausIllmayer on Dec 9, 2022, 17:31

Deployed the fix on stage and develop and tested it by creating a tool with an GitHub accessibleAt value. This worked well, I will look into it in more detail next week (trying to merge items) before deploying the fix on production.

dpancic commented 1 year ago

In GitLab by @laureD19 on Dec 12, 2022, 10:46

If that can help, I've tested the POST merge on stage with Github links in accessible at and also as external IDs. It seems to work well.

api/tools-services/merge?with=UhXkKH&with=F8fsFP -> https://sshoc-marketplace-stage.acdh-dev.oeaw.ac.at/tool-or-service/kWPtc5

and

api/tools-services/merge?with=NhuMKv&with=auDlxK -> https://sshoc-marketplace-stage.acdh-dev.oeaw.ac.at/tool-or-service/8eFcjm

dpancic commented 1 year ago

In GitLab by @KlausIllmayer on Dec 14, 2022, 16:27

thanks, this helped - my tests were also successful, pushing it now to production

dpancic commented 1 year ago

In GitLab by @KlausIllmayer on Dec 14, 2022, 16:30

mentioned in commit 1fe7d27889ff69d75ca75cab4d74c730166725a6

dpancic commented 1 year ago

In GitLab by @KlausIllmayer on Dec 14, 2022, 17:03

Fix committed on production, tested a draft item with a GitHub-accessibleAt, which worked fine.

dpancic commented 1 year ago

In GitLab by @KlausIllmayer on Dec 14, 2022, 17:03

changed the incident status to Resolved by closing the incident