Closed christophfriedrich closed 4 years ago
I'm fairly sure that I did not expand the "All collections" section before taking the screenshot yesterday, so it's interesting that the behaviour has since changed: initially 434 collections are listed, and only after expanding the section is that number replaced with 921. That's another sign that this is probably due to inconsistent database content (see also https://github.com/Open-EO/openeo-hub/issues/78#issuecomment-686645205)
Okay, the problem is the primary key of the collections table: It's set on service + api_version + id. So a collection is identified e.g. by https://earthengine.openeo.org + 1.0.0-rc.2 + COPERNICUS/S2.
This was done to minimise changes (see also #56), because the service URL most likely never changes, and I thought the same of the api_version field. But now it has happened: the GEE driver's api_version was changed from 1.0.0-rc.2 to 1.0.0, causing duplicate entries to occur.
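To make the duplication concrete, here is a minimal sketch of how a composite key of service + api_version + id behaves when the version changes. The in-memory Map and the helper names (primaryKey, upsert) are hypothetical stand-ins for illustration, not the Hub's actual database code.

```javascript
// Hypothetical in-memory stand-in for the collections table (not the Hub's real code).
const collections = new Map();

// Builds the composite key service + api_version + id.
function primaryKey(doc) {
  return [doc.service, doc.api_version, doc.id].join('|');
}

// Inserts the document, or overwrites the row that has the same key.
function upsert(doc) {
  collections.set(primaryKey(doc), doc);
}

const service = 'https://earthengine.openeo.org';

upsert({ service, api_version: '1.0.0-rc.2', id: 'COPERNICUS/S2' });
// Re-crawling with an unchanged api_version overwrites the existing row.
upsert({ service, api_version: '1.0.0-rc.2', id: 'COPERNICUS/S2' });
const rowsBeforeVersionChange = collections.size;

// After the version bump the key differs, so the same collection gets a second row.
upsert({ service, api_version: '1.0.0', id: 'COPERNICUS/S2' });
const rowsAfterVersionChange = collections.size;

console.log(rowsBeforeVersionChange, rowsAfterVersionChange);
```

The old row is never overwritten because nothing produces its key any more, so it stays around until something deletes it explicitly.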
But the grouping of all raw documents into the individual backend entries is done on the backend field, causing the old 1.0.0-rc.2 documents that were previously crawled to end up in the same aggregation as the fresh 1.0.0 ones, because they both belong to the https://earthengine.openeo.org/v1.0 backend.
And because one unsuccessful endpoint* is enough to deem the whole backend unsuccessful, GEE was flagged as such. https://github.com/Open-EO/openeo-hub/blob/6c64fdfb77db018b730baf549f5596e3694654c1/src/dbqueries.js#L15
* of any of the endpoints /, /collections, /processes, /service_types, /output_formats, /file_formats, /udf_runtimes
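The flagging rule can be sketched roughly like this. It is an illustration of the "one failed endpoint is enough" logic only, not the query at the linked line of dbqueries.js; the path field and the function name isBackendUnsuccessful are assumptions, while unsuccessfulCrawls is the counter mentioned in this thread.

```javascript
// Sketch only: field names other than unsuccessfulCrawls are assumed for illustration.
const ENDPOINTS = ['/', '/collections', '/processes', '/service_types',
                   '/output_formats', '/file_formats', '/udf_runtimes'];

// A backend counts as unsuccessful as soon as any of its endpoint documents failed.
function isBackendUnsuccessful(endpointDocs) {
  return endpointDocs.some(doc => doc.unsuccessfulCrawls > 0);
}

// All seven endpoints were crawled fine in the fresh run ...
const freshDocs = ENDPOINTS.map(path => ({ path, api_version: '1.0.0', unsuccessfulCrawls: 0 }));
// ... but a stale document from the old api_version sits in the same group.
const staleDoc = { path: '/', api_version: '1.0.0-rc.2', unsuccessfulCrawls: 1 };

const flaggedWithoutStale = isBackendUnsuccessful(freshDocs);
const flaggedWithStale = isBackendUnsuccessful([...freshDocs, staleDoc]);

console.log(flaggedWithoutStale, flaggedWithStale);
```

So a single leftover document is enough to mark an otherwise healthy backend as unavailable.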
Questions arising from this:
1. Was a changing api_version field a one-time issue in the current dev phase or should that be treated as a use case that could happen regularly?
2. Should the grouping be changed to service + api_version?
3. Should it need more than just a single failed endpoint to cause flagging?
For 2. I'd say yes, it kinda would've prevented this bug (when the service + api_version change was introduced it should've been changed anyway, I probably just overlooked it).
For 3. I'd say no, but how crawling errors are communicated to the user should be discussed anyway, which is why #23 exists.
Thanks for investigating.
- Was a changing api_version field a one-time issue in the current dev phase or should that be treated as a use case that could happen regularly?
That can happen regularly (like every x months or so)
- Should the grouping be changed to service + api_version?
I don't fully understand that yet. Can you use the https://earthengine.openeo.org/v1.0 URL?
- Should it need more than just a single failed endpoint to cause flagging?
Fine with "no".
I don't fully understand that yet.
Assume a backend changes its api_version. After crawling there will be two documents for the / endpoint in the database's raw table: one with 1.0.0-rc.2 (old, now has unsuccessfulCrawls=1) and one with 1.0.0 (new).
Now the difference is:
Grouping on backend (i.e. https://earthengine.openeo.org/v1.0):
-> 1 backend
Grouping on service + api_version:
-> 2 backends
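The same comparison as a small runnable sketch. The groupBy helper and the path field are hypothetical; the backend, service, api_version and unsuccessfulCrawls values are the ones from the example above.

```javascript
// Generic grouping helper (hypothetical, for illustration only).
function groupBy(docs, keyFn) {
  const groups = new Map();
  for (const doc of docs) {
    const key = keyFn(doc);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(doc);
  }
  return groups;
}

// The two documents for the / endpoint after the api_version change.
const rawDocs = [
  { backend: 'https://earthengine.openeo.org/v1.0', service: 'https://earthengine.openeo.org',
    api_version: '1.0.0-rc.2', path: '/', unsuccessfulCrawls: 1 },  // old
  { backend: 'https://earthengine.openeo.org/v1.0', service: 'https://earthengine.openeo.org',
    api_version: '1.0.0', path: '/', unsuccessfulCrawls: 0 },       // new
];

const byBackend = groupBy(rawDocs, doc => doc.backend);
const byServiceAndVersion = groupBy(rawDocs, doc => `${doc.service}|${doc.api_version}`);

console.log(byBackend.size, byServiceAndVersion.size);
```

Grouping on backend puts the old and the new document into one entry, whereas grouping on service + api_version keeps them apart.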
Grouping on backend seems correct, but I guess the question is why crawling doesn't drop the old collections. It sounds like the original issue is that old data doesn't get removed or correctly updated on crawling, right?
That's right, and the cause was the same: Old data was removed based on the backend field, and because both old and new data had the same backend value, nothing was deleted. I have now changed the deletion step to use service + api_version too, so this bug is fixed.
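Roughly, the difference in the deletion step looks like the sketch below. This is one plausible reading of the cleanup, assuming it keeps only documents whose key was seen in the latest crawl; pruneStale and the data layout are hypothetical and not the actual crawler code.

```javascript
// Keeps only the stored docs whose key also occurs in the latest crawl (assumed cleanup rule).
function pruneStale(storedDocs, freshlyCrawledDocs, keyFn) {
  const liveKeys = new Set(freshlyCrawledDocs.map(keyFn));
  return storedDocs.filter(doc => liveKeys.has(keyFn(doc)));
}

const oldDoc = { backend: 'https://earthengine.openeo.org/v1.0',
                 service: 'https://earthengine.openeo.org', api_version: '1.0.0-rc.2' };
const newDoc = { backend: 'https://earthengine.openeo.org/v1.0',
                 service: 'https://earthengine.openeo.org', api_version: '1.0.0' };

const stored = [oldDoc, newDoc];   // what is in the database after the crawl
const freshlyCrawled = [newDoc];   // what the latest crawl actually returned

// Old behaviour: both docs share the backend value, so nothing is deleted.
const keptByBackend = pruneStale(stored, freshlyCrawled, doc => doc.backend);
// New behaviour: the old doc has a different service + api_version key and is dropped.
const keptByServiceAndVersion =
  pruneStale(stored, freshlyCrawled, doc => `${doc.service}|${doc.api_version}`);

console.log(keptByBackend.length, keptByServiceAndVersion.length);
```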
I tested the crawling several times and it worked both for the GEE case and also for EODC, which changed to 1.0.0 today (at least I believe so, as the deployed Hub still lists 1.0.0-rc.2 but the live backend reports 1.0.0). So I guess after tonight's crawl the deployed Hub will list EODC incorrectly too. But as soon as this fix is deployed and the next crawl is done, it will go away :)
I'm confident this works fine, so I merged it onto master; feel free to deploy it whenever you've got the time (it's not super urgent IMO).
It seems fixed. I restarted the server this morning and couldn't reproduce any longer (although the server is on the dev branch, I think).
Right now, dev and master are identical. As long as you only pull when I tell you to do so you can leave it on dev :D
@m-mohr informed me that there's something wrong with how the Earth Engine driver is listed in the Hub:
The backend is reported as being unavailable, but it doubtlessly is available.
This bug seems to be due to the collection crawling: More than 900 collections are being reported, but GEE actually has "only" around 480, of which quite a few are rather new, so probably both the old and the new ones (~440 + ~480 = ~920) are floating around the database and causing errors...