Collection crawling seems to be broken

christophfriedrich commented 4 years ago

@m-mohr informed me that there's something wrong with how the Earth Engine driver is listed in the Hub:

The backend is reported as being unavailable, but it doubtlessly is available.

This bug seems to be due to the collection crawling: More than 900 collections are being reported, but GEE actually has "only" around 480, of which quite a few are rather new, so probably both the old and the new ones (~440 + ~480 = ~920) are floating around the database and causing errors...

christophfriedrich commented 4 years ago

I'm relatively sure that I did not expand the "All collections" section before taking the screenshot yesterday, so it's interesting to note that the behaviour has changed in the meantime: initially, 434 collections are listed, and only after expanding the section is that number replaced with 921. That's another sign that this is probably due to inconsistent database content (see also https://github.com/Open-EO/openeo-hub/issues/78#issuecomment-686645205).

christophfriedrich commented 4 years ago

Okay, the problem is the primary key of the collections table: It's set on service + api_version + id. So a collection is identified e.g. by https://earthengine.openeo.org + 1.0.0-rc.2 + COPERNICUS/S2.

This was done to minimise changes (see also #56), because the service URL most likely never changes, and I thought the same of the api_version field. But now it has happened: the GEE driver's api_version was changed from 1.0.0-rc.2 to 1.0.0, causing duplicate entries to occur.
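To illustrate the duplication, here is a minimal sketch that uses an in-memory Map as a stand-in for the collections table. The helper names (collectionKey, upsertCollection) are hypothetical and not taken from the Hub's code; only the key layout service + api_version + id comes from the description above.

```javascript
// Hypothetical in-memory stand-in for the collections table (not the Hub's actual code).
const collections = new Map();

// Composite primary key as described above: service + api_version + id.
function collectionKey(doc) {
  return [doc.service, doc.api_version, doc.id].join('|');
}

// Upsert: overwrite the entry if the key already exists, otherwise add a new one.
function upsertCollection(doc) {
  collections.set(collectionKey(doc), doc);
}

const service = 'https://earthengine.openeo.org';

// First crawl, while the driver still reports 1.0.0-rc.2.
upsertCollection({ service, api_version: '1.0.0-rc.2', id: 'COPERNICUS/S2' });
// Re-crawl with an unchanged version: same key, so no duplicate.
upsertCollection({ service, api_version: '1.0.0-rc.2', id: 'COPERNICUS/S2' });
const countBefore = collections.size;

// Crawl after the driver switched to 1.0.0: the key differs, so a second entry appears.
upsertCollection({ service, api_version: '1.0.0', id: 'COPERNICUS/S2' });
const countAfter = collections.size;

console.log(countBefore, countAfter); // 1 2
```

As long as api_version stays the same, re-crawling overwrites the existing entry. Once it changes, the same collection id is stored twice.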

But the grouping of all raw documents into the individual backend entries is done on the backend field, causing the old 1.0.0-rc.2 documents that were previously crawled to end up in the same aggregation as the fresh 1.0.0 ones, because they both belong to the https://earthengine.openeo.org/v1.0 backend.

And because one unsuccessful endpoint* is enough to deem the whole backend unsuccessful, GEE was flagged as such. https://github.com/Open-EO/openeo-hub/blob/6c64fdfb77db018b730baf549f5596e3694654c1/src/dbqueries.js#L15

* of any of the endpoints /, /collections, /processes, /service_types, /output_formats, /file_formats, /udf_runtimes
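A rough sketch of how that flagging plays out, assuming a document counts as unsuccessful when its unsuccessfulCrawls value is greater than 0. The real query is the one linked above in src/dbqueries.js and may differ in detail; the function names and sample documents here are made up for illustration.

```javascript
// Hypothetical raw documents; field names follow the discussion, values are illustrative.
const backend = 'https://earthengine.openeo.org/v1.0';
const rawDocs = [
  { backend, api_version: '1.0.0-rc.2', path: '/', unsuccessfulCrawls: 1 }, // stale leftover
  { backend, api_version: '1.0.0', path: '/', unsuccessfulCrawls: 0 },      // fresh
  { backend, api_version: '1.0.0', path: '/collections', unsuccessfulCrawls: 0 },
];

// Group documents into a Map from key to the list of documents sharing that key.
function groupBy(docs, keyFn) {
  const groups = new Map();
  for (const doc of docs) {
    const key = keyFn(doc);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(doc);
  }
  return groups;
}

// Assumed rule: a single unsuccessful document marks the whole group as unsuccessful.
function flagBackends(docs) {
  const result = {};
  for (const [key, group] of groupBy(docs, (d) => d.backend)) {
    result[key] = {
      documents: group.length,
      unsuccessful: group.some((d) => d.unsuccessfulCrawls > 0),
    };
  }
  return result;
}

const flags = flagBackends(rawDocs);
console.log(flags);
```

The stale 1.0.0-rc.2 document lands in the same group as the fresh ones, so the backend is flagged even though every current endpoint was crawled successfully.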

Questions arising from this:

1. Was a changing api_version field a one-time issue in the current dev phase or should that be treated as a use case that could happen regularly?
2. Should the grouping be changed to service+api_version?
3. Should it need more than just a single failed endpoint to cause flagging?

For 2. I'd say yes, it kinda would've prevented this bug (when the service+api_version change was introduced, it should've been changed anyway; I probably just overlooked it).

For 3. I'd say no, but how crawling errors are communicated to the user should be discussed anyway, which is why #23 exists.

m-mohr commented 4 years ago

Thanks for investigating.

> Was a changing api_version field a one-time issue in the current dev phase or should that be treated as a use case that could happen regularly?

That can happen regularly (like every x months or so).

> Should the grouping be changed to service+api_version?

I don't fully understand that yet. Can you use the https://earthengine.openeo.org/v1.0 URL?

> Should it need more than just a single failed endpoint to cause flagging?

Fine with "no".

christophfriedrich commented 4 years ago

> I don't fully understand that yet.

Assume a backend changes its api_version. After crawling there will be two documents for the / endpoint in the database's raw table: one with 1.0.0-rc.2 (old, now has unsuccessfulCrawls=1) and one with 1.0.0 (new).

Now the difference is:

Grouping on backend (i.e. https://earthengine.openeo.org/v1.0):

-> 1 backend

Grouping on service+api_version:

-> 2 backends
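The same comparison as a small sketch, counting distinct groups for the two documents described above. The countGroups helper and the exact document shape are assumptions for illustration only.

```javascript
// The two "/" documents from the example above (shape is illustrative).
const rawDocs = [
  {
    service: 'https://earthengine.openeo.org',
    backend: 'https://earthengine.openeo.org/v1.0',
    api_version: '1.0.0-rc.2',
    path: '/',
    unsuccessfulCrawls: 1, // old document
  },
  {
    service: 'https://earthengine.openeo.org',
    backend: 'https://earthengine.openeo.org/v1.0',
    api_version: '1.0.0',
    path: '/',
    unsuccessfulCrawls: 0, // new document
  },
];

// Count how many distinct groups a given grouping key produces.
function countGroups(docs, keyFn) {
  return new Set(docs.map(keyFn)).size;
}

const byBackend = countGroups(rawDocs, (d) => d.backend);
const byServiceAndVersion = countGroups(rawDocs, (d) => `${d.service}|${d.api_version}`);

console.log(byBackend, byServiceAndVersion); // 1 2
```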

m-mohr commented 4 years ago

Grouping on backend seems correct, but I guess the question is why crawling doesn't drop the old collections. It sounds like the original issue is that, on crawling, the old data doesn't get removed or correctly updated, right?

christophfriedrich commented 4 years ago

That's right, and the cause was the same: Old data was removed based on the backend field -- and because both old and new data had the same backend value, nothing was deleted. I've now changed the deletion step to service+api_version too, so this bug is fixed.
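One way to picture the difference, assuming the cleanup keeps only those stored documents whose key also appears in the latest crawl. This is an interpretation of the description above, not the Hub's actual deletion code; removeStale and the document shape are hypothetical.

```javascript
const service = 'https://earthengine.openeo.org';
const backend = 'https://earthengine.openeo.org/v1.0';

// Documents already in the database: one stale, one current (illustrative values).
const stored = [
  { service, backend, api_version: '1.0.0-rc.2', path: '/' },
  { service, backend, api_version: '1.0.0', path: '/' },
];
// Documents produced by the latest crawl.
const fresh = [{ service, backend, api_version: '1.0.0', path: '/' }];

// Assumed cleanup rule: drop every stored document whose key was not seen in the latest crawl.
function removeStale(storedDocs, freshDocs, keyFn) {
  const freshKeys = new Set(freshDocs.map(keyFn));
  return storedDocs.filter((d) => freshKeys.has(keyFn(d)));
}

// Keyed on backend: old and new share the value, so nothing is removed.
const keptByBackend = removeStale(stored, fresh, (d) => d.backend);
// Keyed on service+api_version: the 1.0.0-rc.2 document no longer matches and is dropped.
const keptByVersion = removeStale(stored, fresh, (d) => `${d.service}|${d.api_version}`);

console.log(keptByBackend.length, keptByVersion.length); // 2 1
```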

I tested the crawling several times and it worked both for the GEE case and also for EODC -- they changed to 1.0.0 today (at least I believe so, as the deployed Hub still lists 1.0.0-rc.2 but the live backend reports 1.0.0). So I guess after tonight's crawl the deployed Hub will list EODC incorrectly too. But as soon as this fix is deployed and the next crawl has run, it will go away :)

christophfriedrich commented 4 years ago

I'm confident this works fine, so I merged it onto master; feel free to deploy it whenever you've got the time (it's not super urgent IMO).

m-mohr commented 4 years ago

It seems fixed. I restarted the server this morning and couldn't reproduce it any longer (although the server is on the dev branch, I think).

christophfriedrich commented 4 years ago

Right now, dev and master are identical. As long as you only pull when I tell you to do so you can leave it on dev :D

Open-EO / openeo-hub

Collection crawling seems to be broken #79