GeoNode / geonode

GeoNode is an open source platform that facilitates the creation, sharing, and collaborative use of geospatial data.
https://geonode.org/

Review and hardening of harvesting engine #8982

Open giohappy opened 2 years ago

giohappy commented 2 years ago

Introduction

The harvesting engine has been introduced recently in the master branch, which will be delivered with GN 4.x. This service is in charge of harvesting resources from remote services, based on several configuration options that drive its scheduling, the ingestion logic, and the management of updates.

It's based on several Celery mechanisms, which are employed to orchestrate and perform the tasks. It also supports the implementation of custom harvesters, beyond the ones implemented in GeoNode (WMS, GeoNode, ArcGIS Server).

It's also the engine behind the Remote Services. With the introduction of the harvesting engine, these have become a simplified interface on top of WMS, GeoNode, and ArcGIS harvester instances.

Issues

The GeoNode master demo instance has been used extensively to test the harvesters, by configuring several Remote Services and some other harvesters (configured through the Django admin). This stress testing revealed some fragilities in the management of harvester jobs and the execution of some of them. In particular, we noticed problems with:

As a result of the analysis, we want to implement any useful hardening to mitigate the reported problems and improve the reliability of the Harvesting engine.

italogsfernandes commented 2 years ago

@giohappy Besides the examples in the docs, do you have more harvester sources that I can test/try?

gannebamm commented 2 years ago

On Friday (1st April) our https://atlas.thuenen.de GeoNode instance will be publicly available. It runs on 3.3.x.

giohappy commented 2 years ago

@italogsfernandes

italogsfernandes commented 2 years ago

For the three cases to be analyzed:

giohappy commented 2 years ago

As we suspected, the restarts are the main issue here. @italogsfernandes in theory a self-healing solution would be needed here, to let Celery restore the status of running harvester jobs, but it would be quite complex work. For the moment the simplest solution would be to clean up the status of any "running" job after a restart. Scheduled harvesters will restart the jobs by themselves. Unscheduled jobs (e.g. the ones created when harvesting resources for a Remote Service) will have to be re-run manually.

What's your opinion?

I would like to also hear an opinion from @ricardogsilva on this.

ricardogsilva commented 2 years ago

@giohappy I agree with you that the simple solution of manually cleaning up the status of any job that happened to be running when a restart was performed is likely the way to go for now.

Implementing some sort of self-healing would be nice, but if it is for recovering from a hard restart of Celery I think the implementation would be a bit complex. Maybe we can simply have some sort of post-restart script that automatically resets the statuses back to READY and re-launches jobs immediately? Despite bringing this up, I'm not 100% sure whether this would be a good approach though. Sometimes requiring human intervention after a restart/reboot is not a bad idea - I guess we can discuss this further and try to come up with a better solution.

giohappy commented 2 years ago

For the moment the proposal is to just do a clean-up of the jobs when celery restarts. Unscheduled jobs will have to be rerun manually.
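
To make the clean-up proposal concrete, here is a minimal sketch of the logic in plain Python. It is not GeoNode code: `JobStatus`, `HarvestJob`, and `cleanup_after_restart` are hypothetical names, and the real harvesting models may use different fields and status values. It only shows the rule discussed in this thread: any job still marked as running after a restart is reset, and unscheduled jobs are reported so they can be re-run manually.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class JobStatus(Enum):
    # Hypothetical status values; the real model may name these differently.
    READY = "ready"
    RUNNING = "running"
    FINISHED = "finished"


@dataclass
class HarvestJob:
    # Hypothetical stand-in for a harvester job record.
    job_id: int
    status: JobStatus
    scheduled: bool  # True if a periodic scheduler will start the job again


def cleanup_after_restart(jobs: Iterable[HarvestJob]) -> List[int]:
    """Reset every job left in RUNNING by a restart.

    Returns the ids of unscheduled jobs, which nobody will restart
    automatically and therefore need a manual re-run.
    """
    needs_manual_rerun = []
    for job in jobs:
        if job.status is not JobStatus.RUNNING:
            continue  # finished or idle jobs are left untouched
        job.status = JobStatus.READY
        if not job.scheduled:
            needs_manual_rerun.append(job.job_id)
    return needs_manual_rerun
```

In a real deployment the same rule would presumably run against the database once, right after the workers come back up, and the returned ids could be logged for whoever has to re-run the unscheduled jobs.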

giohappy commented 2 years ago

@ricardogsilva @afabiani the proposal from @italogsfernandes is to adopt the following for celery restarts:

celery --workdir /usr/src/geonode --app geonode.celery_app purge -f

If you think it would work, let's create a PR for it.
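
As a rough illustration of how the purge command above could be called from a restart hook, here is a small Python wrapper. The function names `build_purge_command` and `purge_queues` are hypothetical, and the default work directory and app name are simply the values quoted in this thread; adjust them to the actual deployment.

```python
import subprocess
from typing import List


def build_purge_command(workdir: str, app: str) -> List[str]:
    # Same arguments as the command quoted in the thread.
    # "-f" skips the interactive confirmation prompt.
    return ["celery", "--workdir", workdir, "--app", app, "purge", "-f"]


def purge_queues(
    workdir: str = "/usr/src/geonode",
    app: str = "geonode.celery_app",
) -> int:
    """Run the purge command and return its exit code (0 means success)."""
    result = subprocess.run(build_purge_command(workdir, app), check=False)
    return result.returncode
```

Note that `celery purge` only discards messages still waiting in the task queues. It does not change any job status stored in the database, so a status clean-up would still be a separate step.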