Review and hardening of harvesting engine

giohappy commented 2 years ago

Introduction

The harvesting engine was recently introduced in the master branch and will be delivered with GN 4.x. This service is in charge of harvesting resources from remote services, based on several configuration options that drive its scheduling, the ingestion logic, and the management of updates.

It's based on several Celery mechanisms, which are employed to orchestrate and perform the tasks. It also supports the implementation of custom harvesters, beyond the ones implemented in GeoNode (WMS, GeoNode, ArcGIS Server).

It's also the engine behind the Remote Services. With the introduction of the harvesting engine, these have become a simplified interface on top of WMS, GeoNode, and ArcGIS harvester instances.

Issues

The GeoNode master demo instance has been used extensively to test the harvesters, by configuring several Remote Services and some other harvesters (configured through the Django admin). This stress testing revealed some fragilities in the management of harvester jobs and the execution of some of them. In particular, we noticed problems with:

- some jobs for the retrieval of harvestable resources get stuck
- some jobs for the retrieval of updates to the configured harvestable resources get stuck
- the state of the harvesters is not always restored completely after forcibly ending ongoing or scheduled jobs

Analysis and hardening

We want to investigate whether any of the following factors can lead to failures in the management of harvesting jobs, and to what extent:
- restarts of GeoNode and/or Celery Docker services (e.g. during redeploys)
- network latencies/errors between GeoNode, RabbitMQ and Celery workers
- number of concurrently scheduled jobs

As a result of the analysis, we want to implement any useful hardening to mitigate the reported problems and improve the reliability of the Harvesting engine.

italogsfernandes commented 2 years ago

@giohappy Besides the examples in the docs, do you have more harvester sources that I can test/try?

gannebamm commented 2 years ago

On Friday (1st April) our https://atlas.thuenen.de GeoNode instance will be publicly available. It runs 3.3.x.

giohappy commented 2 years ago

@italogsfernandes

https://risk.spc.int/ (GeoNode harvester)
http://ihp-wins.unesco.org (GeoNode, but I'm not 100% sure this version is compliant with the expected 3.3.x version for the harvester)
https://www.geonode-gfdrrlab.org/geoserver/ows (WMS)
http://ihp-wins.unesco.org/geoserver/ows (WMS)

italogsfernandes commented 2 years ago

For the three cases to be analyzed:

restarts of GeoNode and/or Celery Docker services (e.g. during redeploys):
- The job gets stuck and is not started again after Celery restarts, so we need to "re-scan" the harvest source. If I remember correctly, Celery raises an exception when the termination process starts; this exception could be used to handle some cases. Also, when Celery starts, an automatic job that restarts the stopped ones could be added.
network latencies/errors between GeoNode, RabbitMQ and Celery workers:
- It was not verified.
number of concurrently scheduled jobs:
- I didn't find any problem here.
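
To make the idea for the restart case a bit more concrete, here is a rough sketch in plain Python. It is only an illustration, not GeoNode or Celery code: `WorkerTerminated`, `Job`, `run_job` and `jobs_to_restart` are hypothetical names, and the status strings are assumptions rather than the real harvesting session states.

```python
from dataclasses import dataclass
from typing import Callable


class WorkerTerminated(Exception):
    """Hypothetical stand-in for the exception raised when the worker is stopped."""


@dataclass
class Job:
    """Hypothetical, simplified view of a harvesting job."""
    name: str
    status: str = "pending"


def run_job(job: Job, work: Callable[[], None]) -> None:
    """Run `work` and always leave the job with an explicit final status."""
    job.status = "running"
    try:
        work()
    except WorkerTerminated:
        # Record the interruption so the job is not left as "running" forever.
        job.status = "aborted"
        raise
    except Exception:
        job.status = "failed"
        raise
    job.status = "finished"


def jobs_to_restart(jobs: list[Job]) -> list[Job]:
    """On worker start, select the jobs interrupted by the previous shutdown."""
    return [job for job in jobs if job.status == "aborted"]
```

The point of the sketch is that an interrupted job ends up in a distinct state, which a start-up step can then query, instead of staying in "running" with nothing actually executing it.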

giohappy commented 2 years ago

As we suspected, the restarts are the main issue here. @italogsfernandes in theory a self-healing solution would be needed here, to let Celery restore the status of running harvester jobs, but it would be quite complex work. For the moment the simplest solution would be to clean up the status of any "running" job after a restart. Scheduled harvesters will restart the jobs by themselves. Unscheduled jobs (e.g. the one created when harvesting resources for a Remote Service) will have to be re-run manually.

What's your opinion?

I would like to also hear an opinion from @ricardogsilva on this.

ricardogsilva commented 2 years ago

@giohappy I agree with you that the simple solution of manually cleaning up the status of any job that happened to be running when a restart was performed is likely the way to go for now.

Implementing some sort of self-healing would be nice, but if it is for recovering from a hard restart of celery I think the implementation would be a bit complex. Maybe we can simply have some sort of post-restart script that automatically resets the statuses back to READY and re-launches jobs immediately? Despite bringing this up, I'm not 100% sure whether this would be a good approach though. Sometimes requiring human intervention after a restart/reboot is not a bad idea. I guess we can discuss this further and try to come up with a better solution.

giohappy commented 2 years ago

For the moment the proposal is to just do a clean-up of the jobs when celery restarts. Unscheduled jobs will have to be rerun manually.
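
As a minimal sketch of what that clean-up could look like, assuming a simplified in-memory view of the jobs: `HarvestJob`, `clean_up_after_restart` and the status strings are hypothetical and do not match the actual GeoNode models.

```python
from dataclasses import dataclass


@dataclass
class HarvestJob:
    """Hypothetical, simplified record of a harvesting job."""
    harvester: str
    status: str      # assumed values: "pending", "running", "finished", "aborted"
    scheduled: bool  # True if the harvester runs on a periodic schedule


def clean_up_after_restart(jobs: list[HarvestJob]) -> list[HarvestJob]:
    """Mark every job still flagged as running as aborted.

    Returns the unscheduled jobs that were cleaned up, since nothing
    will re-launch them automatically.
    """
    needs_manual_rerun = []
    for job in jobs:
        if job.status != "running":
            continue
        job.status = "aborted"
        if not job.scheduled:
            needs_manual_rerun.append(job)
    return needs_manual_rerun
```

Returning the unscheduled jobs separately would make it easy to log or display which ones need a manual re-run, while scheduled harvesters are simply picked up again at their next run.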

giohappy commented 2 years ago

@ricardogsilva @afabiani the proposal from @italogsfernandes is to adopt the following for celery restarts:

celery --workdir /usr/src/geonode --app geonode.celery_app purge -f

If you think it would work, let's create a PR for it.

GeoNode / geonode

Review and hardening of harvesting engine #8982
