giohappy opened 2 years ago
@giohappy Besides the examples in the docs, do you have more harvester sources that I can test/try?
On Friday (1st April) our https://atlas.thuenen.de GeoNode instance will be publicly available. It runs GeoNode 3.3.x.
@italogsfernandes
For the three cases to be analyzed:
As we suspected, the restarts are the main issue here. @italogsfernandes In theory a self-healing solution would be needed, letting Celery restore the status of running harvester jobs, but that would be quite complex work. For the moment the simplest solution would be to clean up the status of any "running" job after a restart. Scheduled harvesters will restart their jobs by themselves. Unscheduled jobs (e.g. the ones created when harvesting resources for a Remote Service) will have to be re-run manually.
What's your opinion?
I would like to also hear an opinion from @ricardogsilva on this.
@giohappy I agree with you that the simple solution of manually cleaning up the status of any job that happened to be running when the restart occurred is likely the way to go for now.
Implementing some sort of self-healing would be nice, but if it is meant to recover from a hard restart of Celery I think the implementation would be a bit complex. Maybe we can simply have some sort of post-restart script that automatically resets the statuses back to READY and re-launches jobs immediately? Despite bringing this up, I'm not 100% sure this would be a good approach though. Sometimes requiring human intervention after a restart/reboot is not a bad idea - I guess we can discuss this further and try to come up with a better solution.
For the moment the proposal is simply to clean up the jobs when Celery restarts. Unscheduled jobs will have to be re-run manually.
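A minimal sketch of the proposed clean-up, using hypothetical status names and an in-memory job list (GeoNode's actual harvesting models and status values may differ):

```python
from dataclasses import dataclass

# Hypothetical status constants for illustration only; GeoNode's
# real harvesting models use their own status names.
STATUS_READY = "ready"
STATUS_RUNNING = "on-going"
STATUS_ABORTED = "aborted"


@dataclass
class HarvestingJob:
    id: int
    status: str
    scheduled: bool


def cleanup_stale_jobs(jobs):
    """Reset jobs left 'running' by a hard Celery restart.

    Scheduled jobs go back to READY (their scheduler will re-launch
    them); unscheduled jobs are marked ABORTED and must be re-run
    manually, as proposed above.
    """
    for job in jobs:
        if job.status == STATUS_RUNNING:
            job.status = STATUS_READY if job.scheduled else STATUS_ABORTED
    return jobs
```

In a real deployment this logic would run once at worker startup, querying the database for stale jobs instead of iterating an in-memory list.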
@ricardogsilva @afabiani the proposal from @italogsfernandes is to adopt the following for celery restarts:
```shell
celery --workdir /usr/src/geonode --app geonode.celery_app purge -f
```
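For context, `celery purge` discards all messages still waiting in the configured queues; it does not affect tasks a worker has already picked up, and `-f` skips the interactive confirmation. One illustrative place to run it (a hypothetical entrypoint fragment, assuming the same paths as the command above) is just before the worker starts:

```shell
#!/bin/sh
# Illustrative entrypoint fragment (assumed deployment layout):
# drop any queued harvesting tasks left over from before the restart,
# then start the worker in the foreground.
celery --workdir /usr/src/geonode --app geonode.celery_app purge -f
exec celery --workdir /usr/src/geonode --app geonode.celery_app worker
```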
If you think it would work, let's create a PR for it.
Introduction
The harvesting engine was recently introduced in the master branch and will be delivered with GN 4.x. This service is in charge of harvesting resources from remote services, based on several configuration options driving its scheduling, the ingestion logic, and the management of updates.
It's based on several Celery mechanisms, which are employed to orchestrate and perform the tasks. It also supports the implementation of custom harvesters, beyond the ones implemented in GeoNode (WMS, GeoNode, ArcGIS Server).
It's also the engine behind the Remote Services. With the introduction of the harvesting engine, these have become a simplified interface on top of WMS, GeoNode, and ArcGIS harvester instances.
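As an illustration of the plugin pattern described above (this is not GeoNode's actual base-class API, whose names and signatures live in the `geonode.harvesting` package), a custom harvester boils down to a subclass that knows how to list and fetch resources from a remote service:

```python
import abc


class BaseHarvester(abc.ABC):
    """Illustrative stand-in for the engine's harvester contract;
    GeoNode's real base class has different names and signatures."""

    def __init__(self, remote_url: str):
        self.remote_url = remote_url

    @abc.abstractmethod
    def list_remote_resources(self) -> list:
        """Return identifiers of resources available on the remote service."""

    @abc.abstractmethod
    def fetch_resource(self, identifier: str) -> dict:
        """Retrieve a single remote resource as harvestable metadata."""


class StaticCatalogueHarvester(BaseHarvester):
    """Toy harvester over a fixed in-memory catalogue, standing in
    for a WMS/GeoNode/ArcGIS Server remote."""

    CATALOGUE = {
        "layer-1": {"title": "Layer one"},
        "layer-2": {"title": "Layer two"},
    }

    def list_remote_resources(self):
        return sorted(self.CATALOGUE)

    def fetch_resource(self, identifier):
        return self.CATALOGUE[identifier]
```

The engine would then drive any such subclass the same way: enumerate remote resources, fetch each one, and reconcile it with the local catalogue.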
Issues
The GeoNode master demo instance has been used extensively to test the harvesters, by configuring several Remote Services and some other harvesters (configured through the Django admin). This stress testing revealed some fragilities in the management of harvester jobs and the execution of some of them. In particular, we noticed problems with:
- the state of the harvesters is not always restored completely after forcibly ending ongoing or scheduled jobs
Analysis and hardening
We want to investigate whether any of the following factors can lead to failures in the management of harvesting jobs, and to what extent:
As a result of the analysis, we want to implement any useful hardening to mitigate the reported problems and improve the reliability of the Harvesting engine.