Closed MattBlissett closed 2 years ago
This is becoming more frequent and we should diagnose and fix this.
I found a place where it gets stuck, so I added extra logging to CrawlSchedulerService:
LOG.debug("datasetService.list(pageable); {}", pageable);
PagingResponse<Dataset> datasets = datasetService.list(pageable);
isEndOfRecords = datasets.isEndOfRecords();
LOG.debug("for (Dataset dataset : datasets.getResults())");
After some time, the last logged line was:
DEBUG [12-17 23:49:54,226+0000] [CrawlSchedulerService RUNNING] org.gbif.crawler.scheduler.CrawlSchedulerService: datasetService.list(pageable); PageableBase[offset=55160, limit=20]
So datasetService.list(pageable) gets stuck during pagination.
The issue was caused by an old gbif-api version: the dataset client couldn't serialize a new enum value and threw an exception. That exception silently shut down the scheduled thread, but didn't crash the whole app.
Fix has been deployed to PROD
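For anyone hitting the same symptom, here is a minimal sketch of the failure mode described above. It is not the CrawlSchedulerService code: it assumes a plain java.util.concurrent.ScheduledExecutorService, and the class name SilentSchedulerDemo, the countRuns helper and the simulated exception are made up for illustration. With scheduleAtFixedRate, an exception that escapes the task suppresses all later executions while the JVM keeps running, so the process looks alive but the schedule is dead. Catching inside the task keeps it running.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not part of the crawler code base.
class SilentSchedulerDemo {

  /**
   * Schedules a task every 10 ms for about 300 ms. The task throws on its
   * third run. Returns how many times the task was started.
   */
  public static int countRuns(boolean catchInsideTask) {
    AtomicInteger runs = new AtomicInteger();
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();

    // Simulates the client failing on an enum value it does not know.
    Runnable failing = () -> {
      if (runs.incrementAndGet() == 3) {
        throw new IllegalArgumentException("unknown enum constant (simulated)");
      }
    };

    // Same task, but the exception is caught and logged inside the task,
    // so it never reaches the executor.
    Runnable guarded = () -> {
      try {
        failing.run();
      } catch (RuntimeException e) {
        System.err.println("Scheduled run failed, will retry: " + e.getMessage());
      }
    };

    executor.scheduleAtFixedRate(catchInsideTask ? guarded : failing, 0, 10, TimeUnit.MILLISECONDS);
    try {
      Thread.sleep(300);
      executor.shutdownNow();
      executor.awaitTermination(1, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return runs.get();
  }

  public static void main(String[] args) {
    // Unguarded: the schedule stops after the failing run, with no log output.
    System.out.println("unguarded runs: " + countRuns(false));
    // Guarded: the failure is logged and later runs still happen.
    System.out.println("guarded runs:   " + countRuns(true));
  }
}
```

The unguarded variant stops at the failing run without printing anything, which matches the "last logged line" symptom above. Whether to catch inside the task or let the failure take the whole process down is a design choice; either is easier to notice than a silently cancelled schedule.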
Nice find
After a few days, the crawl scheduler gets stuck and no longer runs.
I've added a Nagios monitor to check for changes to its log file, so we can restart it when this happens (pkill scheduler; ./start-crawl-scheduler).
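As an illustration of the kind of check such a monitor performs, here is a rough sketch in Java. It is not the actual Nagios check: the class name LogStalenessCheck, the 30-minute threshold and the command-line argument are assumptions. It treats the log as stale when it is missing or has not been modified within the allowed age, and exits with 0 or 2, the conventional Nagios plugin codes for OK and CRITICAL.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.Instant;

// Hypothetical check, not the monitor that was deployed.
class LogStalenessCheck {

  /** Returns true if the log file is missing or older than maxAge at time 'now'. */
  public static boolean isStale(Path logFile, Duration maxAge, Instant now) {
    try {
      Instant lastModified = Files.getLastModifiedTime(logFile).toInstant();
      return Duration.between(lastModified, now).compareTo(maxAge) > 0;
    } catch (IOException e) {
      // A missing or unreadable log is treated the same as a stale one.
      return true;
    }
  }

  public static void main(String[] args) {
    if (args.length != 1) {
      System.out.println("UNKNOWN: usage: LogStalenessCheck <log-file>");
      System.exit(3);
    }
    Path log = Paths.get(args[0]);
    // The threshold is an assumption; pick one longer than the scheduler's quietest period.
    Duration maxAge = Duration.ofMinutes(30);
    if (isStale(log, maxAge, Instant.now())) {
      System.out.println("CRITICAL: " + log + " not updated in the last " + maxAge.toMinutes() + " minutes");
      System.exit(2);
    }
    System.out.println("OK: " + log + " was updated recently");
    System.exit(0);
  }
}
```

This only detects the hang after the fact; the restart is still manual.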