gbif / crawler

The crawling pieces - ws, cli, coordinator
Apache License 2.0

Crawl scheduler stops working after a few days #19

Closed MattBlissett closed 2 years ago

MattBlissett commented 6 years ago

After a few days, the crawl scheduler gets stuck and no longer runs.

I've added a Nagios monitor that checks for changes to its log file, so we can restart it when this happens (pkill scheduler; ./start-crawl-scheduler).
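
For illustration only, a log-staleness check along these lines could be sketched as below. This is not the actual monitor: the class name, the arguments, and the age threshold are all assumptions. The only convention relied on is the standard Nagios plugin exit codes (0 for OK, 2 for CRITICAL).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

public class LogStalenessCheck {

  /** Returns true if the file was last modified more than maxAge before now. */
  static boolean isStale(Path logFile, Duration maxAge, Instant now) throws IOException {
    Instant lastModified = Files.getLastModifiedTime(logFile).toInstant();
    return Duration.between(lastModified, now).compareTo(maxAge) > 0;
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical usage: java LogStalenessCheck <log-file> <max-age-minutes>
    Path logFile = Path.of(args[0]);
    Duration maxAge = Duration.ofMinutes(Long.parseLong(args[1]));
    boolean stale = isStale(logFile, maxAge, Instant.now());
    System.out.println(stale ? "CRITICAL: log file is stale" : "OK: log file is fresh");
    // Nagios plugin convention: exit code 0 means OK, 2 means CRITICAL.
    System.exit(stale ? 2 : 0);
  }
}
```

A check like this only detects that the scheduler went quiet; it says nothing about why.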

timrobertson100 commented 2 years ago

This is becoming more frequent; we should diagnose and fix it.

muttcg commented 2 years ago

I found a place where it gets stuck, so I added extra logging in CrawlSchedulerService:

LOG.debug("datasetService.list(pageable); {}", pageable);
PagingResponse<Dataset> datasets = datasetService.list(pageable);
isEndOfRecords = datasets.isEndOfRecords();
LOG.debug("for (Dataset dataset : datasets.getResults())");

After some time, the last logged line was:

DEBUG [12-17 23:49:54,226+0000] [CrawlSchedulerService RUNNING] org.gbif.crawler.scheduler.CrawlSchedulerService: datasetService.list(pageable); PageableBase[offset=55160, limit=20]

So datasetService.list(pageable) gets stuck during pagination.

muttcg commented 2 years ago

The issue was caused by an old gbif-api version: the dataset client couldn't serialize a new enum value and threw an exception. That exception silently shut down the scheduled thread but didn't crash the whole app.
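
This failure mode can be reproduced with the plain JDK scheduler. The sketch below is not the crawler's code and assumes nothing about how CrawlSchedulerService is implemented; GuardedSchedule, guarded and countRuns are hypothetical names. It relies on the documented behaviour of ScheduledExecutorService.scheduleAtFixedRate: if any execution throws, subsequent executions are suppressed and the rest of the JVM keeps running. Wrapping the task in a try/catch is one common way to keep the schedule alive.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class GuardedSchedule {

  /** Wraps a task so that a runtime exception is reported instead of escaping to the executor. */
  static Runnable guarded(Runnable task) {
    return () -> {
      try {
        task.run();
      } catch (RuntimeException e) {
        // Report and swallow: the next scheduled run still happens.
        System.err.println("Scheduled run failed, will retry on next tick: " + e);
      }
    };
  }

  /** Schedules an always-failing task every 10 ms for about 200 ms and returns how often it started. */
  static int countRuns(boolean guard) throws InterruptedException {
    AtomicInteger runs = new AtomicInteger();
    Runnable failing = () -> {
      runs.incrementAndGet();
      throw new IllegalStateException("simulated client failure");
    };
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    executor.scheduleAtFixedRate(guard ? guarded(failing) : failing, 0, 10, TimeUnit.MILLISECONDS);
    Thread.sleep(200);
    executor.shutdownNow();
    return runs.get();
  }

  public static void main(String[] args) throws InterruptedException {
    // Unguarded, the task starts once and is never rescheduled; guarded, it keeps running.
    System.out.println("unguarded runs: " + countRuns(false));
    System.out.println("guarded runs:   " + countRuns(true));
  }
}
```

The guard keeps the schedule running but does not fix the underlying failure, so the caught exception should still be logged at a level that gets noticed.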

The fix has been deployed to PROD.

timrobertson100 commented 2 years ago

Nice find