I have parallel, manageable schedules running that control the crawlers (there can be multiple). Each crawler gets its own separate schedule process, which can be started or stopped independently by name.
However, it turns out we do have the listing_id duplication problem I thought might occur, so we will need a fix. I favor keeping a set of already-processed listing_ids and having the crawlers themselves use it to filter out ids they have already handled.
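The filtering idea above can be sketched as a simple predicate over a set of seen ids. This is a hypothetical helper, not code from the project; how the seen set is persisted and loaded is left open.

```python
def make_listing_filter(seen_ids):
    """Return a predicate that is True only the first time a listing_id appears.

    `seen_ids` is the previously processed ids (loaded from wherever
    they are persisted); the closure also records new ids as it sees them.
    """
    seen = set(seen_ids)

    def is_new(listing_id):
        if listing_id in seen:
            return False  # already processed, skip
        seen.add(listing_id)
        return True

    return is_new
```

Each crawler would build one of these at startup and check `is_new(listing_id)` before processing a listing, dropping duplicates at the source rather than deduplicating downstream.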
I renamed the endpoints. The following starts the scheduler: http://localhost:5555/start_spider_sch?scraper=craigslist
This stops the scheduler: http://localhost:5555/stop_spider_sch?scraper=craigslist
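A sketch of how requests to those two endpoints might be parsed into an action and a crawler name. The function name and the `(action, scraper)` return shape are assumptions for illustration; only the paths and the `scraper` query parameter come from the URLs above.

```python
from urllib.parse import urlparse, parse_qs

def parse_scheduler_request(path):
    """Map a request path such as /start_spider_sch?scraper=craigslist
    to an (action, scraper) pair, or (None, None) if it doesn't match."""
    parsed = urlparse(path)
    actions = {
        "/start_spider_sch": "start",
        "/stop_spider_sch": "stop",
    }
    action = actions.get(parsed.path)
    # parse_qs returns lists of values; take the first scraper= value if present
    scraper = parse_qs(parsed.query).get("scraper", [None])[0]
    if action is None or scraper is None:
        return (None, None)
    return (action, scraper)
```

The server would then dispatch the result to whatever starts or stops the named crawler's schedule process.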