codeforboston / Legalhousing-Scrapple

Getting all the data off websites

asynchronous schedulable crawler jobs now working #13

Closed jayventi closed 6 years ago

jayventi commented 6 years ago

I have parallel, manageable schedules running which control the crawlers (there can be multiple crawlers). Each crawler gets its own separate scheduler process which can be started or stopped independently by name.
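A minimal sketch of what one-scheduler-process-per-crawler could look like, assuming each crawl is launched via the standard `scrapy crawl` CLI and scheduled with `multiprocessing`; the interval, helper names, and process registry are illustrative, not the project's actual implementation:

```python
import multiprocessing
import subprocess
import time

def scheduler_loop(scraper_name, interval_sec=3600):
    """Run the named spider on a fixed interval (interval is illustrative)."""
    while True:
        # Launch one crawl via the standard `scrapy crawl` CLI.
        subprocess.run(["scrapy", "crawl", scraper_name])
        time.sleep(interval_sec)

# One scheduler process per scraper, tracked by name.
scheduler_procs = {}

def start_spider_sch(scraper_name):
    """Start a scheduler process for this scraper if none is running."""
    if scraper_name in scheduler_procs:
        return False
    proc = multiprocessing.Process(
        target=scheduler_loop, args=(scraper_name,), daemon=True)
    proc.start()
    scheduler_procs[scraper_name] = proc
    return True

def stop_spider_sch(scraper_name):
    """Terminate the scheduler process for this scraper only."""
    proc = scheduler_procs.pop(scraper_name, None)
    if proc is None:
        return False
    proc.terminate()
    return True
```

Keeping the registry keyed by scraper name is what lets one schedule be stopped without touching the others.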

I changed the names of the endpoints. The following starts the scheduler: http://localhost:5555/start_spider_sch?scraper=craigslist

This stops the scheduler: http://localhost:5555/stop_spider_sch?scraper=craigslist
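A minimal sketch of how those two endpoints could be wired up, assuming a Flask app on port 5555 and reusing the start/stop helpers from the sketch above; the route names and the `?scraper=` query parameter match the URLs in this comment, but the handler bodies are illustrative:

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/start_spider_sch")
def start_spider_sch_endpoint():
    name = request.args.get("scraper")  # e.g. ?scraper=craigslist
    started = start_spider_sch(name)    # helper from the sketch above
    return {"scraper": name, "started": started}

@app.route("/stop_spider_sch")
def stop_spider_sch_endpoint():
    name = request.args.get("scraper")
    stopped = stop_spider_sch(name)
    return {"scraper": name, "stopped": stopped}

if __name__ == "__main__":
    app.run(port=5555)
```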

However, it turns out we do have the listing_id duplication problem I thought might occur, so we will need to come up with a fix. I favor keeping a set of already-seen listing_ids and having the crawlers themselves use it to filter out already-processed ids.
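One possible shape for that filter is a Scrapy item pipeline backed by a persisted set of seen ids. This is a minimal sketch assuming items carry a `listing_id` field and that a plain text file is an acceptable shared store; the class name, file name, and field name are all illustrative:

```python
from scrapy.exceptions import DropItem

SEEN_IDS_FILE = "seen_listing_ids.txt"  # illustrative shared store

class ListingIdDupeFilterPipeline:
    """Drop items whose listing_id has already been processed."""

    def open_spider(self, spider):
        # Load previously seen ids into a set for O(1) membership tests.
        try:
            with open(SEEN_IDS_FILE) as f:
                self.seen_ids = {line.strip() for line in f}
        except FileNotFoundError:
            self.seen_ids = set()

    def process_item(self, item, spider):
        listing_id = str(item["listing_id"])
        if listing_id in self.seen_ids:
            raise DropItem(f"duplicate listing_id: {listing_id}")
        self.seen_ids.add(listing_id)
        # Append immediately so later runs pick up the new id.
        with open(SEEN_IDS_FILE, "a") as f:
            f.write(listing_id + "\n")
        return item
```

Doing the filtering in the crawlers themselves (rather than at insert time in the database) also saves re-downloading and re-parsing listings we have already processed.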