cshaley / crawl-cl

Aggregate relevant craigslist search results into a pandas DataFrame
http://craigslist.org

Too darn slow #2

Open cshaley opened 7 years ago

cshaley commented 7 years ago

This program is too slow when querying from multiple subdomains (different cities). How can we speed it up?

  1. Parallelize the requests calls (see the sketch after this list) - there is no reason we can't pull from 8 different sites at once on an 8-CPU machine. The requests calls (loading web pages) are by far the slowest part of this program.
  2. Restructure the main loop - its current design is awkward.
  3. General optimization of data flow and website parsing.
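
For point 1, a minimal sketch of parallelizing the page fetches with the standard library's `concurrent.futures` thread pool. The subdomain list, query, and `fetch` function below are placeholders, not names from this repo:

```python
# Sketch: fetch several craigslist subdomains in parallel threads.
# Threads work well here because the work is I/O-bound (waiting on HTTP),
# so the GIL is not the bottleneck.
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholder subdomains and query - adjust to however crawl-cl builds its URLs.
SUBDOMAINS = ["sfbay", "newyork", "chicago", "seattle"]
QUERY = "bicycle"

def fetch(subdomain):
    """Download one search-results page and return its HTML."""
    url = f"https://{subdomain}.craigslist.org/search/sss?query={QUERY}"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.text

# Run up to 8 requests at once; results come back in input order.
with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(fetch, SUBDOMAINS))
```
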
mgruben commented 7 years ago

(1) The host website may be purposefully slowing your access, since making numerous fast calls from the same IP address is, in practice, indistinguishable from a DoS attack.

(2) If (1) is not true, consider using workerpool as discussed in this relevant StackExchange question.
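
If (1) does turn out to be the issue, one mitigation is to keep concurrency low and pause between requests. A rough sketch using only the standard library rather than workerpool itself; the delay value and URLs are arbitrary placeholders:

```python
# Sketch: throttle the crawl so it does not look like a DoS attempt.
# A small pool plus a per-request pause keeps the request rate modest.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

def polite_fetch(url, delay=1.0):
    """Fetch one URL, then pause briefly before the worker takes another."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    time.sleep(delay)  # arbitrary politeness delay
    return resp.text

urls = [
    "https://sfbay.craigslist.org/search/sss?query=bicycle",   # placeholder URLs
    "https://newyork.craigslist.org/search/sss?query=bicycle",
]

# Only two workers, so at most two requests are in flight at any moment.
with ThreadPoolExecutor(max_workers=2) as pool:
    pages = list(pool.map(polite_fetch, urls))
```
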

cshaley commented 7 years ago

1) The main thing making it slow is that it submits a request for a webpage, waits for it to load, and then moves on. Ideally, I would be able to do that a few times in parallel or in some sort of asynchronous form.

2) I had trouble getting it to work at all using both joblib's Parallel(delayed) and multiprocessing pools. Perhaps that strategy will work better.
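
For what it's worth, joblib's default process-based backend needs everything it ships to workers to be picklable, which is a common source of failures; since this work is I/O-bound, a thread-based backend sidesteps that. A hedged sketch, where the `fetch` helper and URLs are stand-ins for however crawl-cl downloads a page:

```python
# Sketch: joblib Parallel(delayed) with the threading backend.
# Processes require picklable arguments and results; threads do not, and
# they are enough when the time is spent waiting on HTTP responses.
from joblib import Parallel, delayed

import requests

def fetch(url):
    """Download one page and return its HTML."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.text

urls = [
    "https://sfbay.craigslist.org/search/sss?query=bicycle",   # placeholder
    "https://newyork.craigslist.org/search/sss?query=bicycle",
]

# backend="threading" avoids spawning processes and pickling the results.
pages = Parallel(n_jobs=8, backend="threading")(delayed(fetch)(u) for u in urls)
```
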

mgruben commented 7 years ago

I agree that submitting numerous simultaneous requests should speed things up, so long as Python knows when they've all returned successfully.
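
One way to know exactly when everything has returned is `concurrent.futures.as_completed`, which yields each future the moment it finishes. A minimal sketch, with placeholder URLs:

```python
# Sketch: track completion of many in-flight requests.
# as_completed yields each future as it finishes, so by the end of the loop
# every page has either come back successfully or been reported as failed.
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

urls = [
    "https://sfbay.craigslist.org/search/sss?query=bicycle",   # placeholder
    "https://newyork.craigslist.org/search/sss?query=bicycle",
]

results = {}
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(requests.get, url, timeout=30): url for url in urls}
    for future in as_completed(futures):
        url = futures[future]
        try:
            results[url] = future.result().text
        except requests.RequestException as exc:
            print(f"{url} failed: {exc}")

# At this point all requests are accounted for, success or failure.
```
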