Open TheFifthFreedom opened 9 years ago
This is a late reply, but I only recently started digging into this project.
I also don't really understand how @NikolaiT intended `num_workers` to work, since the threads are not much help in Selenium mode. I reworked the loop and the code after it so that the threads no longer open all the browser windows at once, but start in batches of the size given by `num_workers`:

    num_worker = 0
    for search_engine in search_engines:
        for proxy in proxies:
            # for worker in range(num_workers):
            num_worker += 1
            workers.put(
                ScrapeWorkerFactory(
                    config,
                    cache_manager=cache_manager,
                    mode=method,
                    proxy=proxy,
                    search_engine=search_engine,
                    session=session,
                    db_lock=db_lock,
                    cache_lock=cache_lock,
                    scraper_search=scraper_search,
                    captcha_lock=captcha_lock,
                    progress_queue=q,
                    browser_num=num_worker
                )
            )

    # here we look for suitable workers
    # for all jobs created.
    for job in scrape_jobs:
        while True:
            worker = workers.get()
            # put the worker back whether or not it takes the job,
            # so unsuitable workers are not dropped from the queue
            workers.put(worker)
            if worker.is_suitabe(job):
                worker.add_job(job)
                break

    threads = []
    while not workers.empty():
        worker = workers.get()
        thread = worker.get_worker()
        if thread:
            threads.append(thread)

    # this is the old code
    # for t in threads:
    #     t.join()
    # changed for the following:
    # start and join the threads in batches of num_workers
    # (num_workers must be at least 1, otherwise this loop never ends)
    num_thread = 0
    while num_thread < len(threads):
        for t in threads[num_thread:num_thread + num_workers]:
            t.start()
        for t in threads[num_thread:num_thread + num_workers]:
            t.join()
        num_thread += num_workers

    # after threads are done, stop the progress queue.

It's already working much better in my opinion.
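
For anyone who wants to try the batching idea outside the project, here is a minimal self-contained sketch. `run_in_batches` and `fake_scrape` are names I made up for illustration and are not part of GoogleScraper; the sleeping threads just stand in for browser workers.

```python
import threading
import time


def run_in_batches(threads, batch_size):
    """Start not-yet-started threads batch_size at a time.

    Each batch is joined before the next one starts, so at most
    batch_size threads from this list run at the same moment.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(threads), batch_size):
        batch = threads[start:start + batch_size]
        for t in batch:
            t.start()
        for t in batch:
            t.join()


# Demo: count how many fake workers are active at once.
state = {"active": 0, "peak": 0}
state_lock = threading.Lock()


def fake_scrape():
    # Stand-in for a Selenium worker (hypothetical).
    with state_lock:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
    time.sleep(0.05)
    with state_lock:
        state["active"] -= 1


demo_threads = [threading.Thread(target=fake_scrape) for _ in range(7)]
run_in_batches(demo_threads, batch_size=3)
print("peak concurrent workers:", state["peak"])
```

One drawback of this approach is that a batch only finishes when its slowest thread does, so some slots sit idle in the meantime. A semaphore or a thread pool would keep all slots busy, but that would be a bigger change to the existing code.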
I've noticed that in the loop that creates `ScrapeWorkerFactory`s in `core.py`, there's a line that loops through every proxy in the given file (if one chooses to use one), which ends up creating more browser instances than the limit set by your config's `num_workers`.
Not only is this not the behavior we want, it might also crash your machine if you have a set of, say, 100 proxies. I believe one solution to this problem would be to remove that loop entirely and pick a proxy each time we loop through `num_workers`.
What do you think @NikolaiT ?
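
To make the proposal concrete, here is a rough sketch of picking one proxy per worker instead of creating a worker per proxy. It is only an illustration: `plan_workers` is a hypothetical helper, not something in `core.py`, and cycling through the proxies round-robin is my assumption (picking one at random would work too).

```python
from itertools import cycle


def plan_workers(search_engines, proxies, num_workers):
    """Return (search_engine, proxy, browser_num) for each worker to create.

    The number of workers depends only on the search engines and
    num_workers, never on how many proxies are configured.
    """
    # Fall back to a single "no proxy" entry when no proxies are given.
    proxy_pool = cycle(proxies if proxies else [None])
    plan = []
    for search_engine in search_engines:
        for browser_num in range(1, num_workers + 1):
            plan.append((search_engine, next(proxy_pool), browser_num))
    return plan


# 100 proxies, but only 2 engines * 4 workers = 8 planned browser instances.
proxies = ["proxy-%d" % i for i in range(100)]
plan = plan_workers(["google", "bing"], proxies, num_workers=4)
print(len(plan))
```

Each planned entry would then be turned into one `ScrapeWorkerFactory(...)` call with the chosen `proxy`, so the number of browser windows stays bounded by `num_workers` per search engine.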