leadscloud opened this issue 9 years ago
Very good. Implemented. :)
I especially like the % 200; it makes things faster!
I fixed the threading issue

```
sqlalchemy.exc.InvalidRequestError: This Session's transaction has been rolled back by a nested rollback() call. To begin a new transaction, issue Session.rollback() first.
```

with

```python
engine = create_engine('sqlite:///' + db_path, echo=echo, connect_args={'check_same_thread': False})
```

in database.py.
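For reference, `check_same_thread=False` only lifts SQLite's same-thread check; the nested-rollback error often comes from several threads sharing one Session object. Below is a minimal sketch of keeping sessions thread-local with `scoped_session` (names like `db_path` and `worker` are assumptions, not the project's actual code):

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

db_path = 'google_scraper.db'  # hypothetical database file
engine = create_engine(
    'sqlite:///' + db_path,
    echo=False,
    connect_args={'check_same_thread': False},  # allow use from worker threads
)

# scoped_session hands every thread its own Session, so two threads never
# commit or roll back the same transaction at the same time.
Session = scoped_session(sessionmaker(bind=engine))

def worker():
    session = Session()  # thread-local session
    try:
        # add objects to the session here, then commit
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        Session.remove()  # release the thread-local session
```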
Your idea is bad:

```python
try:
    self.session.add(serp)
    self.session.commit()
except:
    return False
```

because we will NOT save results (the bare except will always swallow the error).

It works now (at least for sqlite3)!
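If we still want to guard the commit, a narrower variant could look like the sketch below (illustrative only; `store_serp` and the surrounding attributes are assumed names): catch only database errors, roll the session back so it stays usable, and log why the SERP was not saved instead of failing silently.

```python
import logging
from sqlalchemy.exc import SQLAlchemyError

logger = logging.getLogger(__name__)

def store_serp(self, serp):
    """Hypothetical method sketch: persist one SERP and report success."""
    try:
        self.session.add(serp)
        self.session.commit()
        return True
    except SQLAlchemyError as err:
        # rollback() is required before the session can be used again,
        # which is exactly what the InvalidRequestError above complains about.
        self.session.rollback()
        logger.error('Could not save SERP: {}'.format(err))
        return False
```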
scraping.py:

```python
except self.requests.ConnectionError as ce:
    logger.error('Network problem occurred {}'.format(ce))
    raise ce
except self.requests.Timeout as te:
    logger.error('Connection timeout {}'.format(te))
    raise te
```

The raise causes the program to stop. Use return instead:

```python
except self.requests.ConnectionError as ce:
    logger.error('Network problem occurred {}'.format(ce))
    return False
except self.requests.Timeout as te:
    logger.error('Connection timeout {}'.format(te))
    return False
```

If we always raise the exception, it often stops the whole program.
Fixed.
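Another option (just a sketch with assumed names, not the project's code) would be to retry transient network errors a couple of times before giving up, so a short outage costs a retry instead of the keyword:

```python
import logging
import time

import requests

logger = logging.getLogger(__name__)

def get_with_retries(url, params=None, tries=3, backoff=5):
    """Retry ConnectionError/Timeout a few times; return None on failure
    so the caller can mark the keyword as missed and keep scraping."""
    for attempt in range(1, tries + 1):
        try:
            return requests.get(url, params=params, timeout=10)
        except (requests.ConnectionError, requests.Timeout) as err:
            logger.error('Attempt {}/{} failed: {}'.format(attempt, tries, err))
            if attempt < tries:
                time.sleep(backoff)
    return None
```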
I thought about a scraping policy. What do you think?
"""
GoogleScraper should be as robust as possible.
There are several conditions that may stop the scraping process.
- All proxies are detected and we cannot request further keywords => stop.
- No internet connection => stop.
- If the proxy is detected by the search engine, we try to get another proxy from the pool and we call switch_proxy() => continue.
- If the proxy is detected by the search engine and there is no other proxy in the pool, we wait {search_engine}_proxy_detected_timeout seconds => continue.
- If the proxy is detected again after the waiting time, we discard the proxy for the whole scrape.
"""
That's good.
I met a new problem. StopScrapingException is not a good solution.

If a thread scrapes 5000 keywords and the proxy is not stable, it may be unusable only for a period in the middle. Under the current rules, the thread will stop anyway.

We could change it to continue:
```python
def blocking_search(self, callback, *args, **kwargs):
    """Similar transports have the same search loop layout.

    The SelScrape and HttpScrape classes have the same search loops. Just
    the transport mechanism is quite different (In HttpScrape class we replace
    the browsers functionality with our own for example).

    Args:
        callback: A callable with the search functionality.
        args: Arguments for the callback
        kwargs: Keyword arguments for the callback.
    """
    for i, self.current_keyword in enumerate(self.keywords):
        self.current_page = self.start_page_pos
        for self.current_page in range(1, self.num_pages_per_keyword + 1):
            # set the actual search code in the derived class
            try:
                if not callback(*args, **kwargs):
                    self.missed_keywords.add(self.current_keyword)
            except StopScrapingException as e:
                # Leave search when search engines detected us
                # add the rest of the keywords as missed one
                logger.critical(e)
                self.missed_keywords.add(self.keywords[i])
                continue
```
It's another case which I haven't programmed yet. We are talking about the stability of proxies right now. You say: if one proxy has already processed 5000 requests and it suddenly stops, it's very likely a temporary issue and it will continue to work, so there is no need to stop scraping.

This is correct. But the more common case is that the proxy works in the beginning (thus passing proxy_check()) and then stops working completely. So we need to keep track of the proxy behaviour in attributes of the Proxy class in database.py and then react accordingly (for example, see the sketch below).

It's very complex to program a good strategy. It needs time. I will need a good base strategy that the user can edit in the configuration.
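For the bookkeeping part, the Proxy model in database.py could carry a few counters, roughly like this (a sketch only; the columns `requests_ok`, `requests_failed` and `discarded` are invented for illustration and may not match the real model):

```python
from sqlalchemy import Boolean, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Proxy(Base):
    __tablename__ = 'proxy'

    id = Column(Integer, primary_key=True)
    ip = Column(String)
    port = Column(Integer)

    # Hypothetical bookkeeping for the base strategy:
    requests_ok = Column(Integer, default=0)      # successful requests through this proxy
    requests_failed = Column(Integer, default=0)  # timeouts, blocks, detections
    discarded = Column(Boolean, default=False)    # dropped for the rest of the scrape

    def failure_rate(self):
        total = self.requests_ok + self.requests_failed
        return self.requests_failed / total if total else 0.0
```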
My modifications about when to store the serp:

scraping.py: if the result page has no serp, do not store it.

caching.py: changed ... to ...
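A small guard along those lines could sit in front of the store call (a sketch with assumed names; `maybe_store_serp` is hypothetical and `serp.links` in particular may be called differently in the real parser):

```python
import logging

logger = logging.getLogger(__name__)

def maybe_store_serp(self, serp):
    """Only persist pages that actually contain parsed results."""
    if serp is None or not serp.links:
        logger.info('No results for "{}", not storing this page.'.format(
            self.current_keyword))
        return False
    self.session.add(serp)
    self.session.commit()
    return True
```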