NikolaiT / GoogleScraper

A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.
https://scrapeulous.com/
Apache License 2.0

randomly unable to load links from serp #81

Open elliotenglish opened 9 years ago

elliotenglish commented 9 years ago

I'm having an issue using GoogleScraper to repeatedly pull images from Google. I call scrape_with_config() with the config below several times within the same script, and accessing the results of a search fails about 25% of the time. It looks as if the SQLAlchemy SQLite connection dies. I've seen this with both the pip release and the latest version of the code.

code/error:


search = GoogleScraper.scrape_with_config(config)
for serp in search.serps:
    for link in serp.links:

...

File "scrape.py", line 61, in <module>
    for link in serp.links:

...

(orm_util.state_str(state), self.key)
sqlalchemy.orm.exc.DetachedInstanceError: Parent instance <SearchEngineResultsPage at 0x7fc98973bcf8> is not bound to a Session; lazy load operation of attribute 'links' cannot proceed

config:

            config = {
                'SCRAPING': {
                    'keyword': keyword,
                    'search_engines': 'google',
                    'num_pages_for_keyword': 10,
                    'search_type': 'image',
                    'scrape_method': 'selenium',
                },
                'SELENIUM': {
                    'sel_browser': 'Firefox',
                },
                'GLOBAL': {
                    'do_caching': 'False'
                }
            }

pierrekin commented 9 years ago

I am also experiencing this issue.

Update: see next comment.

It happened once (same error message); then I removed .scrapecache and google_scraper.db, and the issue went away.
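
For anyone who wants to try the same reset between runs, here is a minimal sketch. It assumes that .scrapecache and google_scraper.db sit in the directory the script is run from, and that .scrapecache may be either a directory or a file. clear_scraper_state is a hypothetical helper name, not part of GoogleScraper.

```python
import shutil
from pathlib import Path


def clear_scraper_state(base_dir="."):
    """Delete leftover cache/database files and return the names that were removed.

    Assumption: both paths live directly under base_dir.
    """
    base = Path(base_dir)
    removed = []
    for name in (".scrapecache", "google_scraper.db"):
        target = base / name
        if target.is_dir():
            # Cache stored as a directory tree
            shutil.rmtree(target)
            removed.append(name)
        elif target.exists():
            # Plain file, e.g. the SQLite database
            target.unlink()
            removed.append(name)
    return removed
```

Calling clear_scraper_state() before each scrape_with_config() call would start every run from a clean state, at the cost of losing any cached results.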

pierrekin commented 9 years ago

It seems this issue may actually be caused when the search tries to page beyond the last page of search results.
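
If that is the cause, the proper fix belongs in GoogleScraper itself, but a script can at least avoid crashing by skipping the affected result pages. Below is a minimal sketch of such a workaround, not a fix. collect_links is a hypothetical helper, and the only assumption about GoogleScraper is what the traceback above shows: reading serp.links can raise sqlalchemy.orm.exc.DetachedInstanceError. The ImportError fallback is only there so the sketch runs without SQLAlchemy installed.

```python
try:
    from sqlalchemy.orm.exc import DetachedInstanceError
except ImportError:
    # Stand-in so the sketch can be run without SQLAlchemy installed
    class DetachedInstanceError(Exception):
        pass


def collect_links(serps):
    """Gather links from every SERP, skipping those whose links cannot be loaded.

    Returns (links, skipped), where skipped counts the SERPs that raised
    DetachedInstanceError when their links attribute was read.
    """
    links = []
    skipped = 0
    for serp in serps:
        try:
            # The lazy load happens here; this is the line that fails in the traceback
            links.extend(serp.links)
        except DetachedInstanceError:
            skipped += 1
    return links, skipped
```

With the script from the first comment this would be called as links, skipped = collect_links(search.serps). A non-zero skipped count on the last pages of a keyword would also be consistent with the paging explanation.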