NikolaiT / GoogleScraper

A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.
https://scrapeulous.com/
Apache License 2.0
2.63k stars 736 forks

When the cache holds more than 10,000 files, parsing SERPs from the cache is very slow #79

Open leadscloud opened 9 years ago

leadscloud commented 9 years ago

When the cache holds more than 10,000 files, parsing SERPs from the cache is very slow.

For 12,000 cached files, it takes roughly 2 hours.

```python
for path in files:
    print('processing {num_cached} cached files...'.format(num_cached=num_cached), end='\r')
    # strip off the extension of the path if it has any
    fname = os.path.split(path)[1]
    clean_filename = fname
    for ext in ALLOWED_COMPRESSION_ALGORITHMS:
        if fname.endswith(ext):
            clean_filename = fname.rstrip('.' + ext)

    job = mapping.get(clean_filename, None)

    if job:
        # We found a file that contains the keyword, search engine name and
        # scrape method that fits our description. Let's see if there is already
        # a record in the database and link it to our new ScraperSearch object.
        if Config['SCRAPING'].get('keyword_file'):
            serp = get_serp_from_database(session, job['query'], job['search_engine'], job['scrape_method'], job['page_number'])
        else:
            serp = None
        if not serp:
            serp = parse_again(fname, job['search_engine'], job['scrape_method'], job['query'])
        serp.scraper_searches.append(scraper_search)
        session.add(serp)

        if num_cached % 200 == 0:
            session.commit()

        store_serp_result(serp)
        num_cached += 1
        if job in scrape_jobs:
            scrape_jobs.remove(job)
```
NikolaiT commented 9 years ago

Yes, this is very true. SQLAlchemy is probably the bottleneck here. What do you suggest?

(Currently I am writing 6 exams at university, which is why I cannot work much on GoogleScraper right now. In one week I am done with the exams and can continue here :) )

leadscloud commented 9 years ago

Store the keywords in the database.
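One way to read this suggestion: the loop above issues one `get_serp_from_database` round-trip per cached file, which dominates the 2-hour runtime. Loading every stored SERP once and indexing it in memory turns each lookup into a dict access. A minimal sketch using the stdlib `sqlite3` module; the table and column names are illustrative, not GoogleScraper's actual schema:

```python
import sqlite3

def preload_serp_index(conn):
    """Fetch all stored SERPs in ONE query, keyed by the job identity tuple."""
    cur = conn.execute(
        "SELECT query, search_engine, scrape_method, page_number, id FROM serp"
    )
    return {(q, e, m, p): rowid for q, e, m, p, rowid in cur}

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE serp (id INTEGER PRIMARY KEY, query TEXT,"
        " search_engine TEXT, scrape_method TEXT, page_number INTEGER)"
    )
    conn.executemany(
        "INSERT INTO serp (query, search_engine, scrape_method, page_number)"
        " VALUES (?, ?, ?, ?)",
        [("foo", "google", "http", 1), ("bar", "bing", "http", 1)],
    )
    index = preload_serp_index(conn)
    # an O(1) dict lookup replaces a per-file SELECT round-trip
    return index.get(("foo", "google", "http", 1))
```

With such an index built once before the loop, `index.get((job['query'], job['search_engine'], job['scrape_method'], job['page_number']))` replaces the per-file database query.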