MatthewChatham / glassdoor-review-scraper

Scrape reviews from Glassdoor
BSD 2-Clause "Simplified" License
177 stars 252 forks

pagingControls Error #14

Open wangrunzu opened 5 years ago

wangrunzu commented 5 years ago

I got the following error about the paging control when I try to scrape the data:

python.exe main.py --headless --url "https://www.glassdoor.com/Reviews/Walmart-Reviews-E715.htm" --limit 100 -f test.csv

2019-05-31 15:06:49,643 INFO 377 :main.py(17796) - Configuring browser

DevTools listening on ws://127.0.0.1:50831/devtools/browser/8c7890e8-fe24-41f7-b77f-d22dae3f6c3e
2019-05-31 15:06:51,700 INFO 419 :main.py(17796) - Scraping up to 100 reviews.
2019-05-31 15:06:51,717 INFO 358 :main.py(17796) - Signing in to **@ou.edu
2019-05-31 15:06:55,478 INFO 339 :main.py(17796) - Navigating to company reviews
2019-05-31 15:07:08,137 INFO 286 :main.py(17796) - Extracting reviews from page 1
2019-05-31 15:07:08,200 INFO 291 :main.py(17796) - Found 10 reviews on page 1
2019-05-31 15:07:08,677 INFO 297 :main.py(17796) - Scraped data for "The Best in Retail"(Thu May 30 2019 20:24:44 GMT-0500 (Central Daylight Time))
2019-05-31 15:07:09,171 INFO 297 :main.py(17796) - Scraped data for "Walmart needs to bring worker dignity back into focus"(Wed May 29 2019 18:04:43 GMT-0500 (Central Daylight Time))
2019-05-31 15:07:09,673 INFO 297 :main.py(17796) - Scraped data for "Great for college students"(Thu May 30 2019 12:25:57 GMT-0500 (Central Daylight Time))
2019-05-31 15:07:10,042 INFO 297 :main.py(17796) - Scraped data for "Retail"(Thu May 30 2019 17:09:02 GMT-0500 (Central Daylight Time))
2019-05-31 15:07:10,497 INFO 297 :main.py(17796) - Scraped data for "walmart"(Mon May 27 2019 17:17:41 GMT-0500 (Central Daylight Time))
2019-05-31 15:07:10,966 INFO 297 :main.py(17796) - Scraped data for "Maintenance is well taken care of"(Tue May 28 2019 08:32:17 GMT-0500 (Central Daylight Time))
2019-05-31 15:07:11,437 INFO 297 :main.py(17796) - Scraped data for "It was the best job that I had to be honest"(Wed May 29 2019 20:29:39 GMT-0500 (Central Daylight Time))
2019-05-31 15:07:11,896 INFO 297 :main.py(17796) - Scraped data for "Great"(Wed May 29 2019 20:36:02 GMT-0500 (Central Daylight Time))
2019-05-31 15:07:12,281 INFO 297 :main.py(17796) - Scraped data for "floater pharmacist"(Wed May 29 2019 21:10:58 GMT-0500 (Central Daylight Time))
2019-05-31 15:07:12,708 INFO 297 :main.py(17796) - Scraped data for "cashier"(Wed May 29 2019 23:11:49 GMT-0500 (Central Daylight Time))
Traceback (most recent call last):
  File "main.py", line 461, in <module>
    main()
  File "main.py", line 446, in main
    while more_pages() and\
  File "main.py", line 314, in more_pages
    paging_control = browser.find_element_by_class_name('pagingControls')
  File "C:\Users\wang0040\AppData\Local\Continuum\miniconda3\envs\Default\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 564, in find_element_by_class_name
    return self.find_element(by=By.CLASS_NAME, value=name)
  File "C:\Users\wang0040\AppData\Local\Continuum\miniconda3\envs\Default\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 978, in find_element
    'value': value})['value']
  File "C:\Users\wang0040\AppData\Local\Continuum\miniconda3\envs\Default\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "C:\Users\wang0040\AppData\Local\Continuum\miniconda3\envs\Default\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"class name","selector":"pagingControls"}
  (Session info: headless chrome=74.0.3729.169)
  (Driver info: chromedriver=74.0.3729.6 (255758eccf3d244491b8a1317aa76e1ce10d57e9-refs/branch-heads/3729@{#29}),platform=Windows NT 6.1.7601 SP1 x86_64)

I also got the No Such Element Exception from issue #8, but overcame it by commenting out the scrape_years part. I don't think that change caused the issue above, but I'm not sure.
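In case it helps others hitting this: the crash comes from the bare find_element_by_class_name('pagingControls') lookup in more_pages(). Below is a minimal sketch of a defensive variant that treats a missing paging control as "no more pages" instead of crashing. This guard is a suggested workaround, not the repo's code, and it assumes browser is the module-level webdriver in main.py:

```python
import selenium.common.exceptions

def more_pages():
    # Workaround sketch: if the paging control can't be found (renamed class,
    # page not fully loaded, etc.), stop paging instead of raising.
    try:
        paging_control = browser.find_element_by_class_name('pagingControls')
    except selenium.common.exceptions.NoSuchElementException:
        return False
    next_ = paging_control.find_element_by_class_name('next')
    try:
        next_.find_element_by_tag_name('a')
        return True
    except selenium.common.exceptions.NoSuchElementException:
        return False
```

Note this only prevents the crash; if Glassdoor actually renamed the class, the scrape will silently stop after page 1, so the selector itself still needs updating.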

jhatamyar commented 5 years ago

I am suddenly getting the exact same exception using chromedriver 73.0.3683.6 on Mac OS X 10.13.6. The code was working perfectly a few weeks ago. I am looking into get_current_page(), as I'm curious whether the find_elements lookups by class name or XPath might be the problem, but I am a total beginner with Selenium. Hoping the author can help.
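If it helps with debugging: you can check from a Python REPL whether either locator strategy still matches anything on the reviews page; zero matches means Glassdoor changed the markup. A quick sketch (it assumes browser is already signed in and sitting on a reviews page, and the XPath fragment is only an illustrative guess, not a known-good selector):

```python
# Count matches for each locator strategy; 0 means the markup changed.
print(len(browser.find_elements_by_class_name('pagingControls')))
print(len(browser.find_elements_by_xpath("//div[contains(@class, 'pagination')]")))
```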

MatthewChatham commented 5 years ago

Thanks folks, I may have time to look at this in the coming week. But if you're able to figure it out and make a PR to fix, I'll merge it!

guoruijiao commented 5 years ago

I'm seeing the exact same error as above. It would be great if this could be resolved.

heraldnithesh commented 5 years ago

Hi, is this resolved?

batordavid commented 5 years ago

Replacing a few lines of code helped me.

Original (3 places in the code):
paging_control = browser.find_element_by_class_name('pagingControls')
Updated:
paging_control = browser.find_element_by_css_selector('.eiReviews__EIReviewsPageContainerStyles__pagination.noTabover.mt')

Original (2 places in the code):
next_ = paging_control.find_element_by_class_name('next')
Updated:
next_ = paging_control.find_element_by_class_name('pagination__PaginationStyle__next')
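These obfuscated class names change whenever Glassdoor redeploys its frontend, so exact-match selectors keep breaking. A possibly more durable approach is CSS substring matching on the stable-looking parts of the generated names; a sketch, where the fragments are guesses based on the class names quoted above rather than verified selectors:

```python
# Sketch: match on substrings of the generated class names instead of the
# full obfuscated name, so minor renames don't break the lookup.
paging_control = browser.find_element_by_css_selector(
    "[class*='PageContainerStyles__pagination']")
next_ = paging_control.find_element_by_css_selector("[class*='__next']")
```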

tsp2123 commented 5 years ago

Hey, so does anyone else have an issue where they fix the paging_control selectors but the scraper breaks later on? I'm trying to scrape around 30k reviews, and the code keeps breaking for me at around page 176. I used the following for the paging control:

```python
def more_pages():
    paging_control = browser.find_element_by_css_selector(
        '.eiReviews__EIReviewsPageContainerStyles__pagination.noTabover.mt')
    next_ = paging_control.find_element_by_class_name(
        'pagination__PaginationStyle__next')
    try:
        next_.find_element_by_tag_name('a')
        return True
    except selenium.common.exceptions.NoSuchElementException:
        return False


def go_to_next_page():
    logger.info(f'Going to page {page[0] + 1}')
    paging_control = browser.find_element_by_class_name(
        'pagination__PaginationStyle__pagination')
    next_ = paging_control.find_element_by_class_name(
        'pagination__PaginationStyle__next').find_element_by_tag_name('a')
    browser.get(next_.get_attribute('href'))
    time.sleep(1)
    page[0] = page[0] + 1
```

I'm messing around with both to see what works, but my code keeps breaking not even a quarter of the way through the scraping. Does anyone have a workaround?
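A failure that deep into a run may be a slow page load or throttling rather than a missing element, so one thing worth trying is retrying the lookup with a delay before giving up. A rough sketch (the retry count and delay are arbitrary, browser is assumed to be the module-level webdriver, and the selector is the one quoted above):

```python
import time
import selenium.common.exceptions

def find_paging_control_with_retry(attempts=3, delay=5):
    # Retry the paging-control lookup a few times before giving up, in case
    # the page is still rendering or Glassdoor is throttling requests.
    for attempt in range(attempts):
        try:
            return browser.find_element_by_css_selector(
                '.eiReviews__EIReviewsPageContainerStyles__pagination.noTabover.mt')
        except selenium.common.exceptions.NoSuchElementException:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
```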

carlotorniai commented 4 years ago

Hi all, I've tried both suggestions and the code still breaks. Any clue? Traceback below:

Traceback (most recent call last):
  File "main.py", line 483, in <module>
    main()
  File "main.py", line 468, in main
    while more_pages() and\
  File "main.py", line 315, in more_pages
    paging_control = browser.find_element_by_css_selector('.eiReviews__EIReviewsPageContainerStyles__pagination.noTabover.mt')
  File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 598, in find_element_by_css_selector
    return self.find_element(by=By.CSS_SELECTOR, value=css_selector)
  File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 978, in find_element
    'value': value})['value']
  File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":".eiReviews__EIReviewsPageContainerStyles__pagination.noTabover.mt"}
  (Session info: headless chrome=79.0.3945.130)

carlotorniai commented 4 years ago

Getting the latest code from MuhammadMehran's pull request fixed the issue.

EdiLacic123 commented 4 years ago

@carlotorniai Could you post the code by any chance? I have been trying to fix the same issue as well. Thanks

carlotorniai commented 4 years ago

@EdiLacic123 just grab main.py, test.py and schema.py from this pull request: https://github.com/MatthewChatham/glassdoor-review-scraper/pull/37/files
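If you'd rather not copy files by hand, GitHub exposes every pull request under a pull/<id>/head ref, so you can fetch the whole branch locally with git fetch origin pull/37/head:pr-37 followed by git checkout pr-37.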