elainespak / glassdoor_aspect_based_sentiment_analysis

ABSA with Glassdoor company reviews
8 stars · 0 forks

Webscrape keeps rolling in empty pages. #7

Closed thinkingoutloud55 closed 3 years ago

thinkingoutloud55 commented 3 years ago

Hi Elaine,

Happy New Year, and thanks for sharing your code publicly. I was using your code to scrape Glassdoor reviews. From my understanding, the input file is a list of company names, and that part works fine. However, I couldn't figure out why, during scraping, it keeps loading empty pages even after it has reached the last page, instead of moving on to the next company.

I am quite new to this, and I find most web-scraping tutorials do it as if a person were browsing through a web page (from top to bottom, row by row), which is a bit different from your method. Is yours using an API? Your method is a lot faster to run, if I can make it work.

Looking forward to your reply. Thank you very much.

Millie

elainespak commented 3 years ago

Hi Millie,

Happy New Year! Thanks for letting me know about the issue. I just fixed the webscraping code files to resolve the error that you pointed out (please refer to the latest commit).

I was using your code to scrape Glassdoor reviews; from my understanding, the input file is a list of company names

This is correct! I also changed the input part so that the user can manually define the list of company names (e.g., names = ['Apple Inc', 'Facebook Inc']).

but I couldn't figure out why, during scraping, it keeps loading empty pages even though it's reached the last page and wouldn't move on to the next company

The earlier code caused this problem because of changes in Glassdoor's HTML structure. Line 52 of the webscrape.py file acts as a condition that forces the code to "stop" once it reaches an empty review page. The old condition no longer worked properly after the changes on Glassdoor's end, so I added a new condition.
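To illustrate the idea behind that stop condition, here is a minimal sketch (not the repo's actual code): scan each page's source for a review marker and stop at the first page that has none. The class name `empReview` and the page structure are assumptions for illustration only.

```python
import re

# Assumed marker for a review container in the page source; the real
# selector in webscrape.py may differ and can change on Glassdoor's end.
REVIEW_MARKER = re.compile(r'class="empReview"')

def page_is_empty(html_text):
    """True when the page source contains no review containers."""
    return REVIEW_MARKER.search(html_text) is None

def collect_reviews(pages):
    """Walk a company's review pages and stop at the first empty one."""
    collected = []
    for html_text in pages:
        if page_is_empty(html_text):
            break  # last real page passed; move on to the next company
        collected.append(html_text)
    return collected
```

Without a check like this, the loop keeps requesting page numbers past the end and gets empty pages back, which is exactly the symptom described above.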

Is yours using an API?

Nope. I think the tutorials you're referring to involve Selenium or other browser-automation tools. My method simply loads the page's source text by requesting the webpage URL directly, which is why you don't see any "browsing" or "scrolling" actions. It's also why it's faster.
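In case it helps to see the difference concretely, the "no browser" approach boils down to something like the sketch below, using only the standard library. The User-Agent header value is an assumption; sites like Glassdoor tend to reject requests that don't look like they come from a browser.

```python
from urllib.request import Request, urlopen

def build_request(url):
    """Prepare a GET request with a browser-like User-Agent header."""
    # The exact header value is illustrative, not a requirement.
    return Request(url, headers={"User-Agent": "Mozilla/5.0"})

def fetch_page_source(url, timeout=10.0):
    """Download the raw HTML of a single page as text."""
    with urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Because no browser is launched and no JavaScript is executed, each page is just one HTTP round trip, which is where the speed advantage over Selenium-style tutorials comes from.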

Hope this was helpful!

thinkingoutloud55 commented 3 years ago

Hi Elaine,

Thank you very much for fixing the errors and for the detailed explanation. You are amazing ;D

By loading the source text, does it only retrieve the contents set by default? I am asking because, apart from the English reviews, I want to collect non-English ones as well. Are they considered some sort of hidden content, and do you reckon I should make an AJAX request for these? Glassdoor displays only English reviews by default, and your code works perfectly on that. Some suggestions would be helpful, and I'll try to work out the rest.

I will make sure to cite your code properly when I finish my project. Thank you.

Millie

elainespak commented 3 years ago

Hi Millie,

When I first created the code 1.5 years ago, it was able to collect all reviews regardless of language. But when I checked today, it seems like Glassdoor changed its interface so that a user can filter the reviews by language: [screenshot of Glassdoor's review-language filter]

Since my code only retrieves the contents set by default (as you said), it works only for English reviews. I'm not planning to add a feature for non-English reviews, but here's a suggestion:

Let's say you want to collect reviews from Facebook.

As you can see, the difference lies in the last part of the URL: '?filter.iso3Language=OOO&filter.employmentStatus=REGULAR&filter.employmentStatus=PART_TIME', where OOO is a placeholder for a language code: fra for French, deu for German, and so on. (For some reason, removing '&filter.employmentStatus=REGULAR&filter.employmentStatus=PART_TIME' does not properly filter the reviews by language, so I suggest you keep it.)

When you call the website URL, you could add that new language filter string to collect non-English reviews as well.
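The suggestion above can be sketched as a tiny helper that appends the language-filter query string to a company's reviews URL. The base URL in the usage example is made up; 'fra', 'deu', etc. are the ISO 639-3 codes Glassdoor appears to expect.

```python
# Query string observed in Glassdoor's language-filtered review URLs;
# the employmentStatus parameters seem to be needed for the filter to work.
LANG_FILTER = ("?filter.iso3Language={lang}"
               "&filter.employmentStatus=REGULAR"
               "&filter.employmentStatus=PART_TIME")

def reviews_url(base_url, lang):
    """Build a reviews URL filtered to one language (e.g. 'fra' for French)."""
    return base_url + LANG_FILTER.format(lang=lang)
```

Looping over a list like ['fra', 'deu', 'spa'] and scraping each resulting URL in turn would then collect each language's reviews separately.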

Hope this helps!

thinkingoutloud55 commented 3 years ago

Hi Elaine,

Thank you for your help. I changed the URL by adding the additional string for each language, and it worked! I really appreciate your work and kind explanations.

All the best, Millie

elainespak commented 3 years ago

Wonderful! Glad I could help. Best of luck on your project. :)