abjer / isds2020

Introduction to Social Data Science 2020 - a summer school course abjer.github.io/isds2020

Selenium and logging #44

Open peterlravn opened 4 years ago

peterlravn commented 4 years ago

We are using Selenium to live-scrape a static website over a couple of hours. Exercise 6 said that we are supposed to log our data collection process in our final exam. Are we supposed to log our data collection when using Selenium? We don't repeatedly request a website, so I'm not sure how to log our data.

jsr-p commented 4 years ago

hi @peterlravn, yes, you are supposed to log your data collection when using Selenium. The Connector class from the lecture will log each request that you make with Selenium automatically. A rule of thumb is to log each time you request a page and get some new HTML that you want to parse. How come it takes a couple of hours to scrape 1 static page that you request once? :D

jesperhauch commented 4 years ago

We've scraped 5 hours' worth of data, but the requests have not been logged. The log has picked up lots of other requests where we didn't use Selenium. Can we 'write our way out of it' in our paper or should we collect the data once again?

ninibertelsen commented 4 years ago

We have the same problem with logging: it just does not log what we are doing. Maybe we are doing something wrong? We have tried with both:

import scraping_class
logfile = 'log_exam.txt'
connector = scraping_class.Connector(logfile)

and

driver = webdriver.Chrome(executable_path="/Users/ninibertelsen/Downloads/chromedriver", service_args=["--verbose", "--log-path=exam.log"])

Can you see any mistakes, or is there something we have to do manually as well?

All the best, Nini

PS sorry to hijack this issue, but I thought it was silly to make another one about the exact same thing.

jsr-p commented 4 years ago

hi everyone, it is important that you use the get method of the Connector class and not the get method of the webdriver.Chrome object. Consider the Connector class from the lectures. When you use Selenium and call connector.get(), the following method is used: [image]

That method also calls the get method of the webdriver.Chrome object, in the line self.browser.get(url) # use selenium get method. The difference is that the lines after it write the information to the log file. If you only use connector.browser.get(), nothing will be written to the log file.
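To make the difference concrete, here is a minimal sketch of the idea. It is not the course's scraping_class.Connector: the names LoggingConnector, FakeBrowser and count_log_rows, the log columns, and the example URLs are all made up for illustration. FakeBrowser stands in for webdriver.Chrome (which also has a get(url) method and a page_source attribute), so the sketch runs without a browser installed.

```python
import csv
import os
import tempfile
import time


class FakeBrowser:
    """Hypothetical stand-in for selenium's webdriver.Chrome."""

    def __init__(self):
        self.page_source = ""

    def get(self, url):
        # A real driver would load the page; here we only build placeholder HTML.
        self.page_source = f"<html><body>{url}</body></html>"


class LoggingConnector:
    """Hypothetical wrapper: every call to get() appends one row to a log file."""

    def __init__(self, logfile, browser):
        self.logfile = logfile
        self.browser = browser
        # Write a header row once, when the log file does not exist yet.
        if not os.path.isfile(logfile):
            with open(logfile, "w", newline="") as f:
                csv.writer(f, delimiter=";").writerow(
                    ["timestamp", "project", "url", "response_size", "seconds"]
                )

    def get(self, url, project_name):
        start = time.time()
        self.browser.get(url)  # the same call you would otherwise make directly
        elapsed = time.time() - start
        html = self.browser.page_source
        # This logging step is what gets skipped when browser.get() is called directly.
        with open(self.logfile, "a", newline="") as f:
            csv.writer(f, delimiter=";").writerow(
                [int(start), project_name, url, len(html), round(elapsed, 4)]
            )
        return html


def count_log_rows(logfile):
    """Number of logged requests, not counting the header row."""
    with open(logfile, newline="") as f:
        return sum(1 for _ in csv.reader(f, delimiter=";")) - 1


logfile = os.path.join(tempfile.mkdtemp(), "log_exam.txt")
connector = LoggingConnector(logfile, FakeBrowser())

connector.get("https://example.com/a", "exam")  # goes through the wrapper: logged
connector.browser.get("https://example.com/b")  # bypasses the wrapper: not logged
logged_rows = count_log_rows(logfile)
```

Inside the class, self is simply the connector object the method was called on, so self.browser in the method body is the same object as connector.browser outside it. After the two calls above, only the first one has left a row in the log file.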

jsr-p commented 4 years ago

@jesperhauch I would scrape the data again just to practice using the Connector class in the correct way. But you could probably also just incorporate it into the limitations of your study :)

ninibertelsen commented 4 years ago

Thanks, I think the whole connector thing was very confusing, but I think I've got it now :-)

annalundsoe commented 4 years ago

Could you show an example for Selenium? What is 'self' supposed to be?