abjer / isds2020

Introduction to Social Data Science 2020 - a summer school course abjer.github.io/isds2020

Selenium won't log #49

Closed jesperhauch closed 4 years ago

jesperhauch commented 4 years ago

Hi

We're trying to log requests using Selenium through the Connector class, but the log isn't updating. We connect to a URL via a request and then use Selenium to change page and save the whole HTML for every page. The code runs fine, but the log isn't catching the Selenium connections. We did a test run with 5 iterations, which the log captured in full, but the log still contains only those entries. It doesn't seem to update on runs after the test run.

I've attached our code where we initialize the Connector with selenium: image

jsr-p commented 4 years ago

hi @jesperhauch , remember to only use the Connector's get method and not the get method of the browser object. Are you sure that you only use "connector.get()" and not "browser.get()"?

jsr-p commented 4 years ago

see #44

jesperhauch commented 4 years ago

Thanks for the quick response. The only time we use connector.get() is when we connect to the website. The rest of the time we use functions pointing to the browser (accept cookies, change page, etc). EDIT: We got the logging to work now by using connector.browser. I tried making a test-run where the page is turned 10 times but the log is only catching the connection once. I've run the test-run twice, which is why there's two entries as below. I imagined the log would have entries for every page when using Selenium? image

annalundsoe commented 4 years ago
[Screenshot 2020-08-24 at 11.35.36]

The dataframe is empty when I try to use the log, I don't know if it has something to do with the driver? When I'm using a path to chrome driver python requests a path to gecko driver (as I have provided). Do you know what might be the issue?

jsr-p commented 4 years ago

@jesperhauch, the Connector object only writes to the log-file when you use the get method. If you interact with the browser in some other way than using the get method nothing will be written to the log file.
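
One way to still get a log entry for every page, sketched under assumptions: after each browser interaction (a click, a page change), write a row yourself in the same semicolon-separated format the Connector uses. The helper name `log_browser_state` is hypothetical and not part of the course's Connector class. It only relies on the `current_url` and `page_source` attributes that a Selenium webdriver exposes, and on an already opened, writable log file. A stand-in browser object is used so the sketch runs without Selenium.

```python
import io
import time


def log_browser_state(log, call_id, project_name, browser, t_start):
    """Append one row describing the browser's current page to an open log file.

    Hypothetical helper, not part of the Connector class. Columns follow the
    Connector's header: id;project;connector_type;t;delta_t;url;redirect_url;
    response_size;response_code;success;error
    """
    project_name = project_name.replace(';', '-')  # keep the separator out of the field
    current_url = browser.current_url
    row = [
        call_id,
        project_name,
        'selenium',
        t_start,
        time.time() - t_start,      # time spent on the interaction
        current_url,                # after a click there is no separate requested url
        current_url,
        len(browser.page_source),   # may be too small if the page is still loading
        '',                         # selenium exposes no status code
        '',                         # success left blank, as in the selenium branch of get
        '',                         # no error message
    ]
    log.write('\n' + ';'.join(map(str, row)))
    log.flush()
    return call_id + 1  # next free id


class FakeBrowser:
    """Stand-in for a webdriver so the sketch runs without Selenium installed."""
    current_url = 'https://example.com/page/2'
    page_source = '<html><body>page 2</body></html>'


demo_log = io.StringIO()
next_id = log_browser_state(demo_log, 7, 'main;collection', FakeBrowser(), time.time())
logged_fields = demo_log.getvalue().strip().split(';')
```

With the real Connector you would presumably pass `connector.log`, `connector.id` and `connector.browser`, and store the returned value back in `connector.id` so that later `get` calls continue the numbering.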

annalundsoe commented 4 years ago

I don't follow; I have to interact with the driver to go to the page (line 3-6 in my code)? What's the alternative?

jsr-p commented 4 years ago

hi @annalundsoe, the Connector class expects that you use Firefox and not Chrome. See the highlighted line in this screenshot: [image] You can change this by replacing Firefox with Chrome in that line. Does your browser open when you run connector.get(url, 'Infomedia')?

annalundsoe commented 4 years ago
[Screenshot 2020-08-24 at 13.10.26]

I've tried substituting it with Chrome now, but now the code won't run. Before Firefox opened with the same piece of code.

annalundsoe commented 4 years ago

This assertion error still comes up: 'AssertionError: You need to insert a valid path2selenium the path to your geckodriver. You can download the latest geckodriver here: https://github.com/mozilla/geckodriver/releases'

jsr-p commented 4 years ago

@annalundsoe, if you choose to use chrome you will have to download the chromedriver. If it worked before with firefox I would just go with firefox. Try to delete all the lines with driver and only use the connector object. Does it still not log the get calls to the log file?

annalundsoe commented 4 years ago

I've already downloaded the chrome driver and it works perfectly fine for everything regarding selenium.

Won't it affect the rest of my code/scraping if I switch to Firefox?

No, I'm afraid it still just prints this error message:

[Screenshot 2020-08-24 at 13.35.58]
jsr-p commented 4 years ago

@annalundsoe, the AssertionError tells you that the filepath to the driver you specified does not exist. If Firefox works, then use that :) The most important thing is that you log your data collection to a logfile. You can also use the logfile option in Selenium directly if you want to sidestep the Connector class, see #48. If you are still having troubles you can come by SODAS, CSS 1.2.29, and we'll fix it :)
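
A quick way to check a driver path before handing it to the Connector, as a minimal sketch: the Connector's assertion is based on `os.path.isfile`, so a path that is misspelled, points to a folder, or uses an unexpanded `~` will fail it. The helper name `describe_driver_path` and the file names below are hypothetical and only serve the illustration.

```python
import os
import tempfile


def describe_driver_path(path):
    """Return a short diagnosis of a driver path (hypothetical helper)."""
    full = os.path.abspath(os.path.expanduser(path))  # resolve ~ and relative paths
    if not os.path.exists(full):
        return 'missing'      # nothing at this location: typo or wrong folder
    if os.path.isdir(full):
        return 'directory'    # points to the folder, not the driver file itself
    return 'ok'               # a file exists here, so os.path.isfile would pass


# Demonstration with a temporary folder instead of a real driver download.
with tempfile.TemporaryDirectory() as folder:
    driver_file = os.path.join(folder, 'geckodriver')
    open(driver_file, 'w').close()  # empty placeholder file
    results = [
        describe_driver_path(driver_file),
        describe_driver_path(folder),
        describe_driver_path(os.path.join(folder, 'no-such-driver')),
    ]
```

If the result is anything other than `'ok'` for the path you pass as `path2selenium`, the assertion in the Connector will fail regardless of which browser you use.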

annalundsoe commented 4 years ago

the entrance to sodas is closed :)

jsr-p commented 4 years ago
import requests,os,time
def ratelimit():
    "A function that handles the rate of your calls."
    time.sleep(0.5) # sleep half a second.

class Connector():
  def __init__(self,
               logfile,
               overwrite_log=False,
               connector_type='requests',
               session=False,
               path2selenium='',
               n_tries = 5,
               timeout=30):
    """This Class implements a method for reliable connection to the internet and monitoring. 
    It handles simple errors due to connection problems, and logs a range of information for basic quality assessments

    Keyword arguments:
    logfile -- path to the logfile
    overwrite_log -- bool, defining if logfile should be cleared (rarely the case). 
    connector_type -- use the 'requests' module or 'selenium'. The logged information differs between the two, since the selenium webdriver does not return a response object from its get method, so monitoring cannot be automated in the same way.
    session -- requests.session object. For defining custom headers and proxies.
    path2selenium -- str, sets the path to the driver executable needed when using selenium (chromedriver for webdriver.Chrome, geckodriver for webdriver.Firefox).
    n_tries -- int, defines the number of retries the *get* method will try to avoid random connection errors.
    timeout -- int, seconds the get request will wait for the server to respond, again to avoid connection errors.
    """

    ## Initialization function defining parameters. 
    self.n_tries = n_tries # For avoiding trivial errors, e.g. connection errors, this defines how many times it will retry.
    self.timeout = timeout # Defining the maximum time to wait for a server to respond.
    ## n_tries and timeout are not applied if you use selenium.
    if connector_type=='selenium':
      assert path2selenium!='', "You need to specify the path to your driver (geckodriver for Firefox, chromedriver for Chrome) if you want to use Selenium"
      from selenium import webdriver 
      ## HINT: download the latest geckodriver here: https://github.com/mozilla/geckodriver/releases
      assert os.path.isfile(path2selenium),'You need to insert a valid path2selenium the path to your geckodriver. You can download the latest geckodriver here: https://github.com/mozilla/geckodriver/releases'
      self.browser = webdriver.Chrome(executable_path=path2selenium) # start the browser; webdriver.Chrome needs the path to chromedriver, not geckodriver.

    self.connector_type = connector_type # set the connector_type

    if session: # set the custom session
      self.session = session
    else:
      self.session = requests.session()
    self.logfilename = logfile # set the logfile path
    ## define header for the logfile
    header = ['id','project','connector_type','t', 'delta_t', 'url', 'redirect_url','response_size', 'response_code','success','error']
    if os.path.isfile(logfile):        
      if overwrite_log==True:
        self.log = open(logfile,'w')
        self.log.write(';'.join(header))
      else:
        self.log = open(logfile,'a')
    else:
      self.log = open(logfile,'w')
      self.log.write(';'.join(header))
    ## load log 
    with open(logfile,'r') as f: # open file

      l = f.read().split('\n') # read and split file by newlines.
      ## set id
      if len(l)<=1:
        self.id = 0
      else:
        self.id = int(l[-1].split(';')[0])+1 # parse the whole id field of the last row, not just its first character.

  def get(self,url,project_name):
    """Method for connecting reliably to the internet, with multiple tries and simple error handling, as well as default logging function.
    Input url and the project name for the log (i.e. is it part of mapping the domain, or is it the part of the final stage in the data collection).

    Keyword arguments:
    url -- str, url
    project_name -- str, Name used for analyzing the log. Use case could be the 'Mapping of domain','Meta_data_collection','main data collection'. 
    """

    project_name = project_name.replace(';','-') # make sure the default csv separator is not in the project_name.
    if self.connector_type=='requests': # Determine connector method.
      for _ in range(self.n_tries): # for loop defining number of retries with the requests method.
        ratelimit()
        t = time.time()
        try: # error handling 
          response = self.session.get(url,timeout = self.timeout) # make get call - timeout is to stop after x seconds if no answer 

          err = '' # define python error variable as empty assuming success.
          success = True # define success variable
          redirect_url = response.url # log current url, after potential redirects 
          dt = time.time() - t # define delta-time (seconds, positive) waiting for the server and downloading content.
          size = len(response.text) # define variable for size of html content of the response.
          response_code = response.status_code # log status code.
          ## log...
          call_id = self.id # get current unique identifier for the call
          self.id+=1 # increment call id
          #['id','project_name','connector_type','t', 'delta_t', 'url', 'redirect_url','response_size', 'response_code','success','error']
          row = [call_id,project_name,self.connector_type,t,dt,url,redirect_url,size,response_code,success,err] # define row to be written in the log.
          self.log.write('\n'+';'.join(map(str,row))) # write log. - new line, join each element by ;, convert each element to str 
          self.log.flush()
          return response,call_id # return response and unique identifier.

        except Exception as e: # define error condition
          err = str(e) # python error
          response_code = '' # blank response code 
          success = False # call success = False
          size = 0 # content is empty.
          redirect_url = '' # redirect url empty 
          dt = time.time() - t # define delta t (seconds, positive)

          ## log...
          call_id = self.id # define unique identifier
          self.id+=1 # increment call_id

          row = [call_id,project_name,self.connector_type,t,dt,url,redirect_url,size,response_code,success,err] # define row
          self.log.write('\n'+';'.join(map(str,row))) # write row to log.
          self.log.flush()
    else:
      t = time.time()
      ratelimit()
      self.browser.get(url) # use selenium get method
      ## log
      call_id = self.id # define unique identifier for the call. 
      self.id+=1 # increment the call_id
      err = '' # blank error message
      success = '' # success blank
      redirect_url = self.browser.current_url # redirect url.
      dt = time.time() - t # get time for get method (seconds, positive, including the rate-limit sleep) ... NOTE: not necessarily the complete load time.
      size = len(self.browser.page_source) # get size of content ... NOTE: not necessarily correct, since selenium works in the background, and could still be loading.
      response_code = '' # empty response code.
      row = [call_id,project_name,self.connector_type,t,dt,url,redirect_url,size,response_code,success,err] # define row 
      self.log.write('\n'+';'.join(map(str,row))) # write row to log file.
      self.log.flush()
    # Using selenium it will not return a response object, instead you should call the browser object of the connector.
    ## connector.browser.page_source will give you the html.
      return None,call_id
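
To inspect what ended up in the log, the file can be read back as a semicolon-separated table. The sketch below uses only the standard library; the file name `demo_log.csv` and the two hand-written rows are made up for the demonstration, while the header is the one defined in the Connector. It also shows the next call id being derived from the whole `id` field of the last row, which matters once ids reach 10.

```python
import csv
import os
import tempfile

HEADER = ['id', 'project', 'connector_type', 't', 'delta_t', 'url',
          'redirect_url', 'response_size', 'response_code', 'success', 'error']


def read_log(path):
    """Read a Connector-style log into a list of dicts, one per logged call."""
    with open(path, newline='') as f:
        # QUOTE_NONE: the log is written without quoting, so treat quotes literally.
        return list(csv.DictReader(f, delimiter=';', quoting=csv.QUOTE_NONE))


def next_call_id(rows):
    """Next free id: one more than the id of the last logged row, or 0 if empty."""
    return int(rows[-1]['id']) + 1 if rows else 0


# Demonstration on a throwaway file with two made-up rows.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'demo_log.csv')
    with open(path, 'w') as f:
        f.write(';'.join(HEADER))
        f.write('\n9;test;selenium;1598263200.0;0.8;https://example.com;https://example.com;1200;;;')
        f.write('\n10;test;selenium;1598263201.0;0.7;https://example.com/2;https://example.com/2;1300;;;')
    rows = read_log(path)
    empty_next = next_call_id([])
    following_id = next_call_id(rows)
```

If you prefer a dataframe, `pandas.read_csv(path, sep=';')` should load the same file. An empty result then means no rows were written, which is what happens when only the browser object is used and never `connector.get`.
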
annalundsoe commented 4 years ago

Hey again,

I tried to implement an exception for an alert pop up as we discussed, but the code won't move past the exception;

try:
    press_button = browser.find_element_by_xpath('/html/body/div[8]/div[2]/a')
    press_button.click()
except:
    print('NoSuchElementException')

Is the code wrong?

annalundsoe commented 4 years ago
[Screenshot 2020-08-24 at 16.16.27]
jsr-p commented 4 years ago

hi @annalundsoe, it is hard to tell from the code alone. Do you get an exception when trying to click the button or? Try to use this command to click the button:

browser.execute_script("arguments[0].click();", press_button)

instead of press_button.click().

annalundsoe commented 4 years ago

I've tried to implement your suggestion, but the kernel just keeps running, nothing happens, and there's no error message. Maybe because it's unable to locate the element (since it's not there)?

jsr-p commented 4 years ago

@annalundsoe, so it does not even print out "No alert message"? Seems weird, if it was unable to locate the element you would get the "NoSuchElementException".
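
For a pop-up that is only sometimes present, one alternative to try/except is to ask for all matching elements and click only if the list is non-empty. This is a sketch, not a diagnosis of the hanging kernel. It assumes the Selenium 3 style API used in this thread, where `find_elements_by_xpath` (plural) returns an empty list instead of raising `NoSuchElementException`; in Selenium 4 the equivalent call is `browser.find_elements(By.XPATH, xpath)`. The helper name `click_if_present` and the stand-in classes are hypothetical.

```python
def click_if_present(browser, xpath):
    """Click the first element matching xpath, if any. Returns True when clicked.

    Assumes a Selenium 3 style webdriver: find_elements_by_xpath returns an
    empty list when nothing matches, so no exception handling is needed here.
    """
    matches = browser.find_elements_by_xpath(xpath)
    if not matches:
        return False
    matches[0].click()
    return True


class FakeElement:
    """Stand-in for a web element; remembers whether it was clicked."""
    def __init__(self):
        self.clicked = False

    def click(self):
        self.clicked = True


class FakeBrowser:
    """Stand-in for a webdriver so the sketch runs without Selenium installed."""
    def __init__(self, elements_by_xpath):
        self.elements_by_xpath = elements_by_xpath

    def find_elements_by_xpath(self, xpath):
        return self.elements_by_xpath.get(xpath, [])


button = FakeElement()
browser = FakeBrowser({'//a[@id="accept"]': [button]})
clicked_existing = click_if_present(browser, '//a[@id="accept"]')
clicked_missing = click_if_present(browser, '//a[@id="not-there"]')
```

The return value makes it explicit in your own output whether the pop-up was found, which is easier to follow than a bare `except` that also hides unrelated errors.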

annalundsoe commented 4 years ago

Yes, nothing happens...

jsr-p commented 4 years ago

@annalundsoe are you at KU?

annalundsoe commented 4 years ago

Yes :)

jsr-p commented 4 years ago

@annalundsoe come to SODAS :)

Saraiah-TiPeAnCo-Wilson commented 4 years ago

Hi I have a persistent kernel error on my jupyter interface (36 hours).

I am installing some new packages and libraries. Initially, the code ran perfectly. Certain error messages indicated that updated versions of the libraries were necessary. While making alterations to the version number, the kernel crashed. I interrupted the run to amend the code.

I have already;

Could you advise how best to mitigate this issue?

Many thanks, Sarah

jsr-p commented 4 years ago

hi @Saraiah-TiPeAnCo-Wilson, have you tried to restart your computer? If that does not work, try to reinstall anaconda :) If that does not work make a new issue here on Github and we'll solve it.

annalundsoe commented 4 years ago

Hey @jsr-p, I still have some issues with the Chrome browser, can I stop by today?

jsr-p commented 4 years ago

@annalundsoe, sure :) Note that I have temporarily moved my office down to CSS 7.01.18 in the basement (aka. fængslet / the jail).

Saraiah-TiPeAnCo-Wilson commented 4 years ago

Hi,

I reinstalled Anaconda, updated the various packages so that they would collaborate, and finally re-installed Jupyter. Everything seems to be working well. Thanks for your help. Sarah
