abjer / isds2020

Introduction to Social Data Science 2020 - a summer school course abjer.github.io/isds2020

Assistance for 6.1.7 #29

Closed TomGaut closed 3 years ago

TomGaut commented 3 years ago

Hey,

I'm struggling to store the response I get from the connector.get() method as a .json file and convert it into a pandas DataFrame. The comments at the bottom of the class say that this method doesn't produce a response object when using selenium, and that I should get the HTML via connector.browser.page_source instead. I've attached my code below, which works when just using the requests.get() method; could you point me in the right direction here?

import json
import random
import pandas as pd

random_urls = random.sample(page_urls, 20) # Get sample of 20 webpages

connector = Connector('log_file_exercise_6.csv')

for random_url in random_urls:
    response,call_id = connector.get(random_url,'jobs_mapping')
    print(response)               # Make sure the response is 200
    response_dict = response.json()
    readable_file = 'dk_jobnet_sample.json'
    info_job = response_dict["JobPositionPostings"]
    with open(readable_file, 'a') as f:
        json.dump(info_job, f, indent=4)
pd.read_csv('log_file_exercise_6.csv',sep=';')

df_jobs_random = pd.read_json(readable_file)
df_jobs_random

Many thanks, Tom

jsr-p commented 3 years ago

Hi @TomGaut, when using the Connector class with selenium you should specify the parameter path2selenium with the path to your selenium driver. Note that the class assumes you are using a Firefox driver, so you will have to either download a Firefox driver (geckodriver) or change the code so that it uses a Chrome driver :)

Try out this code:

import json
import random
import pandas as pd
from bs4 import BeautifulSoup

random_urls = random.sample(links, 4) # Get sample of 4 webpages

#I have changed the Connector class to use chrome instead of firefox 
connector = Connector('log_file_selenium.csv',
                       connector_type = "selenium",
                      path2selenium = r"C:\Users\Joune\Desktop\chromedriver_win32\chromedriver.exe")

browser = connector.browser  # browser object inside the class, used to interact with the browser directly

for random_url in random_urls:
    connector.get(random_url, 'jobs_mapping')
    soup = BeautifulSoup(browser.page_source, "html.parser")
    response_dict = json.loads(soup.find("body").text)

    readable_file = 'dk_jobnet_sample.json'
    info_job = response_dict["JobPositionPostings"]
    with open(readable_file, 'w') as f:
        json.dump(info_job, f)

#open json and create dataframe 
with open(readable_file, 'r') as f:
    list_of_dicts = json.load(f)
df_sample_links = pd.DataFrame([x.values() for x in list_of_dicts],
                               columns =  list(list_of_dicts[0].keys()))
df_sample_links.tail()

where the last line shows the tail of the resulting DataFrame.
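As a side note, pandas can build the DataFrame directly from a list of dicts, so the values()/keys() reconstruction above isn't strictly needed. A minimal sketch (the two entries here are hypothetical stand-ins for what list_of_dicts might contain):

```python
import pandas as pd

# Hypothetical sample of what two entries in list_of_dicts might look like
list_of_dicts = [
    {"Title": "Data Analyst", "WorkPlaceCity": "Copenhagen"},
    {"Title": "Economist", "WorkPlaceCity": "Aarhus"},
]

# pd.DataFrame accepts a list of dicts directly and aligns columns on the keys
df_sample_links = pd.DataFrame(list_of_dicts)
```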

TomGaut commented 3 years ago

Hi @jsr-p ,

Thanks for the help! I keep getting the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-31-a17cdea6e1b1> in <module>
    143 for random_url in random_urls:
    144     connector.get(random_url,'jobs_mapping')
--> 145     soup = BeautifulSoup(browser.page_source)
    146     response_dict = soup.json()

TypeError: 'module' object is not callable

I'm wondering whether this is because a tab pops up to accept cookies and so it's not gathering the data? I've seen the solutions page now that uses just connector.get() and that works for me, so I may just be hitting my head against a brick wall here!

jsr-p commented 3 years ago

Hi @TomGaut , post your full code and we'll figure it out :)

TomGaut commented 3 years ago

Thanks! Here is the full code, besides the long Connector() script. It's identical to yours except for the path to where the driver is installed. :)

import json
import random
import pandas as pd
import bs4 as BeautifulSoup
from selenium import webdriver

page_urls = [f'https://job.jobnet.dk/CV/FindWorkOffset={i}&SortValue=CreationDate' for i in range(0, total_result_count, 20)]
random_urls = random.sample(page_urls, 20) # Get sample of 20 webpages

executable_file = r'C:\Users\thoma\anaconda3\geckodriver.exe'
connector = Connector('log_file_exercise_6.csv',
                     connector_type='selenium',
                     path2selenium = executable_file)

browser = connector.browser

for random_url in random_urls:
    connector.get(random_url,'jobs_mapping')
    soup = BeautifulSoup(browser.page_source)
    response_dict = soup.json()

    readable_file = 'dk_jobnet_sample.json'
    info_job = response_dict["JobPositionPostings"]
    with open(readable_file, 'w') as f:
        json.dump(info_job, f, indent=4)

with open(readable_file, 'r') as f:
    list_of_dicts = json.load(f)

df_sample_links = pd.DataFrame([x.values() for x in list_of_dicts],
                               columns =  list(list_of_dicts[0].keys()))
df_sample_links.tail()

jsr-p commented 3 years ago

@TomGaut , you should change the import statement

import bs4 as BeautifulSoup

to

from bs4 import BeautifulSoup

The reason is that we want to import the name BeautifulSoup from the module bs4, not import the module bs4 itself under the alias BeautifulSoup. With your version, BeautifulSoup refers to the whole module, and calling a module raises the TypeError you saw.
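To see why that alias triggers the error, the same TypeError can be reproduced with any module — here with the standard-library json module instead of bs4:

```python
# Binding a module to a name and then calling that name fails,
# because a module object is not callable.
import json as BeautifulSoup  # same mistake as `import bs4 as BeautifulSoup`

try:
    BeautifulSoup("<html></html>")
except TypeError as e:
    print(e)  # 'module' object is not callable
```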

TomGaut commented 3 years ago

Ah! I hadn't spotted that! Unfortunately I'm still left with this JSONDecodeError: Expecting value: line 2 column 1 (char 1) for the line:

response_dict = json.loads(soup.find("body").text)

I've printed the soup and it seems to be picking up all the required HTML, so maybe it's a problem with trying to parse the text as JSON?

Here's some of the soupiness: <html class="js jobnet no-touchevents fontface generatedcontent svg transform3d transform no-touch transition transitionend input-search device-desktop orientation-landscape min-x-small min-small min-medium min-large max-large only-large noScroll ng-scope" data-build="2020.2.0.397" data-ng-app="Jobnet" lang="da" style=""><head class="ng-isolate-scope" data-jn-header-manager=""><style type="text/css">.ng-animate.item:not(.left):not(.right){-webkit-transition:0s ease-in-out left;transition:0s ease-in-out left}</style><style type="text/css">...

jsr-p commented 3 years ago

@TomGaut , you need to make sure that what comes out from soup.find("body").text looks like json such that json.loads() does not throw an error. Are you scraping the correct links?
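One way to make this failure easier to diagnose is to guard the json.loads() call and print the start of the body text when it isn't valid JSON. A sketch (the HTML string below is a stand-in for what browser.page_source might contain on a consent page):

```python
import json

def parse_body(body_text):
    """Try to parse the page body as JSON; report a snippet on failure."""
    try:
        return json.loads(body_text)
    except json.JSONDecodeError:
        # Not JSON -- likely an HTML page (e.g. a cookie-consent screen)
        print("Body is not JSON, starts with:", body_text[:60])
        return None

# A JSON body parses fine...
print(parse_body('{"JobPositionPostings": []}'))
# ...while an HTML body (like a consent page) returns None
print(parse_body('<html class="js jobnet"><body>Accept cookies</body></html>'))
```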

TomGaut commented 3 years ago

It seems not, since the browser lands on a cookie-consent page and stops there. I think I would have to code something to click past that and then begin scraping from there. Thanks for the help! I think in this instance it's probably easier to go direct to the page. :)