Closed TomGaut closed 3 years ago
hi @TomGaut ,
when using the Connector class you should specify the parameter path2selenium with the path to your selenium driver. Note that the class assumes you are using a firefox driver, so you will have to either download a firefox driver or change the code so that it uses a chrome driver :)
Try out this code:
import json
import random
import pandas as pd
from bs4 import BeautifulSoup

random_urls = random.sample(links, 4)  # Get sample of 4 webpages (`links` defined earlier)

# I have changed the Connector class to use chrome instead of firefox
connector = Connector('log_file_selenium.csv',
                      connector_type="selenium",
                      path2selenium=r"C:\Users\Joune\Desktop\chromedriver_win32\chromedriver.exe")
browser = connector.browser  # browser object inside the class, to be used if you want to interact with the browser

for random_url in random_urls:
    connector.get(random_url, 'jobs_mapping')
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    response_dict = json.loads(soup.find("body").text)

readable_file = 'dk_jobnet_sample.json'
info_job = response_dict["JobPositionPostings"]
with open(readable_file, 'w') as f:
    json.dump(info_job, f)

# open json and create dataframe
with open(readable_file, 'r') as f:
    list_of_dicts = json.load(f)
df_sample_links = pd.DataFrame([x.values() for x in list_of_dicts],
                               columns=list(list_of_dicts[0].keys()))
df_sample_links.tail()
where the last line gives the tail of the resulting DataFrame.
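As a side note, pandas can also build the DataFrame directly from a list of dicts, which avoids the manual values()/keys() step in the last lines above; a minimal sketch (the field names here are illustrative, not the real jobnet schema):

```python
import pandas as pd

# A small hand-made sample mimicking the "JobPositionPostings" records
list_of_dicts = [
    {"Title": "Data Analyst", "City": "Copenhagen"},
    {"Title": "Economist", "City": "Aarhus"},
]

# pd.DataFrame accepts a list of dicts directly and infers the columns
df = pd.DataFrame(list_of_dicts)
print(df.columns.tolist())  # ['Title', 'City']
```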
Hi @jsr-p ,
Thanks for the help! I keep getting the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-31-a17cdea6e1b1> in <module>
143 for random_url in random_urls:
144 connector.get(random_url,'jobs_mapping')
--> 145 soup = BeautifulSoup(browser.page_source)
146 response_dict = soup.json()
TypeError: 'module' object is not callable
I'm wondering whether this is because a tab pops up to accept cookies and so it's not gathering the data? I've seen the solutions page now that uses just connector.get() and that works for me, so I may just be hitting my head against a brick wall here!
Hi @TomGaut , post your full code and we'll figure it out :)
Thanks! Here is the full code besides the long Connector() script. It's identical to yours except for the path to where the driver is installed. :)
import random
import pandas as pd
import bs4 as BeautifulSoup
from selenium import webdriver

page_urls = [f'https://job.jobnet.dk/CV/FindWorkOffset={i}&SortValue=CreationDate' for i in range(0, total_result_count, 20)]
random_urls = random.sample(page_urls, 20)  # Get sample of 20 webpages

executable_file = r'C:\Users\thoma\anaconda3\geckodriver.exe'
connector = Connector('log_file_exercise_6.csv',
                      connector_type='selenium',
                      path2selenium=executable_file)
browser = connector.browser

for random_url in random_urls:
    connector.get(random_url, 'jobs_mapping')
    soup = BeautifulSoup(browser.page_source)
    response_dict = soup.json()

readable_file = 'dk_jobnet_sample.json'
info_job = response_dict["JobPositionPostings"]
with open(readable_file, 'w') as f:
    json.dump(info_job, f, indent=4)

with open(readable_file, 'r') as f:
    list_of_dicts = json.load(f)
df_sample_links = pd.DataFrame([x.values() for x in list_of_dicts],
                               columns=list(list_of_dicts[0].keys()))
df_sample_links.tail()
@TomGaut , you should change the import statement
import bs4 as BeautifulSoup
to
from bs4 import BeautifulSoup
The reason for this is that we want to import the BeautifulSoup class from the module bs4, and not import the module bs4 under the name BeautifulSoup.
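To see why the original import raises TypeError: 'module' object is not callable, here is the same mistake reproduced with a standard-library module (json stands in for bs4, so no extra installs are needed):

```python
import json as loads  # oops: binds the *module* json to the name loads

try:
    loads('{"a": 1}')  # calling a module object raises TypeError
except TypeError:
    print("TypeError: a module is not callable")

from json import loads  # correct: binds the *function* json.loads
print(loads('{"a": 1}'))  # {'a': 1}
```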
Ah! I hadn't spotted that! Unfortunately I'm still left with this JSONDecodeError: Expecting value: line 2 column 1 (char 1) for the line:
response_dict = json.loads(soup.find("body").text)
I've printed the soup and it seems to be picking up all the required html, so maybe it's a problem with trying to convert the text and read it as .json?
Here's some of the soupiness:
<html class="js jobnet no-touchevents fontface generatedcontent svg transform3d transform no-touch transition transitionend input-search device-desktop orientation-landscape min-x-small min-small min-medium min-large max-large only-large noScroll ng-scope" data-build="2020.2.0.397" data-ng-app="Jobnet" lang="da" style=""><head class="ng-isolate-scope" data-jn-header-manager=""><style type="text/css">.ng-animate.item:not(.left):not(.right){-webkit-transition:0s ease-in-out left;transition:0s ease-in-out left}</style><style type="text/css">...
@TomGaut , you need to make sure that what comes out of soup.find("body").text looks like JSON, so that json.loads() does not throw an error. Are you scraping the correct links?
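A quick way to check is to guard the parse and inspect what the page actually returned when decoding fails; a minimal sketch (the helper name is made up for illustration):

```python
import json

def try_parse_body(text):
    """Return the parsed JSON, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Show the first characters to reveal what came back instead of JSON
        # (e.g. an HTML cookie-consent page).
        print("Not JSON, starts with:", text[:80])
        return None

print(try_parse_body('{"JobPositionPostings": []}'))  # {'JobPositionPostings': []}
print(try_parse_body('<html>cookie banner</html>'))   # None
```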
It seems not since the browser loads onto a cookie page and stops there. I think I would have to code something to click past that and then begin searching from there. Thanks for the help! I think in this instance it's probably easier to go direct to the page. :)
Hey,
I'm struggling with storing the response I get from the connector.get() method as a .json file and converting it into a pandas DataFrame. I've read in the comments at the bottom of the class that this method doesn't produce an object when using selenium, and states I should get the html using 'connector.browser.page_source'. I've attached my code below which works when just using the requests.get() method, and I was wondering whether you could point me in the right direction here.
Many thanks, Tom