melody1126 opened 2 years ago
Collecting data from websites that require login can be complicated. A common approach is to use the Selenium package. For instance, the following code automatically logs in to GitHub using Selenium. This lets you access and collect any content on GitHub that requires login. (Note that the Selenium package may not work well on Google Colab.)
[1] Install Selenium and chromedriver
pip install selenium
brew install chromedriver
The brew command is for Mac users; Windows users should run pip install selenium and manually download chromedriver (see https://chromedriver.chromium.org/home).
[2] Run the following Python code
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys  # needed for Keys.ENTER below
from selenium.webdriver.chrome.service import Service

# Selenium 4 syntax: pass the chromedriver path via a Service object
# (the older executable_path argument and find_element_by_id method were removed).
# Adjust the path to wherever chromedriver is installed on your system.
driver = webdriver.Chrome(service=Service('/opt/homebrew/bin/chromedriver'))
driver.get('https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2FUChicago-Computational-Content-Analysis%2FFrequently-Asked-Questions')
driver.find_element(By.ID, 'login_field').send_keys('Your GitHub ID')
driver.find_element(By.ID, 'password').send_keys('Your GitHub PW')
driver.find_element(By.ID, 'password').send_keys(Keys.ENTER)
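Once the browser is logged in, one lightweight pattern for bulk collection is to export the browser's session cookies and reuse them with the requests library, rather than driving the browser for every page. This is only a sketch, not part of the original recipe: selenium_cookies_to_dict is a hypothetical helper, and the sample data below merely imitates the shape of what driver.get_cookies() returns (the values are made up).

```python
# Hypothetical helper: convert the list of cookie dicts that
# driver.get_cookies() returns into the {name: value} mapping
# that a requests.Session can consume.
def selenium_cookies_to_dict(cookies):
    return {c["name"]: c["value"] for c in cookies}

# Sample data imitating the shape of driver.get_cookies() output
# after a successful login (values are invented for illustration).
sample_cookies = [
    {"name": "user_session", "value": "abc123", "domain": ".github.com"},
    {"name": "logged_in", "value": "yes", "domain": ".github.com"},
]

print(selenium_cookies_to_dict(sample_cookies))
# {'user_session': 'abc123', 'logged_in': 'yes'}
```

With a real driver you would pass driver.get_cookies() to the helper and load the result into a session with session.cookies.update(...), so subsequent session.get(...) calls carry the login.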
When we use a weblink that is not open to the public (unlike Wikipedia) but requires login (like JSTOR, or any of the databases on the UChicago Library site), the link contains something like "proxy.uchicago.edu" and scraping returns the following:
"Shibboleth Authentication Request If your browser does not continue automatically, click ..."
How can we get around this?
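Whatever the workaround, it helps to detect this failure mode programmatically: the proxy returns an SSO interstitial page instead of the target content, so a scraper can check the HTML for the Shibboleth marker before parsing. A minimal sketch (the helper name is hypothetical):

```python
# Text that appears on the proxy's SSO interstitial page,
# as quoted in the scraping output above.
SHIBBOLETH_MARKER = "Shibboleth Authentication Request"

def hit_login_wall(html):
    """Return True if the scraped page is the proxy's SSO
    interstitial rather than the content we asked for."""
    return SHIBBOLETH_MARKER in html

# The quoted interstitial text trips the check:
print(hit_login_wall("Shibboleth Authentication Request If your "
                     "browser does not continue automatically, click ..."))  # True
# Ordinary page content does not:
print(hit_login_wall("<html><body>Actual article text</body></html>"))  # False
```

Skipping (or retrying) pages that trip this check keeps the interstitial HTML out of your collected data while you sort out the authentication itself.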