Open pokapow opened 4 years ago
Thanks for your help, I'm on it
Yes! Maybe this library would be easier and faster than Selenium, and it uses Requests, as you do: https://github.com/Anorov/cloudflare-scrape
I'm using this one: https://github.com/VeNoMouS/cloudscraper. It's updated more often than cloudflare-scrape.
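For reference, a minimal sketch of how cloudscraper is typically used: `create_scraper()` returns an object that behaves like a `requests.Session`. The `fetch_page` helper and its `session` parameter are only illustrative names and are not part of this scraper.

```python
def fetch_page(url, session=None, timeout=30):
    """Fetch a page and return its HTML, using cloudscraper by default."""
    if session is None:
        # Imported lazily so this helper can be loaded without cloudscraper.
        import cloudscraper

        # create_scraper() returns a requests.Session subclass that tries to
        # solve Cloudflare's challenge page before returning the response.
        session = cloudscraper.create_scraper()

    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of returning an error page
    return response.text
```

Passing an existing session keeps cookies between calls, which matters once the challenge has been solved.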
I've tried some requests with Selenium. It works but is very unpleasant to use, so I will use it only as a last resort.
I think the Selenium version is operational; all the methods are the same.
The only difference is that at creation you need a YggTorrentScraperSelenium object.
This example should work:
from yggtorrentscraper import YggTorrentScraperSelenium
from selenium import webdriver

if __name__ == "__main__":
    options = webdriver.ChromeOptions()
    options.add_argument("--log-level=3")
    options.add_argument("--disable-blink-features")
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_experimental_option("excludeSwitches", ["enable-logging"])

    # Raw strings keep the backslash in the Windows path literal.
    driver = webdriver.Chrome(r"D:\chromedriver.exe", options=options)
    scraper = YggTorrentScraperSelenium(driver=driver)
    # or
    scraper = YggTorrentScraperSelenium(driver_path=r"D:\chromedriver.exe")

    if scraper.login("myidentifiant", "mypassword"):
        print("Login success")
        torrents_url = scraper.search({"name": "walking dead"})
        print(torrents_url)
    else:
        print("Login failed")
Yes, this example works for me, but only with a graphical Chrome window launching.
If I add this option:
options.add_argument("--headless")
The example hangs until it times out. When I interrupt it, I get:
Traceback (most recent call last):
  File "selenium-test.py", line 22, in <module>
    if scraper.login("user", "pass"):
  File "/home/blabla/Bureau/YggTorrentScraper/yggtorrentscraperr/yggtorrentscraper_selenium.py", line 122, in login
    EC.presence_of_element_located((By.CSS_SELECTOR, "#title"))
  File "/home/blabla/.local/lib/python3.6/site-packages/selenium/webdriver/support/wait.py", line 77, in until
    time.sleep(self._poll)
KeyboardInterrupt
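The traceback shows the script sitting inside `WebDriverWait.until`, waiting for `#title` to appear. A small sketch of a bounded wait that gives up and returns `None` rather than hanging may help with debugging. The `wait_for_element` name and the 15-second default are assumptions; this is not how the library itself handles the wait.

```python
def wait_for_element(driver, css_selector, timeout=15):
    """Return the element once it is present, or None if the wait times out."""
    # Imported lazily so the helper can be defined without Selenium installed.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    try:
        # Same condition as in the traceback, but with an explicit upper bound.
        return WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
    except TimeoutException:
        # The element never showed up, for example because the page is still
        # showing the Cloudflare challenge.
        return None
```

When it returns `None`, dumping `driver.page_source` to a file is a quick way to see which page the headless browser actually received.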
Same for me, which is why I don't use it. In your example you use the headless option, but with Firefox; maybe it's a problem only with Chrome.
If I test with Firefox/Gecko, even without headless, I get errors:
Traceback (most recent call last):
  File "selenium-test.py", line 22, in <module>
    if scraper.login("user", "pass"):
  File "/home/blabla/Bureau/YggTorrentScraper/yggtorrentscraperr/yggtorrentscraper_selenium.py", line 131, in login
    input_identifiant.clear()
  File "/home/poka/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webelement.py", line 95, in clear
    self._execute(Command.CLEAR_ELEMENT)
  File "/home/poka/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webelement.py", line 633, in _execute
    return self._parent.execute(command, params)
  File "/home/poka/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/home/poka/.local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.ElementNotInteractableException: Message: Element <input name="id" type="text"> could not be scrolled into view
One effective way to run headless (without --headless) is to use Xvfb:
❯ Xvfb -ac :99 -screen 0 1280x1024x16 &
❯ export DISPLAY=:99
❯ python scrape/scrape.py
05/31/2020 02:27:32 PM INFO: Attempting login
05/31/2020 02:27:47 PM INFO: Login success
enjoy :)
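The same Xvfb setup could also be driven from Python instead of the shell. This is only a sketch: `xvfb_command` and `virtual_display` are hypothetical helper names, the one-second startup delay is a guess, and Xvfb must already be installed and on the PATH.

```python
import os
import subprocess
import time
from contextlib import contextmanager


def xvfb_command(display=":99", width=1280, height=1024, depth=16):
    """Build the same Xvfb command line as in the shell example."""
    return ["Xvfb", "-ac", display, "-screen", "0", f"{width}x{height}x{depth}"]


@contextmanager
def virtual_display(display=":99", startup_delay=1.0):
    """Run Xvfb for the duration of the block and point DISPLAY at it."""
    process = subprocess.Popen(xvfb_command(display))
    previous = os.environ.get("DISPLAY")
    os.environ["DISPLAY"] = display
    try:
        time.sleep(startup_delay)  # crude wait for the X server to come up
        yield display
    finally:
        # Restore the previous DISPLAY value and stop the X server.
        if previous is None:
            os.environ.pop("DISPLAY", None)
        else:
            os.environ["DISPLAY"] = previous
        process.terminate()
        process.wait()
```

The driver would then be created inside a `with virtual_display():` block, so the browser opens on the virtual screen rather than on the desktop.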
If the driver uses the headless option, the Cloudflare challenge is not resolved. For now, the simplest solution I've found is to add:
driver.set_window_position(-10000, 0)
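To show where that call could fit, here is a tiny sketch that wraps it in a helper. The `hide_browser_window` name is made up for illustration, and it assumes the call is made right after the driver is created, before any page is loaded.

```python
def hide_browser_window(driver, x=-10000, y=0):
    """Move the browser window off-screen instead of running headless."""
    # The browser still renders normally; the window is simply placed far
    # outside the visible desktop area.
    driver.set_window_position(x, y)
    return driver
```

For example: `driver = hide_browser_window(webdriver.Chrome(options=options))`, then pass `driver` to `YggTorrentScraperSelenium(driver=driver)` as in the earlier example.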
I'm looking for ways to hide Selenium from Cloudflare, but no results so far. See for example: https://stackoverflow.com/questions/55364643/headless-browser-detection
There is also this solution: https://stackoverflow.com/a/42851877
If the driver uses the headless option, the Cloudflare challenge is not resolved
Yes, my first example uses the headless option.
It's working
This scraper doesn't work anymore since YggTorrent started using Cloudflare to detect bots.
Here is a way to bypass it, using Selenium WebDriver and Gecko with JavaScript enabled:
https://git.p2p.legal/axiom-team/astroport-iptubes/src/master/yggcrawl/gecko/torrent_search.py
This needs more resources and takes a few seconds per request, but it works. There is no way to detect this (except with a CAPTCHA at the start of each session, which is too restrictive for users, so I don't think they will do it).
This is just an example of how to integrate this into your request mechanism.
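As a rough illustration of the integration idea (this is not the code from the linked file), a Selenium driver could be hidden behind a small Requests-like facade, so the existing parsing code keeps calling `.get(url)` and reading `.text`. `BrowserSession` and `BrowserResponse` are hypothetical names.

```python
class BrowserResponse:
    """Minimal stand-in for a requests.Response: only url and text."""

    def __init__(self, url, text):
        self.url = url
        self.text = text


class BrowserSession:
    """Requests-like facade over an already created Selenium driver."""

    def __init__(self, driver):
        self.driver = driver

    def get(self, url):
        # Let the real browser load the page, so any JavaScript it needs runs.
        self.driver.get(url)
        # page_source is the HTML of the page as currently loaded in the browser.
        return BrowserResponse(self.driver.current_url, self.driver.page_source)
```

With something like `session = BrowserSession(webdriver.Firefox())`, code that previously did `session.get(url).text` would not need to change, at the cost of a few seconds per request.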