Harkame / YggTorrentScraper

YggTorrent scraper
GNU General Public License v3.0

Bypass cloudflare bot detection #15

Open pokapow opened 4 years ago

pokapow commented 4 years ago

This scraper no longer works since YggTorrent started using Cloudflare to detect bots.

Here is a way to bypass it, using Selenium WebDriver with Gecko (Firefox) and JavaScript enabled:

https://git.p2p.legal/axiom-team/astroport-iptubes/src/master/yggcrawl/gecko/torrent_search.py

This needs more resources and takes a few seconds per request, but it works. I see no way for them to detect this (except a captcha at the start of each session, which would be too burdensome for users, so I don't think they will do that).

This is just an example of how to integrate it into your request mechanism.

Harkame commented 4 years ago

Thanks for your help, I'm on it

pokapow commented 4 years ago

Yes! Maybe this lib would be easier and faster than Selenium, and it uses Requests as you do: https://github.com/Anorov/cloudflare-scrape

Harkame commented 4 years ago

I'm using this one: https://github.com/VeNoMouS/cloudscraper. It's updated more often than cloudflare-scrape.

I've tried some requests with Selenium. It works but is very unpleasant, so I will use it as a last resort.
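The order of preference described above (a requests-based Cloudflare library first, Selenium only as a last resort) can be sketched as a small fallback chain. This is a minimal illustration, not the scraper's actual code: `fetch_with_cloudscraper` and `fetch_with_fallback` are hypothetical names, and the only library call assumed is `cloudscraper.create_scraper()`, which returns a requests-style session.

```python
from typing import Callable, Iterable, Optional


def fetch_with_cloudscraper(url: str) -> str:
    """Fetch a page through cloudscraper (third-party, imported lazily)."""
    import cloudscraper  # pip install cloudscraper

    session = cloudscraper.create_scraper()  # behaves like a requests.Session
    response = session.get(url)
    response.raise_for_status()
    return response.text


def fetch_with_fallback(url: str, fetchers: Iterable[Callable[[str], str]]) -> str:
    """Try each fetcher in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for fetch in fetchers:
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose: any failure triggers the next fetcher
            last_error = error
    raise RuntimeError(f"all fetchers failed for {url}") from last_error
```

A Selenium-based fetcher would go last in the list, for example `fetch_with_fallback(url, [fetch_with_cloudscraper, fetch_with_selenium])`, where `fetch_with_selenium` is a function you would write yourself.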

Harkame commented 4 years ago

I think the Selenium version is operational; all methods are the same.

The only difference is that at creation you need a YggTorrentScraperSelenium object.

This example should work:

from yggtorrentscraper import YggTorrentScraperSelenium
from selenium import webdriver

if __name__ == "__main__":
  options = webdriver.ChromeOptions()
  options.add_argument("--log-level=3")
  options.add_argument("--disable-blink-features")
  options.add_argument("--disable-blink-features=AutomationControlled")
  options.add_experimental_option("excludeSwitches", ["enable-logging"])

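  # Raw strings keep the backslash in the Windows path from being read as an escape.
  driver = webdriver.Chrome(r"D:\chromedriver.exe", options=options)

  scraper = YggTorrentScraperSelenium(driver=driver)
  # or let the scraper create the driver itself (use one or the other, not both):
  # scraper = YggTorrentScraperSelenium(driver_path=r"D:\chromedriver.exe")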

  if scraper.login("myidentifiant", "mypassword"):
    print("Login success")
    torrents_url = scraper.search({"name": "walking dead"})
    print(torrents_url)
  else:
    print("Login failed")

pokapow commented 4 years ago

Yes, this example works for me, but it launches a graphical Chrome window.

If I add this option:

options.add_argument("--headless")

the example times out, and when I interrupt it I get:

Traceback (most recent call last):
  File "selenium-test.py", line 22, in <module>
    if scraper.login("user", "pass"):
  File "/home/blabla/Bureau/YggTorrentScraper/yggtorrentscraperr/yggtorrentscraper_selenium.py", line 122, in login
    EC.presence_of_element_located((By.CSS_SELECTOR, "#title"))
  File "/home/blabla/.local/lib/python3.6/site-packages/selenium/webdriver/support/wait.py", line 77, in until
    time.sleep(self._poll)
KeyboardInterrupt

Harkame commented 4 years ago

Same for me, which is why I don't use it. Your example uses the headless option, but with Firefox, so maybe it's a problem only with Chrome.
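The headless traceback above was interrupted inside `WebDriverWait.until`, which polls a condition and sleeps between attempts until a timeout expires. If the Cloudflare page never gives way to the login page, the `#title` element never appears and the wait runs for its full duration, which looks like a hang. The sketch below is a simplified, standard-library stand-in for that kind of polling loop. It is not Selenium's implementation, and `wait_until` and its parameters are illustrative names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(condition: Callable[[], T], timeout: float = 10.0, poll: float = 0.5) -> T:
    """Call `condition` repeatedly until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)  # the interrupted traceback was sitting in the equivalent of this line
```

With a short timeout the failure surfaces quickly as an exception instead of a long silent wait, which makes the headless problem easier to diagnose.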

pokapow commented 4 years ago

If I test with Firefox/Gecko, even without headless, I get errors:

Traceback (most recent call last):
  File "selenium-test.py", line 22, in <module>
    if scraper.login("user", "pass"):
  File "/home/blabla/Bureau/YggTorrentScraper/yggtorrentscraperr/yggtorrentscraper_selenium.py", line 131, in login
    input_identifiant.clear()
  File "/home/poka/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webelement.py", line 95, in clear
    self._execute(Command.CLEAR_ELEMENT)
  File "/home/poka/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webelement.py", line 633, in _execute
    return self._parent.execute(command, params)
  File "/home/poka/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/home/poka/.local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.ElementNotInteractableException: Message: Element <input name="id" type="text"> could not be scrolled into view

askz commented 4 years ago

One effective way to run headless (without --headless) is to use Xvfb:

❯ Xvfb -ac :99 -screen 0 1280x1024x16 &
❯ export DISPLAY=:99
❯ python scrape/scrape.py
05/31/2020 02:27:32 PM INFO: Attempting login
05/31/2020 02:27:47 PM INFO: Login success

enjoy :)
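The same Xvfb setup can be driven from Python instead of the shell. This is a rough sketch assuming Xvfb is installed and on the PATH. `xvfb_command` and `start_xvfb` are hypothetical helper names, and the display number and screen geometry are copied from the shell session above.

```python
import os
import subprocess
from typing import List


def xvfb_command(display: str = ":99", width: int = 1280,
                 height: int = 1024, depth: int = 16) -> List[str]:
    """Build the same Xvfb command line as in the shell session above."""
    return ["Xvfb", "-ac", display, "-screen", "0", f"{width}x{height}x{depth}"]


def start_xvfb(display: str = ":99") -> subprocess.Popen:
    """Start Xvfb in the background and point this process at its display."""
    process = subprocess.Popen(xvfb_command(display))
    # Browsers launched afterwards inherit DISPLAY and open on the virtual screen.
    os.environ["DISPLAY"] = display
    return process
```

Call `start_xvfb()` before creating the Selenium driver and `terminate()` the returned process when the scraper is done. Xvfb may need a moment to come up, so a short delay or a retry before launching the browser can help.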

Harkame commented 4 years ago

If the driver uses the headless option, the Cloudflare challenge is not resolved. For now, the simplest solution I've found is to add:

driver.set_window_position(-10000, 0)

I'm looking for options to hide Selenium from Cloudflare, like those at https://stackoverflow.com/questions/55364643/headless-browser-detection, but no results for now.
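The off-screen window workaround can be wrapped in a small helper so it is applied right after the driver is created. This is only a sketch: `move_window_off_screen` is a hypothetical name, and the only call it relies on is the driver's `set_window_position(x, y)` method with the coordinates quoted above.

```python
def move_window_off_screen(driver, x: int = -10000, y: int = 0) -> None:
    """Move the browser window outside the visible desktop instead of using --headless."""
    # The browser still renders normally, so the Cloudflare check behaves as in a
    # graphical session; the window is simply placed where the user cannot see it.
    driver.set_window_position(x, y)
```

For example, call `move_window_off_screen(driver)` once, immediately after `webdriver.Chrome(...)` returns and before `scraper.login(...)`.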

askz commented 4 years ago

There is also this solution: https://stackoverflow.com/a/42851877

pokapow commented 4 years ago

"If the driver use headless option, cloudflare is not resolved"

Yes, but my first example uses the headless option.

It works.