elvisyjlin / media-scraper

Scrapes all photos and videos in a web page / Instagram / Twitter / Tumblr / Reddit / pixiv / TikTok
MIT License
370 stars 49 forks source link

media scraper does not work #24

Open james16000 opened 2 years ago

james16000 commented 2 years ago

hi Windows 7 Python 3.7.6 pip 21.3.1

It does not work on Reddit, Twitter, Instagram and Tik Tok After loading the repository and run it on cmd I wanted to try it on Instagram, Tiktok, Twitter and Reddit I see this problem on Instagram

I will give you screenshots of the problem

C:\Users\pc\Desktop\media-scraper>python -m mediascraper.instagram instagram Starting PhantomJS web driver... .\webdriver/phantomjsdriver_2.1.1_win32/phantomjs.exe C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\phantomjs\webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headle ss versions of Chrome or Firefox instead warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless ' Crawling... Traceback (most recent call last): File "C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 193, in _run_module_as_main "__main__", mod_spec) File "C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 85, in _run_code exec(code, run_globals) File "C:\Users\pc\Desktop\media-scraper\mediascraper\instagram.py", line 16, in <module> tasks = scraper.scrape(username) File "C:\Users\pc\Desktop\media-scraper\mediascrapers.py", line 262, in scrape tasks += (task[0], username, task[1]) IndexError: list index out of range

1

C:\Users\pc\Desktop\media-scraper>python m-scraper.py rq instagram instagram Namespace(credential_file=None, early_stop=False, keywords=['instagram'], save_path=None) Instagramer Task: instagram Traceback (most recent call last): File "m-scraper.py", line 36, in <module> scraper.run(sys.argv[3:]) File "C:\Users\pc\Desktop\media-scraper\m_scraper\rq\downloader.py", line 82, in run self.crawl(keyword, args.early_stop) File "C:\Users\pc\Desktop\media-scraper\m_scraper\rq\instagramer.py", line 46, in crawl tasks, end_cursor, has_next, length, user_id, rhx_gis, csrf_token = get_first_page(username) File "C:\Users\pc\Desktop\media-scraper\m_scraper\rq\utils\instagram.py", line 52, in get_first_page rhx_gis = shared_data['rhx_gis'] KeyError: 'rhx_gis'

2

tiktok

C:\Users\pc\Desktop\media-scraper>python m-scraper.py rq tiktok tiktok Namespace(credential_file=None, early_stop=False, keywords=['tiktok'], save_path=None) {'statusCode': 10000, 'verifyConfig': {'code': 10000, 'type': 'verify', 'subtype': 'slide', 'fp': 'verify_dd9489ac31d2e6b50a4f9ed75b5240f2', 'region': 'va', 'detail': 'vEyCkJEKBnSe-zq257GFQJrLW03-aOs8 awmNop3PD5IGQA4kjoDDIU6NDQKG7BnEsMWT8C-WHIUjsfHZ9OMgl9009Qcdo2LIOBhGJyNK118AOCRmw8StlADDjuzkZrFHFDTHnSgp2x651wwrNM6-FYFCOlP0izZx6n*pCjcMIM1sjOh0zwAye*FM5lPnHiVJ1eER3KmM*q6VpyCU*uNyTeYkaDpcFOMdgP3br0Hl sWO--*jeaUPVnSjP8RejdrEQgq7oLsXM4rjjf14GhyWBa0H8kj*LODz42UoKrM32r4Fm6VjEAoEeRrjmHVUkwbwAptLOsfmREJTSdtNToMx6t4NqBXWm0mJ24vXdY9Txp83rH49pTmZE1wbEupTi18B1Tw..'}} Traceback (most recent call last): File "m-scraper.py", line 36, in <module> scraper.run(sys.argv[3:]) File "C:\Users\pc\Desktop\media-scraper\m_scraper\rq\downloader.py", line 82, in run self.crawl(keyword, args.early_stop) File "C:\Users\pc\Desktop\media-scraper\m_scraper\rq\tiktoker.py", line 38, in crawl raise Exception('body not found') Exception: body not found 3

twitter

C:\Users\pc\Desktop\media-scraper>python -m mediascraper.twitter twitter Starting PhantomJS web driver... .\webdriver/phantomjsdriver_2.1.1_win32/phantomjs.exe C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\phantomjs\webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headle ss versions of Chrome or Firefox instead warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless ' Crawling... Traceback (most recent call last): File "C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 193, in _run_module_as_main "__main__", mod_spec) File "C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 85, in _run_code exec(code, run_globals) File "C:\Users\pc\Desktop\media-scraper\mediascraper\twitter.py", line 18, in <module> tasks = scraper.scrape(username) File "C:\Users\pc\Desktop\media-scraper\mediascrapers.py", line 392, in scrape done = self.scrollToBottom() File "C:\Users\pc\Desktop\media-scraper\mediascrapers.py", line 87, in scrollToBottom last_height, new_height = self._driver.execute_script("return document.body.scrollHeight"), 0 File "C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 636, in execute_script 'args': converted_args})['value'] File "C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute self.error_handler.check_response(response) File "C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response raise exception_class(message, screen, stacktrace) selenium.common.exceptions.WebDriverException: Message: {"errorMessage":"Refused to evaluate a string as JavaScript because 'unsafe-eval' is not an allowed source of script in the following Content Se curity Policy directive: \"script-src 'self' 'unsafe-inline' https://*.twimg.com https://recaptcha.net/recaptcha/ https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://www.googl e-analytics.com https://twitter.com https://app.link https://accounts.google.com/gsi/client https://appleid.cdn-apple.com/appleauth/static/jsapi/appleid/1/en_US/appleid.auth.js 'nonce-MDk3YWFmNWEtOGR kZS00NGQzLWE2MjMtZjUzNzhjMGZhZGJl'\".\n","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Content-Length":"112","Content-Type":"application/json;charset=UTF-8","Host":"1 27.0.0.1:55938","User-Agent":"selenium/3.141.0 (python windows)"},"httpVersion":"1.1","method":"POST","post":"{\"script\": \"return document.body.scrollHeight\", \"args\": [], \"sessionId\": \"7c44f2b 0-7a92-11ec-a2dc-d38583c010ce\"}","url":"/execute","urlParsed":{"anchor":"","query":"","file":"execute","directory":"/","path":"/execute","relative":"/execute","port":"","host":"","password":"","user" :"","userInfo":"","authority":"","protocol":"","source":"/execute","queryKey":{},"chunks":["execute"]},"urlOriginal":"/session/7c44f2b0-7a92-11ec-a2dc-d38583c010ce/execute"}} Screenshot: available via screen 4

I also tried set the web driver to 777 for convenience. C:\Users\pc\Desktop\media-scraper>chmod 777 webdriver/phantomjsdriver_2.1.1_win32/phantomjs.exe 'chmod' is not recognized as an internal or external command, operable program or batch file.

5

I wish this tool was working because it would be the best tool on the internet Greetings to all

@elvisyjlin