Open Purefreeman opened 6 years ago
The message "The file [img path] exists. Skip it." probably occurs because mediascraper detects some media twice. The skipped images should already exist in your download folder.
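If duplicate detection is indeed the cause, one possible fix would be to deduplicate the collected media URLs before downloading. This is only a minimal sketch, not mediascraper's actual code: `dedupe_media` and the example URLs are hypothetical, and it assumes the crawler ends up with a plain list of URL strings.

```python
def dedupe_media(urls):
    """Return urls with duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        # Keep only the first occurrence of each URL.
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


# Hypothetical example: the same image was detected twice while crawling.
found = [
    "https://example.com/media/a.jpg",
    "https://example.com/media/b.jpg",
    "https://example.com/media/a.jpg",
]
unique = dedupe_media(found)
print(len(found), "detected,", len(unique), "unique")
```

With a step like this, the "media are found" count would reflect unique files, and the skip message would only appear for files left over from an earlier run.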
However, I found another interesting bug that I haven't dug into yet: the program detects a different number of media each time it crawls the same account. I think the reason is that Twitter enforces a rate limit, so the crawler may get no media once you exceed it.
For now, pausing the download or crawling from another IP would be a workaround.
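The pausing workaround could be sketched roughly as follows. This is an illustration under assumptions, not part of mediascraper: `download_with_pauses`, `fetch`, and the batch size and delay values are all hypothetical, and whether pausing actually avoids the suspected rate limit is untested.

```python
import time


def download_with_pauses(urls, fetch, pause_every=50, pause_seconds=60.0,
                         sleep=time.sleep):
    """Call fetch(url) for each URL, pausing after every pause_every items.

    fetch is any callable that downloads one URL (hypothetical here).
    sleep is injectable so the pause can be stubbed out when testing.
    """
    results = []
    for index, url in enumerate(urls, start=1):
        results.append(fetch(url))
        # Pause between batches, but not after the final item.
        if index % pause_every == 0 and index < len(urls):
            sleep(pause_seconds)
    return results


# Usage with stand-ins: a fake fetch and a sleep that only records the delay.
pauses = []
fetched = download_with_pauses(
    ["u1", "u2", "u3", "u4", "u5"],
    fetch=lambda url: url.upper(),
    pause_every=2,
    pause_seconds=1.5,
    sleep=pauses.append,
)
print(fetched, pauses)
```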
The first time I crawled kheshig
python3 -m mediascraper.twitter kheshig
I got
Starting PhantomJS web driver... ./webdriver/phantomjsdriver_2.1.1_linux64/phantomjs /usr/local/lib/python3.5/dist-packages/selenium/webdriver/phantomjs/webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless ' Crawling... 236 media are found. Downloading... 81%|#########################################2 | 191/236 [03:53<00:55, 1.22s/it]The file download/twitter/kheshig/DfWRZtAX0AE6aos.jpg exists. Skip it. The file download/twitter/kheshig/DfWRaD3XUAU7DVc.jpg exists. Skip it. The file download/twitter/kheshig/DfWRahRXkAAOAEN.jpg exists. Skip it. The file download/twitter/kheshig/DfWRa4PWsAEjGpO.jpg exists. Skip it. The file download/twitter/kheshig/DfWRWnWXkAIbYv9.jpg exists. Skip it. The file download/twitter/kheshig/DfWRXY_WAAIOoe6.jpg exists. Skip it. The file download/twitter/kheshig/DfWRR0cXcAErec4.jpg exists. Skip it. The file download/twitter/kheshig/DfWRSd_XkAANXsq.jpg exists. Skip it. The file download/twitter/kheshig/DfWRS55X4AEGjTs.jpg exists. Skip it. The file download/twitter/kheshig/DfWQY6sXUAI6lNx.jpg exists. Skip it. The file download/twitter/kheshig/DfWQWRtXUAIZUwt.jpg exists. Skip it. The file download/twitter/kheshig/DfWQUF3W4AAzG3S.jpg exists. Skip it. The file download/twitter/kheshig/DfWQPoYXkAA6lQw.jpg exists. Skip it. The file download/twitter/kheshig/DfWQNU-XcAApfXR.jpg exists. Skip it. The file download/twitter/kheshig/DfWQIyOXUAASqVr.jpg exists. Skip it. The file download/twitter/kheshig/DfWQG06X4AANiOp.jpg exists. Skip it. The file download/twitter/kheshig/DfWQEspXcAIxbC.jpg exists. Skip it. The file download/twitter/kheshig/DfWQCf6X4AAlqcF.jpg exists. Skip it. The file download/twitter/kheshig/DfWQA-SWAAAh8ED.jpg exists. Skip it. The file download/twitter/kheshig/DfWP8ocWAAELICF.jpg exists. Skip it. The file download/twitter/kheshig/DfWP5HlWsAAwcU5.jpg exists. Skip it. 
The file download/twitter/kheshig/DfWP1yyWAAACjNB.jpg exists. Skip it. The file download/twitter/kheshig/DfWMr6MWAAUgcB.jpg exists. Skip it. The file download/twitter/kheshig/DfWMk8nW4AAtnSc.jpg exists. Skip it. The file download/twitter/kheshig/DfWMZ0bW0AAI1pX.jpg exists. Skip it. The file download/twitter/kheshig/DfWMaOvW0AQnPPs.jpg exists. Skip it. The file download/twitter/kheshig/DfWMazDX0AUMO8p.jpg exists. Skip it. The file download/twitter/kheshig/DfWMbOMX0AAYlZb.jpg exists. Skip it. The file download/twitter/kheshig/DfWMScQXcAolR1z.jpg exists. Skip it. The file download/twitter/kheshig/DfWMOBIW4AANRLS.jpg exists. Skip it. The file download/twitter/kheshig/DfWMObHW0AELSx-.jpg exists. Skip it. The file download/twitter/kheshig/DfWLTwNX4AAYMFn.jpg exists. Skip it. The file download/twitter/kheshig/DfWJnIWkAIW7Ok.jpg exists. Skip it. The file download/twitter/kheshig/DfWEoQ8WkAE6s0l.jpg exists. Skip it. The file download/twitter/kheshig/DfWEJd1XkAE6Edp.jpg exists. Skip it. The file download/twitter/kheshig/DfWEKmeW0AIpDNs.jpg exists. Skip it. The file download/twitter/kheshig/DfWELsWWAAA7xcn.jpg exists. Skip it. The file download/twitter/kheshig/DfWEBQMW0AEQVUW.jpg exists. Skip it. The file download/twitter/kheshig/DfWEBy8XUAIbLEx.jpg exists. Skip it. The file download/twitter/kheshig/DfWECRkWsAAee42.jpg exists. Skip it. The file download/twitter/kheshig/DfWEDTPXUAAGRiw.jpg exists. Skip it. The file download/twitter/kheshig/DfWD8NMWsAAbwwa.jpg exists. Skip it. The file download/twitter/kheshig/DfWD8u2XUAEFpVb.jpg exists. Skip it. The file download/twitter/kheshig/DfWD9JMW0AElLi.jpg exists. Skip it. The file download/twitter/kheshig/DfWD9kBX4AAcS-e.jpg exists. Skip it. 100%|###################################################| 236/236 [03:53<00:00, 1.01it/s]
Some images are detected a second time, so they are skipped later in the download.
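To see which files were skipped in a run, the skip messages can be pulled out of the saved console output. This is a rough sketch: `skipped_files` is a hypothetical helper, and the regular expression assumes the message format shown in the log above.

```python
import re

# Matches one skip message and captures the file path it names.
SKIP_PATTERN = re.compile(r"The file (\S+) exists\. Skip it\.")


def skipped_files(log_text):
    """Return the paths named in 'exists. Skip it.' messages, in order."""
    return SKIP_PATTERN.findall(log_text)


# Two messages copied from the log above, joined on one line as printed.
sample = (
    "The file download/twitter/kheshig/DfWRZtAX0AE6aos.jpg exists. Skip it. "
    "The file download/twitter/kheshig/DfWRaD3XUAU7DVc.jpg exists. Skip it."
)
paths = skipped_files(sample)
print(len(paths), "skipped")
```

The resulting count can be compared with the gap between the progress bar's position when the messages start (191/236 in the first run) and the total.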
Then I deleted all media under the kheshig folder:
rm download/twitter/kheshig/*
Then I crawled again 30 minutes later.
python3 -m mediascraper.twitter kheshig
And I got
Starting Phantom JS web driver... ./webdriver/phantomjsdriver_2.1.1_linux64/phantomjs /usr/local/lib/python3.5/dist-packages/selenium/webdriver/phantomjs/webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless ' Crawling... 236 media are found. Downloading... 81%|#########################################2 | 191/236 [01:52<00:26, 1.70it/s]The file download/twitter/kheshig/DfWRZtAX0AE6aos.jpg exists. Skip it. The file download/twitter/kheshig/DfWRaD3XUAU7DVc.jpg exists. Skip it. The file download/twitter/kheshig/DfWRahRXkAAOAEN.jpg exists. Skip it. The file download/twitter/kheshig/DfWRa4PWsAEjGpO.jpg exists. Skip it. The file download/twitter/kheshig/DfWRWnWXkAIbYv9.jpg exists. Skip it. The file download/twitter/kheshig/DfWRXY_WAAIOoe6.jpg exists. Skip it. The file download/twitter/kheshig/DfWRR0cXcAErec4.jpg exists. Skip it. The file download/twitter/kheshig/DfWRSd_XkAANXsq.jpg exists. Skip it. The file download/twitter/kheshig/DfWRS55X4AEGjTs.jpg exists. Skip it. The file download/twitter/kheshig/DfWQY6sXUAI6lNx.jpg exists. Skip it. The file download/twitter/kheshig/DfWQWRtXUAIZUwt.jpg exists. Skip it. The file download/twitter/kheshig/DfWQUF3W4AAzG3S.jpg exists. Skip it. The file download/twitter/kheshig/DfWQPoYXkAA6lQw.jpg exists. Skip it. The file download/twitter/kheshig/DfWQNU-XcAApfXR.jpg exists. Skip it. The file download/twitter/kheshig/DfWQIyOXUAASqVr.jpg exists. Skip it. The file download/twitter/kheshig/DfWQG06X4AANiOp.jpg exists. Skip it. The file download/twitter/kheshig/DfWQEspXcAIxbC.jpg exists. Skip it. The file download/twitter/kheshig/DfWQCf6X4AAlqcF.jpg exists. Skip it. The file download/twitter/kheshig/DfWQA-SWAAAh8ED.jpg exists. Skip it. The file download/twitter/kheshig/DfWP8ocWAAELICF.jpg exists. Skip it. The file download/twitter/kheshig/DfWP5HlWsAAwcU5.jpg exists. 
Skip it. The file download/twitter/kheshig/DfWP1yyWAAACjNB.jpg exists. Skip it. The file download/twitter/kheshig/DfWMr6MWAAUgcB.jpg exists. Skip it. The file download/twitter/kheshig/DfWMk8nW4AAtnSc.jpg exists. Skip it. The file download/twitter/kheshig/DfWMZ0bW0AAI1pX.jpg exists. Skip it. The file download/twitter/kheshig/DfWMaOvW0AQnPPs.jpg exists. Skip it. The file download/twitter/kheshig/DfWMazDX0AUMO8p.jpg exists. Skip it. The file download/twitter/kheshig/DfWMbOMX0AAYlZb.jpg exists. Skip it. The file download/twitter/kheshig/DfWMScQXcAolR1z.jpg exists. Skip it. The file download/twitter/kheshig/DfWMOBIW4AANRLS.jpg exists. Skip it. The file download/twitter/kheshig/DfWMObHW0AELSx-.jpg exists. Skip it. The file download/twitter/kheshig/DfWLTwNX4AAYMFn.jpg exists. Skip it. The file download/twitter/kheshig/DfWJnIWkAIW7Ok.jpg exists. Skip it. The file download/twitter/kheshig/DfWEoQ8WkAE6s0l.jpg exists. Skip it. The file download/twitter/kheshig/DfWEJd1XkAE6Edp.jpg exists. Skip it. The file download/twitter/kheshig/DfWEKmeW0AIpDNs.jpg exists. Skip it. The file download/twitter/kheshig/DfWELsWWAAA7xcn.jpg exists. Skip it. The file download/twitter/kheshig/DfWEBQMW0AEQVUW.jpg exists. Skip it. The file download/twitter/kheshig/DfWEBy8XUAIbLEx.jpg exists. Skip it. The file download/twitter/kheshig/DfWECRkWsAAee42.jpg exists. Skip it. The file download/twitter/kheshig/DfWEDTPXUAAGRiw.jpg exists. Skip it. The file download/twitter/kheshig/DfWD8NMWsAAbwwa.jpg exists. Skip it. The file download/twitter/kheshig/DfWD8u2XUAEFpVb.jpg exists. Skip it. The file download/twitter/kheshig/DfWD9JMW0AElLi.jpg exists. Skip it. The file download/twitter/kheshig/DfWD9kBX4AAcS-e.jpg exists. Skip it. 100%|###################################################| 236/236 [01:52<00:00, 2.10it/s]
Got exactly the same result.
Is there a reason why, if I delete all files from a folder (or the entire folder), let's say download\twitter\kheshig, and then try to recrawl the account, it says some files already exist and skips them? This could be a useful feature if you don't want the crawler to re-download pictures/videos you deleted, since you might not want to keep deleting things every time you use the crawler. Is there a way to delete the stored data about the downloaded files? Only 44 files were downloaded out of 100 (the number the crawler should have registered).
python -m mediascraper.twitter kheshig
Starting PhantomJS web driver... .\webdriver/phantomjsdriver_2.1.1_win32/phantomjs.exe C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\phantomjs\webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless ' Logging in as "Kheshig"... Crawling... 90 media are found. Downloading... 49%|██████████████████████████▉ | 44/90 [00:04<00:04, 9.50it/s]The file download/twitter\kheshig\DfWRZtAX0AE6aos.jpg exists. Skip it. The file download/twitter\kheshig\DfWRaD3XUAU7DVc.jpg exists. Skip it. The file download/twitter\kheshig\DfWRahRXkAAOAEN.jpg exists. Skip it. The file download/twitter\kheshig\DfWRa4PWsAEjGpO.jpg exists. Skip it. The file download/twitter\kheshig\DfWRWnWXkAIbYv9.jpg exists. Skip it. The file download/twitter\kheshig\DfWRXY_WAAIOoe6.jpg exists. Skip it. The file download/twitter\kheshig\DfWRR0cXcAErec4.jpg exists. Skip it. The file download/twitter\kheshig\DfWRSd_XkAANXsq.jpg exists. Skip it. The file download/twitter\kheshig\DfWRS55X4AEGjTs.jpg exists. Skip it. The file download/twitter\kheshig\DfWQY6sXUAI6lNx.jpg exists. Skip it. The file download/twitter\kheshig\DfWQWRtXUAIZUwt.jpg exists. Skip it. The file download/twitter\kheshig\DfWQUF3W4AAzG3S.jpg exists. Skip it. The file download/twitter\kheshig\DfWQPoYXkAA6lQw.jpg exists. Skip it. The file download/twitter\kheshig\DfWQNU-XcAApfXR.jpg exists. Skip it. The file download/twitter\kheshig\DfWQIyOXUAASqVr.jpg exists. Skip it. The file download/twitter\kheshig\DfWQG06X4AANiOp.jpg exists. Skip it. The file download/twitter\kheshig\DfWQEspXcAIxbC.jpg exists. Skip it. The file download/twitter\kheshig\DfWQCf6X4AAlqcF.jpg exists. Skip it. The file download/twitter\kheshig\DfWQA-SWAAAh8ED.jpg exists. Skip it. The file download/twitter\kheshig\DfWP8ocWAAELICF.jpg exists. Skip it. 
The file download/twitter\kheshig\DfWP5HlWsAAwcU5.jpg exists. Skip it. The file download/twitter\kheshig\DfWP1yyWAAACjNB.jpg exists. Skip it. The file download/twitter\kheshig\DfWMr6MWAAUgcB.jpg exists. Skip it. The file download/twitter\kheshig\DfWMk8nW4AAtnSc.jpg exists. Skip it. The file download/twitter\kheshig\DfWMZ0bW0AAI1pX.jpg exists. Skip it. The file download/twitter\kheshig\DfWMaOvW0AQnPPs.jpg exists. Skip it. The file download/twitter\kheshig\DfWMazDX0AUMO8p.jpg exists. Skip it. The file download/twitter\kheshig\DfWMbOMX0AAYlZb.jpg exists. Skip it. The file download/twitter\kheshig\DfWMScQXcAolR1z.jpg exists. Skip it. The file download/twitter\kheshig\DfWMOBIW4AANRLS.jpg exists. Skip it. The file download/twitter\kheshig\DfWMObHW0AELSx-.jpg exists. Skip it. The file download/twitter\kheshig\DfWLTwNX4AAYMFn.jpg exists. Skip it. The file download/twitter\kheshig\DfWJnIWkAIW7Ok.jpg exists. Skip it. The file download/twitter\kheshig\DfWEoQ8WkAE6s0l.jpg exists. Skip it. The file download/twitter\kheshig\DfWEJd1XkAE6Edp.jpg exists. Skip it. The file download/twitter\kheshig\DfWEKmeW0AIpDNs.jpg exists. Skip it. The file download/twitter\kheshig\DfWELsWWAAA7xcn.jpg exists. Skip it. The file download/twitter\kheshig\DfWEBQMW0AEQVUW.jpg exists. Skip it. The file download/twitter\kheshig\DfWEBy8XUAIbLEx.jpg exists. Skip it. The file download/twitter\kheshig\DfWECRkWsAAee42.jpg exists. Skip it. The file download/twitter\kheshig\DfWEDTPXUAAGRiw.jpg exists. Skip it. The file download/twitter\kheshig\DfWD8NMWsAAbwwa.jpg exists. Skip it. The file download/twitter\kheshig\DfWD8u2XUAEFpVb.jpg exists. Skip it. The file download/twitter\kheshig\DfWD9JMW0AElLi.jpg exists. Skip it. The file download/twitter\kheshig\DfWD9kBX4AAcS-e.jpg exists. Skip it. 100%|███████████████████████████████████████████████████████| 90/90 [00:04<00:00, 19.04it/s]