Neod0Matrix / PixivCrawlerIII

A Python 3 crawler for downloading the Pixiv top rankings and all artworks of any illustrator
MIT License

Python 3.6

PixivCrawlerIII - A Pixiv website crawler written in Python 3

██████╗ ██╗██╗  ██╗██╗██╗   ██╗ ██████╗██████╗  █████╗ ██╗    ██╗██╗     ███████╗██████╗ ██╗██╗██╗
██╔══██╗██║╚██╗██╔╝██║██║   ██║██╔════╝██╔══██╗██╔══██╗██║    ██║██║     ██╔════╝██╔══██╗██║██║██║
██████╔╝██║ ╚███╔╝ ██║██║   ██║██║     ██████╔╝███████║██║ █╗ ██║██║     █████╗  ██████╔╝██║██║██║
██╔═══╝ ██║ ██╔██╗ ██║╚██╗ ██╔╝██║     ██╔══██╗██╔══██║██║███╗██║██║     ██╔══╝  ██╔══██╗██║██║██║
██║     ██║██╔╝ ██╗██║ ╚████╔╝ ╚██████╗██║  ██║██║  ██║╚███╔███╔╝███████╗███████╗██║  ██║██║██║██║
╚═╝     ╚═╝╚═╝  ╚═╝╚═╝  ╚═══╝   ╚═════╝╚═╝  ╚═╝╚═╝  ╚═╝ ╚══╝╚══╝ ╚══════╝╚══════╝╚═╝  ╚═╝╚═╝╚═╝╚═╝

ASCII art generated at http://patorjk.com/software/taag/ (font: ANSI Shadow)

LICENSE

Copyright(C) 2017-2020 T.WKVER | </MATRIX>. All rights reserved.
Code by </MATRIX>@Neod Anderjon(LeaderN)
MIT license; see LICENSE for the full text.
Thanks for checking out my project.
If you want to help me improve this project, please submit an issue or fork it.

CHANGELOG

2020/06/07
Version: 3.3.3
Selenium obtains the Pixiv homepage cookie successfully,
but the login sent back to the server is still rejected as invalid. Not resolved yet.

2020/02/03
Version: 3.3.2
Fixed a bug from last month's commit.
Refactor main logic.
Server IRA mode: add multi-ID input.
Add class declarations for mode option class init.
Add R18G ranking to RTN mode.
Spec file update.

2020/01/20
Version: 3.2.4
Fixed custom label bug.
Refactor mode option structure.
Refactor wkv crawler api.

2020/01/19
Version: 3.2.3
Remove invalid proxy server website method.
Add emoji module to process unicode 'U+' emoji.

2020/01/18
Version: 3.2.2
Total refactor.
Optimize code structure.

PLATFORM

Linux x86_64 and Windows NT (tested on Ubuntu 16.04 x64 and Windows 10 x64 1803)
Python: 3.x (2.x is not supported); 3.5+ is recommended (tested on 3.6 and 3.7)

REQUIREMENTS

RUN

last python2 version: (very old version, maintenance has been discontinued)

PROBLEMS THAT MAY ARISE

May a good network connection be with you.

To make sure the output displays correctly,
set the console encoding to UTF-8.
On Windows, run the command "chcp 65001".
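
The UTF-8 advice above can be checked from Python before the crawler prints anything. This is only a sketch: the helper names is_utf8 and warn_if_not_utf8 are hypothetical and are not part of this project.

```python
import codecs
import sys


def is_utf8(encoding_name):
    """Return True if the given encoding name resolves to the UTF-8 codec."""
    if not encoding_name:
        return False
    try:
        # codecs.lookup normalizes aliases such as "UTF8" or "utf_8".
        return codecs.lookup(encoding_name).name == "utf-8"
    except LookupError:
        return False


def warn_if_not_utf8(stream=sys.stdout):
    """Print a hint to stderr when the stream does not report UTF-8."""
    encoding = getattr(stream, "encoding", None)
    if is_utf8(encoding):
        return True
    print(
        "Console encoding is %r; on Windows run 'chcp 65001' first." % encoding,
        file=sys.stderr,
    )
    return False
```

Calling warn_if_not_utf8() once at startup is enough; it only reports the problem and does not change the console settings.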

If the crawler requests data from the server too frequently,
the server may return a 10060 error.
Wait a while and try again, or use a proxy server.
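
The "wait a while and try again" advice can be automated with a small retry helper. The sketch below is an assumption about how one might do it, not code from this project: retry_with_backoff is a hypothetical name, and it assumes the timeout surfaces in Python as an OSError subclass (such as TimeoutError).

```python
import time


def retry_with_backoff(func, retries=3, base_delay=1.0,
                       exceptions=(OSError,), sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt and try again.

    Re-raises the last error once all attempts have failed.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return func()
        except exceptions as error:
            last_error = error
            # Do not sleep after the final failed attempt.
            if attempt < retries - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

A download call would be wrapped as retry_with_backoff(lambda: fetch(url)), where fetch stands for whatever request function is in use. The sleep parameter exists only so the delay can be replaced in tests.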

If DNS in your test network environment has been polluted, I suggest you
point your PC at a clean DNS server or use a proxy server.

Version 2.7.8 is the last batch download solution
that loads the main page from the Pixiv website's old static HTML.
Since October 2, 2018,
Pixiv has loaded the artist's home page information dynamically with JavaScript.
On October 4, 2018, in response to this major October 2 website change,
version 2.8.2 was fully optimized and upgraded,
and the original two download modes were restored.
At the same time, one request for downloading was suspended after one login.

If you want to optimize CPU and memory usage, you can use the cProfile tool to
profile the code and the gc module to collect garbage.
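
As a rough illustration of that suggestion, the sketch below profiles one function call with cProfile and then forces a garbage collection pass with gc. The function build_list is a stand-in workload and profile_call is a hypothetical helper; neither comes from this project.

```python
import cProfile
import gc
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the 10 entries with the highest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


def build_list(n):
    """Stand-in workload: build a list of squares."""
    return [i * i for i in range(n)]


result, report = profile_call(build_list, 1000)

# gc.collect() runs a full collection and returns the number of
# unreachable objects it found.
unreachable = gc.collect()
```

Printing report shows which calls dominate the run time. Calling gc.collect() after a large batch of downloads is optional, since Python normally collects garbage on its own.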

Since January 2020, this project has used the selenium module and chromedriver to obtain cookies,
which works around the reCAPTCHA authentication on the Pixiv website.
You need to install and configure chromedriver for your system following the
Selenium tutorial (https://selenium-python.readthedocs.io/index.html)
and then set its path in dataload.py (chrome_user_data_dir).
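
Once the browser has the cookies, they still have to be handed to the HTTP requests the crawler makes. Selenium's driver.get_cookies() returns a list of dicts with "name" and "value" keys; the sketch below turns such a list into a Cookie header string. The helper cookies_to_header and the sample cookie names are placeholders for illustration and may not match how this project passes cookies.

```python
def cookies_to_header(cookies):
    """Join Selenium-style cookie dicts into a single Cookie header value."""
    pairs = []
    for cookie in cookies:
        pairs.append("%s=%s" % (cookie["name"], cookie["value"]))
    return "; ".join(pairs)


# Placeholder data shaped like the output of driver.get_cookies();
# in real use the list would come from a logged-in browser session.
sample_cookies = [
    {"name": "session_id", "value": "example-session", "domain": ".pixiv.net"},
    {"name": "device_token", "value": "example-token", "domain": ".pixiv.net"},
]

header_value = cookies_to_header(sample_cookies)
```

The resulting string can be sent as the value of the "Cookie" request header.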

If you update Chrome in your environment, please update chromedriver to the matching version
from http://chromedriver.storage.googleapis.com/index.html
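
A quick way to spot a mismatch is to compare the two version strings. The sketch below assumes that Chrome and chromedriver count as matching when their first three version components agree; that rule, the helper name same_build, and the sample version strings are assumptions for illustration.

```python
def same_build(chrome_version, driver_version, parts=3):
    """Return True if the first `parts` dotted components are identical."""
    chrome = chrome_version.strip().split(".")[:parts]
    driver = driver_version.strip().split(".")[:parts]
    # Require that both strings actually have enough components.
    return len(chrome) == parts and chrome == driver
```

For example, same_build("83.0.4103.61", "83.0.4103.39") compares "83.0.4103" on both sides, while parts=1 would compare only the major version.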