Nandaka / PixivUtil2

Download images from Pixiv and more!
http://nandaka.devnull.zone/
BSD 2-Clause "Simplified" License

1000 Page Limit #85

Closed danthonywalker closed 9 years ago

danthonywalker commented 9 years ago

I realize it's a Pixiv limit that search results only go up to 1000 pages. The comment here: https://nandaka.wordpress.com/2012/01/13/pixiv-downloader-20120114/#comment-1385 describes a sort of workaround, so my suggestion is that when the script nears the 1000-page limit, it grabs the date from one of the images and, on reaching the 1000-page mark, adds the &ecd=yyyy-mm-dd parameter to the search URL.

I know Java, but not Python, so I'm not entirely sure how to go about changing that, nor do I know whether it would work.

Nandaka commented 9 years ago

Nope, it doesn't work. I think they have already blocked it.

And isn't ecd the end of the date range? Do you mean scd? That doesn't work either. I think you need Pixiv premium for it.

danthonywalker commented 9 years ago

The example URL seems to be working for me (I don't have Pixiv premium). However, when moving forward through pages, the page number has to be added at the end of the URL; otherwise it still seems to be working as intended: &order=date_d&ecd=2011-02-21 vs. &order=date_d&ecd=2011-02-21&p=2 and so on.

Trying it with other tags and adding that to the end of the URL also works

Edit: It doesn't appear to work for tags that have no content before the given date. For example, it fails for the tag 艦これ, but changing the year to 2015 made it work. I assume it failed because there was no content before 2011-02-21, which is entirely reasonable for that tag.
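To make the URL shape discussed above concrete, here is a minimal Python 3 sketch of how such a search URL could be assembled. The function name `build_search_url` and its parameters are hypothetical and are not taken from PixivUtil2; only the `word`, `order`, `ecd` and `p` query parameters come from the URLs quoted in this thread.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.pixiv.net/search.php"


def build_search_url(tag, page=1, ecd=None):
    """Build a newest-first tag search URL (hypothetical helper).

    ecd is an optional 'yyyy-mm-dd' end date; p is appended last,
    matching the parameter order used in the examples above.
    """
    params = [("word", tag), ("order", "date_d")]
    if ecd is not None:
        params.append(("ecd", ecd))
    params.append(("p", page))
    # urlencode percent-encodes the tag as UTF-8
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("艦これ", page=2, ecd="2015-05-26"))
```

Whether Pixiv still honors `ecd` for non-premium accounts is exactly what is being debated here, so treat this as an illustration of the URL only.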

Nandaka commented 9 years ago

I see. So if you search a tag in descending date order (newest first), then after it hits page 1000 you loop back to page 1 with ecd set to the date of the last image on page 1000.

For example: http://www.pixiv.net/search.php?word=%E8%89%A6%E3%81%93%E3%82%8C&order=date_d&p=1000 Last image is 5/26/2015 21:33

then you loop back to http://www.pixiv.net/search.php?word=%E8%89%A6%E3%81%93%E3%82%8C&order=date_d&p=1000&ecd=2015-05-26

until there are no more images.

I don't know if they will fix this or not, because the scd option is already limited to the last month only.
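The loop-back idea described above can be sketched roughly as follows in Python 3. This is not the code that ended up in PixivUtil2. `fetch_page` is a hypothetical callable standing in for the real page download and parsing, and the `(image_id, date)` tuple shape is an assumption made for illustration.

```python
MAX_PAGE = 1000  # Pixiv's search page limit, per this thread


def iter_all_images(fetch_page, max_page=MAX_PAGE):
    """Yield image IDs newest first, restarting at page 1 with a new ecd
    every time the page limit is reached.

    fetch_page(page, ecd) is a hypothetical callable returning a list of
    (image_id, 'yyyy-mm-dd') tuples, or an empty list when nothing is left.
    """
    ecd = None
    seen = set()
    while True:
        last_date = None
        new_in_pass = 0
        for page in range(1, max_page + 1):
            images = fetch_page(page, ecd)
            if not images:
                return  # no more images for this tag
            for image_id, date in images:
                last_date = date
                if image_id not in seen:
                    seen.add(image_id)
                    new_in_pass += 1
                    yield image_id
        # Page limit reached. Stop if the pass found nothing new, otherwise
        # the same ecd would be reused forever.
        if new_in_pass == 0 or last_date is None:
            return
        ecd = last_date
```

Because ecd is a date rather than a timestamp, the first pages after a loop-back repeat images from that day, which is why the sketch tracks already-seen IDs. The no-progress check is a design choice of this sketch: it gives up when a single day holds more results than the page limit can cover.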

danthonywalker commented 9 years ago

Because new content is constantly being uploaded, I think a safer bet would be to use the date of the last image on page 950 or so. It doesn't have to be exactly the last image, since the database automatically skips images that are already downloaded.
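A small Python 3 sketch of that safety margin, assuming the script records the date of the last image on each page as it goes. Both helper names and the dictionary shape are hypothetical. The value 950 is only the figure suggested above.

```python
def choose_ecd(last_date_by_page, safety_page=950):
    """Pick the cutoff date from safety_page, or the nearest earlier page.

    last_date_by_page maps page number -> 'yyyy-mm-dd' of the last image
    seen on that page (hypothetical bookkeeping).
    """
    pages = [p for p in last_date_by_page if p <= safety_page]
    return last_date_by_page[max(pages)] if pages else None


def skip_downloaded(image_ids, downloaded_ids):
    """Drop IDs that are already in the local database."""
    return [i for i in image_ids if i not in downloaded_ids]
```

Choosing an earlier page re-fetches some pages after the loop-back, and skipping already-downloaded IDs keeps that overlap cheap.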

But yeah, that's the idea.

Nandaka commented 9 years ago

Try this one: http://www.mediafire.com/download/9ko997sbyssyc9a/pixivutil20150705-beta1.7z

you need to set enableInfiniteLoop = True in config.ini.
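For reference, a boolean option like this can be read with Python 3's standard `configparser`. The section name `[Settings]` and the helper name are assumptions for illustration. The thread does not say which section of config.ini the option lives in.

```python
import configparser

# Hypothetical fragment; the real config.ini section name may differ.
SAMPLE = """
[Settings]
enableInfiniteLoop = True
"""


def infinite_loop_enabled(config_text, section="Settings"):
    """Return the flag's value, defaulting to False when it is absent."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Option names are case-insensitive in configparser by default.
    return parser.getboolean(section, "enableInfiniteLoop", fallback=False)


print(infinite_loop_enabled(SAMPLE))
```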

danthonywalker commented 9 years ago

2015-07-04 12:40:06,844 - PixivUtil20150705-beta1 - INFO - Looping... for http://www.pixiv.net/search.php?s_mode=s_tag_full&word=%E6%9D%B1%E6%96%B9&p=1000&order=date_d
2015-07-04 12:40:08,635 - PixivUtil20150705-beta1 - INFO - Last page: 1000
2015-07-04 12:40:08,638 - PixivUtil20150705-beta1 - INFO - Hit page 1000, looping back to page 1 with ecd: None
2015-07-04 12:40:08,640 - PixivUtil20150705-beta1 - ERROR - Error at process_tags(): (<type 'exceptions.TypeError'>, TypeError("cannot concatenate 'str' and 'NoneType' objects",), <traceback object at 0x02E2FF80>)
Traceback (most recent call last):
  File "PixivUtil2.py", line 815, in process_tags
TypeError: cannot concatenate 'str' and 'NoneType' objects
2015-07-04 12:40:08,641 - PixivUtil20150705-beta1 - ERROR - Cannot dump page for search tags:東方
2015-07-04 12:40:08,644 - PixivUtil20150705-beta1 - ERROR - Unknown Error: global name 'ex' is not defined
Traceback (most recent call last):
  File "PixivUtil2.py", line 1775, in main
  File "PixivUtil2.py", line 1572, in main_loop
  File "PixivUtil2.py", line 1430, in menu_download_from_tags_list
  File "PixivUtil2.py", line 846, in process_tags_list
NameError: global name 'ex' is not defined

Edit: I assume the source of this error might be in the fact I already have the last many hundred pages of the tag already downloaded (because I assume you check the ID before anything else). So because it just skips, it never grabs the date to add to the URL.

That's just an assumption though, but programmatically I can see it happening.
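If that guess is right, the general shape of a fix would be to record the date before the skip check and to avoid concatenating a missing date into the URL. The Python 3 sketch below illustrates that idea only. It is not the change shipped in any beta, and the function names and the `(image_id, date)` tuple shape are hypothetical.

```python
def scan_page(images, downloaded_ids):
    """Return (new_ids, last_date) for one page of search results.

    images is a list of (image_id, 'yyyy-mm-dd') tuples (assumed shape).
    """
    new_ids = []
    last_date = None
    for image_id, date in images:
        last_date = date  # record the date before deciding to skip
        if image_id in downloaded_ids:
            continue
        new_ids.append(image_id)
    return new_ids, last_date


def with_ecd(url, ecd):
    """Append &ecd=... only when a date was actually captured."""
    if ecd is None:
        return url
    return url + "&ecd=" + ecd
```

With this shape, a page whose images are all already downloaded still yields a usable date, and a missing date leaves the URL unchanged instead of raising a TypeError.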

Nandaka commented 9 years ago

try: http://www.mediafire.com/download/njy1pxz0qaccic5/pixivutil20150705-beta2.7z

danthonywalker commented 9 years ago

Seems like it's working. The dates the script is grabbing are past those on page 1000. I'll let it run for a few days and see what happens.

danthonywalker commented 9 years ago

So it initially worked when I started it at page 999. However, if I start it from the very beginning it doesn't work.

PixivUtil20150705-beta1 - INFO - Looping... for http://www.pixiv.net/search.php?s_mode=s_tag_full&word=%E6%9D%B1%E6%96%B9&p=1000&order=date_d
2015-07-04 16:19:15,891 - PixivUtil20150705-beta1 - INFO - Last page: 1000
2015-07-04 16:19:15,895 - PixivUtil20150705-beta1 - INFO - Searching for: (東方project) %E6%9D%B1%E6%96%B9project
2015-07-04 16:19:15,897 - PixivUtil20150705-beta1 - INFO - Looping... for http://www.pixiv.net/search.php?s_mode=s_tag_full&word=%E6%9D%B1%E6%96%B9project&p=1&order=date_d

I didn't change anything in the configs. The only change was where it started (page 1 vs. page 999).

Nandaka commented 9 years ago

Your log file shows PixivUtil20150705-beta1 running. Make sure you are using the correct version, and remember to set enableInfiniteLoop = True in config.ini.

Looking at your previous log file:

PixivUtil20150705-beta2 - INFO - Looping... for http://www.pixiv.net/search.php?s_mode=s_tag_full&word=%E6%9D%B1%E6%96%B9project&p=419&order=date_d
2015-07-05 07:38:13,888 - PixivUtil20150705-beta2 - INFO - Looping... for http://www.pixiv.net/search.php?s_mode=s_tag_full&word=%E6%9D%B1%E6%96%B9project&p=420&order=date_d
2015-07-05 07:38:15,325 - PixivUtil20150705-beta2 - INFO - Looping... for http://www.pixiv.net/search.php?s_mode=s_tag_full&word=%E6%9D%B1%E6%96%B9project&p=421&order=date_d
2015-07-05 07:38:16,859 - PixivUtil20150705-beta2 - INFO - No more image in the list.
2015-07-05 07:38:16,869 - PixivUtil20150705-beta2 - INFO - Searching for: (touhou) touhou
2015-07-05 07:38:16,869 - PixivUtil20150705-beta2 - INFO - Looping... for http://www.pixiv.net/search.php?s_mode=s_tag_full&word=touhou&p=1&order=date_d

Most likely Pixiv didn't return the images, or returned something else.

Maybe I'll make a debug option to dump the html page to file so I can check the actual page being returned.

danthonywalker commented 9 years ago

It's working, just a little glitchy. I tried it again, and the same thing happened. When starting from page 1 it doesn't loop back, but starting at 999 works fine (however, it appears that it jumped to 2014 just randomly).

I'm not sure of the cause, nor where to find it in the debug logs. But in terms of functionality, it's definitely functioning.

Nandaka commented 9 years ago

Try: http://www.mediafire.com/download/b04irzy1ij2zgmq/pixivutil20150706-beta3.7z

I did a major reorganization of config.ini, so you may need to set the values again after running it (remember to back up your file).

Set enabledump = True and dumptagsearchpage = True so that it saves the tag search page and I can check what page Pixiv is actually returning.
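As a rough Python 3 illustration of what such a dump option might do, the sketch below writes the fetched HTML to a file named after the URL. The character substitution is an assumption modeled on the dump file names quoted later in this thread (for example http___www.pixiv.net_search.php_s_mode=...). Both function names and the `.html` suffix are hypothetical.

```python
import re
from pathlib import Path


def dump_filename(url):
    """Replace characters that are unsafe in file names with '_'."""
    return re.sub(r'[:/?\\*"<>|]', "_", url)


def dump_page(url, html, directory="."):
    """Write the fetched HTML to <directory>/<sanitized url>.html."""
    path = Path(directory) / (dump_filename(url) + ".html")
    path.write_text(html, encoding="utf-8")
    return path


print(dump_filename(
    "http://www.pixiv.net/search.php?s_mode=s_tag_full&word=touhou&p=1&order=date_d"
))
```

Keeping the full query string in the file name makes it easy to see which page and which ecd value a given dump belongs to.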

danthonywalker commented 9 years ago

It appears to have looped back on the date of my system (only explanation I can think of). http___www.pixiv.net_search.php_s_mode=s_tag_full&word=%E6%9D%B1%E6%96%B9&p=1000&order=date_d That was the final entry for the tag's page 1000.

Here is what page 1000 for that tag is. http://www.pixiv.net/search.php?word=%E6%9D%B1%E6%96%B9&s_mode=s_tag_full&order=date_d&p=1000 As it is currently, the images on this page are from 5/23/15.

Here is what it looped on when it went back. http___www.pixiv.net_search.php_s_mode=s_tag_full&word=%E6%9D%B1%E6%96%B9&p=1&ecd=2015-07-06&order=date_d Which is exactly my system's date (at least it's not Pixiv's date, which at the current time of posting is 7/7/2015 for them (and I can confirm that because it started downloading images using the 7/7/2015 date when I initially started the script)).

After reaching page 1000 for that date. http___www.pixiv.net_search.php_s_mode=s_tag_full&word=%E6%9D%B1%E6%96%B9&p=1000&ecd=2015-07-06&order=date_d It just looped back again on the same date (my system date).

This is me just starting at page 1.

So I started it at page 999 again. http___www.pixiv.net_search.php_s_mode=s_tag_full&word=%E6%9D%B1%E6%96%B9&p=999&order=date_d Here's the link for page 1000. http___www.pixiv.net_search.php_s_mode=s_tag_full&word=%E6%9D%B1%E6%96%B9&p=1000&order=date_d

And now it's working fine. http___www.pixiv.net_search.php_s_mode=s_tag_full&word=%E6%9D%B1%E6%96%B9&p=1&ecd=2015-05-23&order=date_d

I haven't tried downloading starting at page 2. I wonder if the default is causing problems? My friend is trying out this script and says that starting at page 998 doesn't fail for him either.

Nandaka commented 9 years ago

Set enabledump = True and dumptagsearchpage = True in config.ini and upload the HTML page.

danthonywalker commented 9 years ago

https://goo.gl/ExRdvW

Nandaka commented 9 years ago

Try: http://www.mediafire.com/download/4nsu92bhr5ermwc/pixivutil20150707-beta4.7z

I changed the way it gets the date.

danthonywalker commented 9 years ago

It's working perfectly. Started at page 1 and it looped to page 1000, got the correct date and looped back. After letting that go through another 1000 pages it has looped back again on the correct date. So I am confident it is now working.