Open biggestsonicfan opened 2 years ago
Oh it looks like this has been a known issue for years now. I, uh, guess I might try it myself on a backup database.
I can't actually remember where I saw the issue documented, but since I have been using the same PixivUtil2 database ever since the data log system was introduced almost a decade ago, I can only imagine how many thousands of images I might be missing which may no longer be live on the site. I'm going to go ahead and reopen the issue.
So I'm scanning my database now, and there were a few leftovers from my incident with #861, #889 and #1033: I renamed all `save_name` entries in `pixiv_master_image` but not `pixiv_manga_image`, and am running into considerable trouble for it.
While I'm not doing a full "audit" yet (per se), I have just come across 2 missing images that were recorded in the sqlite database, but the files themselves were not downloaded (or somehow went missing). I guess as a suggestion you could implement `Clean Database (Fast)` and `Clean Database (Full)` as options, where the `Full` option would also look at the `is_manga` value in the `pixiv_master_image` table, and if the value is `manga`, pass the `image_id` from `pixiv_master_image` to check the `save_name` of the `pixiv_manga_image` entries for that `image_id` (something like `SELECT save_name FROM pixiv_manga_image WHERE image_id="the_image_id"`). Perhaps it could also ensure that the logged sequence is correct: starting from count `0` up to whatever manga page number was saved for the `pixiv_master_image` entry, ensure no pages were skipped, since checking that the files exist isn't enough if a page was somehow skipped and never logged in the database.
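The `Full` check suggested above could look roughly like the following. This is only a sketch, not PixivUtil2's actual cleanup code: the function name `audit_manga_pages` is hypothetical, and it assumes `pixiv_manga_image` has an integer `page` column numbered from 0 alongside `image_id` and `save_name`.

```python
import os
import sqlite3


def audit_manga_pages(conn, root_dir=""):
    """Return (image_id, page, problem) tuples for works marked as manga.

    Assumed schema (an assumption, not confirmed in this thread):
      pixiv_master_image(image_id, is_manga, ...)
      pixiv_manga_image(image_id, page, save_name)
    """
    problems = []
    masters = conn.execute(
        "SELECT image_id FROM pixiv_master_image WHERE is_manga = 'manga'"
    ).fetchall()
    for (image_id,) in masters:
        rows = conn.execute(
            "SELECT page, save_name FROM pixiv_manga_image "
            "WHERE image_id = ? ORDER BY page",
            (image_id,),
        ).fetchall()
        if not rows:
            problems.append((image_id, None, "no page rows"))
            continue
        # Pages are assumed to run 0..max with no gaps.
        pages = {page for page, _ in rows if page is not None}
        if pages:
            for missing in sorted(set(range(max(pages) + 1)) - pages):
                problems.append((image_id, missing, "page not logged"))
        for page, save_name in rows:
            if save_name is None:
                problems.append((image_id, page, "save_name is NULL"))
            elif not os.path.isfile(os.path.join(root_dir, save_name)):
                problems.append((image_id, page, "file missing"))
    return problems
```

Note that this only finds gaps below the highest logged page. Detecting missing trailing pages would need the expected page count from somewhere else, such as the `pixiv_master_image` entry.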
I've now come across one `member_id` which has entries for `image_id` values that are not only not theirs, but entire 25-page manga entries have been downloaded from members I have never downloaded from before...
My worst encounter yet: A specific `pixiv_manga_image` entry has a `save_name` record of `NULL`. The manga is only two pages, and the `pixiv_master_image` file exists as well as its `pixiv_manga_image`, but the first page of the `manga` is, as I said, null.
Ran into my worst case scenario. A manga image is missing and the artist has deleted the work.
Most likely I won't do anything for this, as I don't really use the db cleanup (I'll just delete the old one and rerun it again), but I'm accepting Pull Requests.
Alright. I'm doing a full pixiv archive relocation and am finding even more. I will see what I can implement after the move.
I didn't realize the `ugoira_view` type for `is_manga` also has an entry in `pixiv_manga_image`. It appears the `.zip` downloaded and the `.ugoira` files are identical, in addition to the converted `.webm` if it exists...
Ran into an extremely bizarre situation. My `save_name` entries were previously `%artist% (%member_id%)`, however a `manga_type` `big` has two separate `%artist%` entries for the `save_name` between the `pixiv_master_image` and `pixiv_manga_image` tables...
EDIT: Actually I believe this might have been my own doing
Wow, I'm not sure exactly when this happened, but an artist which has about 188 images in their folder somehow got assigned as the `save_name` path in the `pixiv_manga_image` table to 20,000+ images!
I have now run into my first instance where a multi-page manga entry is in `pixiv_manga_image` but has no entry in `pixiv_master_image`. And that's because _no entries match for that user's `member_id` in my database_... yet somehow the manga got added to the table? But the info regarding the manga entries, the `image_id` and the title (used in the filename), are correct?!
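Orphaned rows like these can be listed with a join between the two tables. The following is a rough sketch rather than anything PixivUtil2 ships: `find_orphaned_manga` is a hypothetical helper, and it assumes only that both tables carry an `image_id` column.

```python
import sqlite3

# pixiv_manga_image rows whose image_id has no pixiv_master_image record.
ORPHAN_QUERY = """
SELECT m.image_id, COUNT(*) AS page_rows
FROM pixiv_manga_image AS m
LEFT JOIN pixiv_master_image AS p ON p.image_id = m.image_id
WHERE p.image_id IS NULL
GROUP BY m.image_id
ORDER BY m.image_id
"""


def find_orphaned_manga(conn):
    """Return (image_id, page_rows) for manga rows with no master entry."""
    return conn.execute(ORPHAN_QUERY).fetchall()
```

Running it against a copy of the database first is the safer option, since it only reads but the follow-up fixes probably will not.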
Alright, the last two comments are related, apparently. A database clean must have purged all the `pixiv_master_image` records because they weren't at their proper locations, but the files were being downloaded to that wrong directory and were recorded in the `pixiv_manga_image` table. This is going to be quite the cluster to fix, holy...
EDIT: Sadly I don't have a backup of the database where the `pixiv_master_image` record was kept.
My disappointment and frustration have reached a climax at this point. For nearly a decade, I've been using PixivUtil2 daily to create a massive archive, relying heavily on the `Clean Up Database` function to clean up mistakes, only to find the `pixiv_manga_image` table is never touched.
At this point I can only imagine how tainted everyone's `pixiv_manga_image` tables really are, and if so, there's almost no point in trying to clean them up if I'm discovering mine to be this chaotic. There just isn't enough information in the `pixiv_manga_image` table to restore an entry in the `pixiv_master_image` table. The lack of `member_id` in that table is really hurting right now.
I'm going to be using another api to attempt to structure and organize some of the chaos, but I am fairly certain at this point the database is completely unreliable.
So I have a working theory that the reason I have 20K+ images in one folder and a massive cluster in my database is that I may have accidentally downloaded the bookmarks of an artist instead of downloading the artist's works.
The issue may be related to #895, as I've never had `autoAddMember` set to `True`.
After some strange sleuthing, it turns out the alleged bookmark download did enter the correct folder locations for pixiv illustrations in the database. I found an ID that has been deleted from pixiv in the `pixiv_manga_image` table listed under a different folder! It just saved the file to the original artist that the bookmark search had originated from. I wonder if that's still the case...
EDIT: This ended up only being the case for a single pixiv illustration id, not all of them...
After some rigorous processing of data, I was able to identify 3772 unique illustration IDs among the 20,904 files in the bookmark dump folder. From those IDs I managed to identify 2049 different pixiv members. 116 of the 3772 illustration IDs were no longer live on pixiv and don't have an associated member ID.
I have a 20 Gig folder of what I call "pixiv surplus" as one day the program started downloading arbitrary ids in the middle of a "Download by member_id" batch. I fear what effort sorting that folder might take.
I've rewritten my audit script to be much faster, as at the rate it was going it would have taken about 5 weeks to complete. In doing so, I've made a shocking revelation... I've been using PixivUtil2 as an archive tool for so long, I have images archived by the utility that predate the database's inclusion in the software.
I'm going to have to ponder on how to fix this...
Found a file I seemingly couldn't identify that was in the correct artist folder, but it didn't follow any naming convention I've seen from PixivUtil2 before. A reverse image search led me to the artist's twitter account, where they linked to a Pixiv Sketch url.
I've also just discovered, when I plowed through the options, that `Include Pixiv Sketch` defaults to no. I'm absolutely furious. Years of archiving wasted if all those images did not get saved and are probably deleted now. I can't believe this...
Are Pixiv Sketches omitted from the database?!
Is there no default option in the config to always download sketches?!
Coming back to this after a 10 month cooldown. I haven't run a database cleanup yet because I haven't fully sorted my folders yet. There's still a large "incomplete" folder I'm trying to tackle. I've created a plaintext list of "missing" files in the database and am now trying to restore as many of them as I can from files that still exist in the "incomplete" folder.
A strange trend I've noticed with files that weren't properly moved in the first runthrough is that these particular pixiv image ids don't have entries in `pixiv_manga_image`, which is what I used in my initial move. As I understand it, an image id should have at minimum two entries: one in `pixiv_master_image` and one in `pixiv_manga_image`. Using the entries from `pixiv_master_image` which still exist, I can create new entries in `pixiv_manga_image`, so that's not really an issue.
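Finding the image ids that need new `pixiv_manga_image` entries could be done with a query along these lines. This is a sketch under assumptions: `masters_without_pages` is a hypothetical name, and the only columns relied on are `image_id` in both tables and `save_name` in `pixiv_master_image`.

```python
import sqlite3


def masters_without_pages(conn):
    """Return (image_id, save_name) for master rows with no manga page rows."""
    return conn.execute(
        "SELECT p.image_id, p.save_name "
        "FROM pixiv_master_image AS p "
        "WHERE NOT EXISTS ("
        "  SELECT 1 FROM pixiv_manga_image AS m WHERE m.image_id = p.image_id"
        ") "
        "ORDER BY p.image_id"
    ).fetchall()
```

The returned `save_name` values give a starting point for rebuilding the page rows, though how many pages each work should have still has to be worked out separately.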
The most concerning thing is that when validating that all the pages are there (`pixiv_master_image` always uses the last page in the `save_name`), I've found 26 manga sets which have at least one missing image in the list.
Update: The files from these sets appear to have been moved to the new location already, huh...
In the future, I might try to see if I can figure out a way that after a PixivUtil2 download option is run, it pulls from a list of stored failed downloads and gives them one last retry. I feel like this would be useful for "hiccups" that go unnoticed by batch downloads.
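A "one last retry" pass could be as simple as collecting the failures during the batch and running them again at the end. The sketch below is purely illustrative: `download_with_final_retry` and the `download` callable are hypothetical and do not correspond to any existing PixivUtil2 function.

```python
def download_with_final_retry(image_ids, download):
    """Run download(image_id) for each id, then retry the failures once.

    `download` is any callable that raises an exception on failure.
    Returns the ids that still failed after the final retry.
    """
    failed = []
    for image_id in image_ids:
        try:
            download(image_id)
        except Exception:
            # Remember the failure and keep the batch going.
            failed.append(image_id)

    still_failed = []
    for image_id in failed:
        try:
            download(image_id)
        except Exception:
            still_failed.append(image_id)
    return still_failed
```

Whatever comes back from the function is the list worth writing to a log or file, so that hiccups in a large batch don't go unnoticed.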
Still sorting, and it looks like 49 folders (pixiv member ids) have no database entries. I am scanning my backup databases now.
EDIT: Okay, some of these just contain a "folder" image, which is the profile picture of the pixiv user, this could mean they never submitted any works in the first place, which is fine, not sure how they would have gotten included in my downloads, but sure.
Prerequisites
Description
At one point while downloading a particular user, an image did not download properly and was corrupted. I sent that image to Trash and let PixivUtil2 do its database cleanup. It did not report that the image was missing. I thought this was curious, so I verified the image was no longer in the folder and, using an SQL browser, verified the page's entry was still there.
Steps to Reproduce
`image_id` 89522121 (page 26 is what I deleted and need to replace)
`Clean Up Database`
Expected behavior: PixivUtil2's database cleaner should recognize when pages are missing from manga downloads and remove them from the database.
Actual behavior: PixivUtil2 doesn't seem to scan the individual pages of mangas?
Log file: pixivutil.log
Version
PixivDownloader2 version 20211104