Nandaka / PixivUtil2

Download images from Pixiv and more!
http://nandaka.devnull.zone/
BSD 2-Clause "Simplified" License

Database Cleanup Not Cleaning Removed Manga Page from Database #1056

Open biggestsonicfan opened 2 years ago

biggestsonicfan commented 2 years ago

Prerequisites

Description

At one point while downloading a particular user's works, an image did not download properly and was corrupted. I sent that image to Trash and let PixivUtil2 do its database cleanup. It did not report the image as missing. I thought this was curious, so I verified that the image was no longer in the folder and, using an SQL browser, verified that the page's entry was still in the database.

Steps to Reproduce

  1. Any manga-format pixiv work should behave similarly, but to match my case, download by image_id 89522121 (page 26 is the one I deleted and need to replace)
  2. Delete a single page within the manga.
  3. Run Clean Up Database
  4. PixivUtil2 reports no images missing.

Expected behavior: PixivUtil2's database cleaner should recognize when pages are missing from manga downloads and remove them from the database.

Actual behavior: PixivUtil2 doesn't seem to scan the individual pages of mangas?

Log file: pixivutil.log

Version

PixivDownloader2 version 20211104

biggestsonicfan commented 2 years ago

Oh it looks like this has been a known issue for years now. I, uh, guess I might try it myself on a backup database.

biggestsonicfan commented 2 years ago

I can't actually remember where I saw the issue documented, but since my PixivUtil2 database is the same one I have used since the data log system was introduced almost a decade ago, I can only imagine how many thousands of images I might be missing which may no longer be live on the site. I'm going to go ahead and reopen the issue.

biggestsonicfan commented 2 years ago

So I'm scanning my database now, and there were a few leftovers from my incidents with #861, #889, and #1033. I renamed all save_name entries in pixiv_master_image but not in pixiv_manga_image, and I am running into considerable trouble because of it.

biggestsonicfan commented 2 years ago

While I'm not doing a full "audit" yet (per se), I have just come across 2 missing images that were recorded in the sqlite database, but the files themselves were not downloaded (or somehow went missing). As a suggestion, you could implement Clean Database (Fast) and Clean Database (Full) as options, where the Full option would also look at is_manga in the pixiv_master_image table and, if the value is manga, pass the image_id from pixiv_master_image to check the save_name of the pixiv_manga_image entries for that image_id (something like SELECT save_name FROM pixiv_manga_image WHERE image_id="the_image_id"). It could perhaps also verify that the logged sequence is correct: starting from count 0 up to whatever manga page number was saved for the pixiv_master_image entry, ensure no pages were skipped, since checking that the files exist isn't enough if a page was somehow skipped and never logged in the database.
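A minimal sketch of the "Full" check suggested above, not PixivUtil2's actual cleanup code. It assumes pixiv_manga_image has the columns image_id, page, and save_name (the `page` column name is an assumption; the thread only names image_id and save_name), and `find_manga_problems` is a hypothetical helper. For simplicity it scans pixiv_manga_image directly instead of going through is_manga in pixiv_master_image.

```python
import os
import sqlite3


def find_manga_problems(conn):
    """Return (missing_files, page_gaps) for rows in pixiv_manga_image.

    Assumed schema: pixiv_manga_image(image_id, page, save_name).
    """
    missing_files = []  # (image_id, page, save_name) rows whose file is absent
    pages_seen = {}     # image_id -> set of logged page numbers

    rows = conn.execute(
        "SELECT image_id, page, save_name FROM pixiv_manga_image "
        "ORDER BY image_id, page"
    )
    for image_id, page, save_name in rows:
        pages_seen.setdefault(image_id, set()).add(page)
        # A NULL save_name is treated the same as a missing file.
        if save_name is None or not os.path.isfile(save_name):
            missing_files.append((image_id, page, save_name))

    # Pages are assumed to be numbered from 0; report any number that was
    # skipped below the highest page logged for each image_id.
    page_gaps = {}
    for image_id, pages in pages_seen.items():
        skipped = sorted(set(range(max(pages) + 1)) - pages)
        if skipped:
            page_gaps[image_id] = skipped
    return missing_files, page_gaps
```

The gap check can only see holes below the highest logged page. Detecting missing trailing pages would need the real page count from pixiv_master_image or from pixiv itself.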

biggestsonicfan commented 2 years ago

I've now come across one member_id which has image_id entries that are not theirs. Not only that, but entire 25-page manga entries have been downloaded from members I have never downloaded from before...

biggestsonicfan commented 2 years ago

My worst encounter yet: a specific pixiv_manga_image entry has a save_name record of NULL. The manga is only two pages, and the pixiv_master_image file exists as well as its pixiv_manga_image entry, but the first page of the manga is, as I said, NULL.

biggestsonicfan commented 2 years ago

Ran into my worst case scenario. A manga image is missing and the artist has deleted the work.

Nandaka commented 2 years ago

Most likely I won't do anything for this, as I don't really use the db cleanup (I'll just delete the old one and rerun it again), but I'm accepting Pull Requests.

biggestsonicfan commented 2 years ago

Alright. I'm doing a full pixiv archive relocation and am finding even more. I will see what I can implement after the move.

biggestsonicfan commented 2 years ago

I didn't realize the ugoira_view type for is_manga also has an entry in pixiv_manga_image. It appears the downloaded .zip and the .ugoira files are identical, in addition to the converted .webm if it exists...

biggestsonicfan commented 2 years ago

Ran into an extremely bizarre situation. My save_name entries were previously %artist% (%member_id%), however a manga_type big has two separate %artist% entries for the save_name between the pixiv_master_image and pixiv_manga_image tables...

EDIT: Actually I believe this might have been my own doing

biggestsonicfan commented 2 years ago

Wow, I'm not sure exactly when this happened, but an artist who has about 188 images in their folder somehow had that folder assigned as the save_name path for 20,000+ images in the pixiv_manga_image table!

biggestsonicfan commented 2 years ago

I have now run into my first instance where a multi-page manga entry is in pixiv_manga_image but has no entry in pixiv_master_image

And that's because _no entries match for that user's member_id in my database_... yet somehow the manga got added to the table? But the info regarding the manga entries, the image_id and the title (used in the filename) are correct?!
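A sketch of how such orphaned rows could be listed, assuming both tables share an image_id column as described in this thread. `find_orphan_manga_ids` is a hypothetical helper and not part of PixivUtil2.

```python
import sqlite3

# image_ids that have pages logged in pixiv_manga_image but no parent row
# in pixiv_master_image.
ORPHAN_QUERY = """
SELECT DISTINCT m.image_id
FROM pixiv_manga_image AS m
LEFT JOIN pixiv_master_image AS i ON i.image_id = m.image_id
WHERE i.image_id IS NULL
ORDER BY m.image_id
"""


def find_orphan_manga_ids(conn):
    """Return the sorted list of orphaned image_ids."""
    return [row[0] for row in conn.execute(ORPHAN_QUERY)]
```

Running this against a copy of the database first would be the safer choice, since the result is only as trustworthy as the two tables themselves.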

biggestsonicfan commented 2 years ago

Alright, the last two comments are related. Apparently, a database clean must have purged all the pixiv_master_image records because the files weren't at their proper locations, but the files were still being downloaded to that wrong directory and were recorded in the pixiv_manga_image table. This is going to be quite the cluster to fix holy...

EDIT: Sadly I don't have a backup of the database where the pixiv_master_image record was kept.

biggestsonicfan commented 2 years ago

My disappointment and frustration have reached a climax at this point. For nearly a decade, I've been using PixivUtil2 daily to create a massive archive, relying heavily on the Clean Up Database function to clean up mistakes only to find the pixiv_manga_image table is never touched.

At this point I can only imagine how tainted everyone's pixiv_manga_image tables really are, and if so, there's almost no point in trying to clean them up if I'm discovering mine to be this chaotic. There just isn't enough information in the pixiv_manga_image table to restore an entry in pixiv_master_image table. The lack of member_id in that table is really hurting right now.

I'm going to be using another api to attempt to structure and organize some of the chaos, but I am fairly certain at this point the database is completely unreliable.

biggestsonicfan commented 2 years ago

So I have a working theory that the reason I have 20K+ images in one folder and a massive cluster in my database is that I may have accidentally downloaded the bookmarks of an artist instead of downloading the artist's works.

The issue may be related to #895 as I've never had autoAddMember set to True.

biggestsonicfan commented 2 years ago

After some strange sleuthing, it turns out the alleged bookmark download did enter the correct folder locations for pixiv illustrations in the database. I found an ID that has been deleted from pixiv in the pixiv_manga_image table listed under a different folder! It just saved the file to the folder of the artist from whom the bookmark search had originated. I wonder if that's still the case...

EDIT: This ended up only being the case for a single pixiv illustration id, not all of them...

biggestsonicfan commented 2 years ago

After some rigorous processing of data, I was able to identify 3772 unique illustration IDs among the 20,904 files in the bookmark dump folder. Of those IDs, I managed to identify 2049 different pixiv members associated with them. 116 of the 3772 illustration IDs were no longer live on pixiv and don't have an associated member ID.

I have a 20 Gig folder of what I call "pixiv surplus" as one day the program started downloading arbitrary ids in the middle of a "Download by member_id" batch. I fear what effort sorting that folder might take.

biggestsonicfan commented 2 years ago

I've rewritten my audit script to be much faster, as at the rate it was going it would have taken about 5 weeks to complete. In doing so, I've made a shocking revelation... I've been using PixivUtil2 as an archive tool for so long, I have images archived by the utility that predate the database's inclusion in the software.

I'm going to have to ponder on how to fix this...

biggestsonicfan commented 2 years ago

Found a file I seemingly couldn't identify that was in the correct artist folder but didn't follow any naming convention I've seen from PixivUtil2 before. A reverse image search led me to the artist's twitter account, where they linked to a Pixiv Sketch url.

I've also just discovered when I plow through the options that Include Pixiv Sketch is defaulted to no. I'm absolutely furious. Years of archiving wasted if all those images did not get saved and are probably deleted now. I can't believe this...

Are Pixiv Sketches omitted from the database?!

Is there no default option in the config to always download sketches?!

biggestsonicfan commented 2 years ago

Coming back to this after a 10 month cooldown. I haven't run a database cleanup yet because I haven't fully sorted my folders yet. There's still a large "incomplete" folder I'm trying to tackle. I've created a plaintext list of "missing" files in the database and am now trying to restore as many of them as I can from files that still exist in the "incomplete" folder.

A strange trend I've noticed with files that weren't properly moved in the first runthrough is that these particular pixiv image ids don't have entries in pixiv_manga_image, which is what I used in my initial move. As I understand it, an image id should have at minimum two entries: one in pixiv_master_image and one in pixiv_manga_image. Using the entries from pixiv_master_image which still exist, I can create new entries in pixiv_manga_image so that's not really an issue.

The most concerning thing is that when validating that all the pages are there (`pixiv_master_image` always uses the last page in the `save_name`), I've found 26 manga sets which have at least one missing image in the list. Update: The files from these sets appear to have been moved to the new location already, huh...

In the future, I might try to see if I can figure out a way that after a PixivUtil2 download option is run, it pulls from a list of stored failed downloads and gives them one last retry. I feel like this would be useful for "hiccups" that go unnoticed by batch downloads.
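A rough sketch of that "one last retry" idea, under stated assumptions: `download` stands in for whatever per-image download routine is used (hypothetical here, not a PixivUtil2 function) and is expected to return a truthy value on success. Failures are kept in memory rather than stored on disk.

```python
from typing import Callable, Iterable, List


def _attempt(download: Callable[[int], bool], image_id: int) -> bool:
    """Run one download attempt; an exception counts as a failure."""
    try:
        return bool(download(image_id))
    except Exception:
        return False


def run_with_final_retry(
    image_ids: Iterable[int], download: Callable[[int], bool]
) -> List[int]:
    """Download every id once, then retry the failures one more time.

    Returns the ids that still failed after the retry pass.
    """
    # First pass: remember everything that failed during the batch.
    failed = [i for i in image_ids if not _attempt(download, i)]
    # Final pass: give each failure exactly one more chance.
    return [i for i in failed if not _attempt(download, i)]
```

The returned list could then be written to a plain-text file so that ids which keep failing are at least visible after a batch run.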

biggestsonicfan commented 1 year ago

Still sorting, and it looks like 49 folders (pixiv member ids) have no database entries. I am scanning my backup databases now.

EDIT: Okay, some of these just contain a "folder" image, which is the profile picture of the pixiv user. This could mean they never submitted any works in the first place, which is fine. I'm not sure how they would have gotten included in my downloads, but sure.
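A sketch of how folders without database entries could be found, assuming folder names end in "(member_id)" as in the %artist% (%member_id%) pattern mentioned earlier in this thread, and that pixiv_master_image has a member_id column. `folders_without_db_entries` is a hypothetical helper.

```python
import os
import re
import sqlite3

# Assumed folder naming: "<artist> (<member_id>)".
MEMBER_ID_IN_FOLDER = re.compile(r"\((\d+)\)$")


def folders_without_db_entries(root, conn):
    """Return folder names under root whose member_id has no master rows."""
    known = {
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT member_id FROM pixiv_master_image"
        )
    }
    unmatched = []
    for entry in sorted(os.listdir(root)):
        if not os.path.isdir(os.path.join(root, entry)):
            continue  # only folders are considered
        match = MEMBER_ID_IN_FOLDER.search(entry)
        if match and int(match.group(1)) not in known:
            unmatched.append(entry)
    return unmatched
```

Folders whose names do not match the assumed pattern are skipped silently, so a different filename format would need its own pattern.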