bitbybyte / fantiadl

Download posts and media from Fantia
MIT License
299 stars 51 forks

RFC: Database to store download state & metadata #116

Closed · xWTF closed this 5 months ago

xWTF commented 1 year ago

Background

pixivFANBOX killed many posts recently, several creators I support have moved to fantia, and I urgently need a fully automated download & organization workflow. Currently, fantiadl doesn't store the download state for each post: it sends a request for every post and every content file (even files that already exist), which slows the download process considerably and makes it unsuitable for periodic cronjobs.

Solution?

This PR implements a sqlite3 database that holds the download state for fantia post and post_content entries. It also tracks the state of each downloaded URL (currently for images only; since fantia stores images on S3 and uses a UUID as the image name, I assume this is safe). Post contents and URLs already present in the database are skipped, which speeds up downloads and reduces the number of requests sent. A post is marked as "complete" once all of its contents are accessible and downloaded, further reducing unnecessary requests.

The PR also fixes tiny mistakes in the perform_download method and changes incomplete_filename to the full filename with a .tmp suffix, which most sync drives ignore by default. BTW, the database functionality is completely optional; it is enabled only when you provide a path with the --db parameter.
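
To make the idea concrete, here is a rough sketch of the kind of schema this enables. The table and column names below are illustrative only and may not match the PR exactly:

    import sqlite3

    # Illustrative schema sketch; actual table/column names in the PR may differ.
    conn = sqlite3.connect("fantiadl.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS posts (
        id                INTEGER PRIMARY KEY,  -- fantia post id
        download_complete INTEGER DEFAULT 0     -- 1 once every content was fetched
    );
    CREATE TABLE IF NOT EXISTS post_contents (
        id                INTEGER PRIMARY KEY,  -- fantia post_content id
        post_id           INTEGER REFERENCES posts(id),
        download_complete INTEGER DEFAULT 0
    );
    CREATE TABLE IF NOT EXISTS urls (
        url        TEXT PRIMARY KEY,            -- e.g. S3 image URL with UUID name
        downloaded INTEGER DEFAULT 0
    );
    """)
    conn.commit()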

Request for comment

I'm currently storing only the metadata required for my automated organization solution. Since the table structure will be relatively hard to change later, please comment if you think the DB structure needs changes before this PR gets merged.

bitbybyte commented 1 year ago

I'll be taking a look at merging this in the next few days hopefully.

xWTF commented 1 year ago

Thanks, it has been working flawlessly on my NAS for months. I just checked again to make sure my fork is up to date; no merge conflicts here :)

bitbybyte commented 5 months ago

This should be mostly ready now.

One thing I want to square is how mark_incomplete_post interacts with the way we now mark a post as completed in the DB. It's probably better to strictly track what was actually retrieved rather than what the metadata reports (we check visible_status there), because of things like 404 download links, etc.

bitbybyte commented 5 months ago

Also I'm not sure about:

        if self.db.conn and db_post and db_post["download_complete"]:
            self.output("Post {} already downloaded. Skipping...\n".format(post_id))

This seems problematic if new post contents are added later.

bitbybyte commented 5 months ago

This seems problematic if new post contents are added later.

I changed this check to run after retrieving the initial post JSON so we can verify whether converted_at appears to have changed.
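
Roughly the idea, as a simplified sketch (the DB lookup helper and field names here are stand-ins for whatever the code actually uses):

    # Simplified sketch: only skip a post marked complete in the DB if its
    # stored converted_at still matches the freshly fetched post JSON.
    db_post = self.db.get_post(post_id) if self.db.conn else None
    if (db_post and db_post["download_complete"]
            and db_post["converted_at"] == post_json.get("converted_at")):
        self.output("Post {} already downloaded. Skipping...\n".format(post_id))
        return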

Forgot to mention: I also reverted to using .incomplete. I think that's valuable knowledge for things like sync shares for the people who enable that option, and of course it's more specific than .tmp, so you can still choose to exclude it manually.

@xWTF Happy almost-one-year anniversary to this PR. 🙂 Before I merge, I'd appreciate any thoughts you have.

xWTF commented 5 months ago

This seems problematic if new post contents are added later.

I agree this problem exists if new contents are added later; however, in my use case the creators don't add new content after publishing posts, so I'd like a switch to control this behaviour:

after retrieving the initial post JSON

By adding a new flag (e.g. --db-check-update), we can cover both requirements: checking for updates and skipping the check :)

BTW, I see that you changed the URL cache part. I'm not sure that's a good idea, since we can only guarantee that S3 URLs have consistent content. The content behind any other kind of URL may change over time, and the user may want to update it.

xWTF commented 5 months ago

I've implemented & tested the flag mentioned above in https://github.com/bitbybyte/fantiadl/commit/25f1f08b34324f495e5bee30b922cd70972fe43b; please take a look :)

bitbybyte commented 5 months ago

The if self.db.conn check allows verifying that a DB is actually configured, because self.db is still an object even when no path is provided. I would prefer keeping these checks to ensure any output specifically about the database isn't shown when people don't specify --db.

I am more in favor of an opposite flag, like --bypass-full-post-check. We should err on the side of not missing post contents by default.
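
Something along these lines, purely as a sketch (the final flag name, help text, and wiring may differ):

    import argparse

    parser = argparse.ArgumentParser()
    # Default behaviour stays "re-check completed posts for new contents";
    # the flag explicitly opts out of that safety check.
    parser.add_argument(
        "--bypass-full-post-check",
        action="store_true",
        help="trust download_complete in the DB and skip re-checking completed posts",
    )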

I'm not sure if it's a good idea since we can only guarantee that the S3 URLs will have consistent content. The content of any other form of URLs may change overtime and the user may want to update them.

Kinda goes back to my point that I'm not sure we should really be tracking URLs at all. I'll have to think about this.

xWTF commented 5 months ago

I've inverted the flag. I also fixed a tiny bug introduced in the previous check change.

I'd like to revert the change you made to the URL cache eligibility check. What do you think about that?

bitbybyte commented 5 months ago

My preference right now would probably be to remove the url table and that check. If we limit to S3 URLs then we would only get the benefit of that check for images when downloading, right?

It seems like if a post_contents entry is present, we shouldn't have to worry about verifying the URL. The only exception would be metadata like the fanclub header, icon, etc.

        if self.db.conn and self.db.is_post_content_downloaded(post_json["id"]):
            self.output("Post content already downloaded. Skipping...\n")
            return True

xWTF commented 5 months ago

I assume caching S3 URLs (with the previous implementation) is safe for now -- it has been working for about a year and I haven't noticed any issues across the 20 fanclubs I download automatically. One of my main goals for this PR is to send as few requests as possible, which obviously includes the header and icon images.

bitbybyte commented 5 months ago

Right, let's keep the URL checks in some manner then.

Everything from cc.fantia.jp and c.fantia.jp appears to be hashed, so is it really true that contents can be modified without the URL changing, though? Not trying to doubt you, just curious whether you've encountered that being the case. I'd really like to avoid hardcoding a CDN URL to decide whether we should insert. Fanclub headers and such are stored under c.fantia.jp, not cc.fantia.jp, so it doesn't seem like they were ever caught by your initial check (which means they were still being requested anyway before skipping).

xWTF commented 5 months ago

Hmm, fantia's CDN uses a UUID as the file key, so I assumed each upload creates a new UUID, including for an "edit" action (not sure whether that kind of action even exists). Since there are no official docs describing this and I don't have a creator account, I can't guarantee it's 100% correct, but it's very likely to be.

These checks were initially implemented to reduce requests when a download gets interrupted and you resume downloading some content. It's fine to remove them if you consider hardcoding CDN URL checks bad practice :)

bitbybyte commented 5 months ago

Got it, thanks for the detail and your work on this! We'll keep things as they are currently. If the URLs turn out to be a concern, we can address that in a follow-up.