RipMeApp / ripme-designs

Design discussion for ripmeapp features

Long-term re-rip sustainability #1

Open metaprime opened 5 years ago

metaprime commented 5 years ago

This proposal is a work in progress but something I would like to work on. Feedback is welcome.

Primary Scenario

Following a reddit user and re-ripping their content across multiple versions of ripme.

Some users are always adding new content, and we are constantly updating the save-filename format in response to feature requests. I don't think we should slow progress on this, but we should allow users to opt in to the latest filename format without having to start their ripped folder from scratch (saving time and bandwidth).

I also think that we should not prevent users from deleting files, moving them into a subdirectory, or renaming them (to add descriptions, tags, etc.). Supporting this without re-ripping content is a complex task and is not possible at the moment. Re-ripping should not undo the work of cleaning up and categorizing files. This is the biggest blocker for me to keep my collection organized while using ripme.

Metadata Solution

I propose a per-rip-directory metadata file (suggest JSON for flexibility, ease of use, and user-readability) which includes the URL (of the actual image file that was downloaded), the filename it was ripped to, and a hash of the file's contents.
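To make the idea concrete, here is a minimal Python sketch of what writing such a file could look like. Everything in it is an assumption for illustration: the file name `ripme-metadata.json`, the field names, the top-level `files` key, and the choice of SHA-1 as the content hash are all hypothetical, and ripme itself is a Java project, so this only shows the shape of the data.

```python
import hashlib
import json
from pathlib import Path


def make_entry(url: str, path: Path) -> dict:
    """Build one metadata record for a downloaded file (hypothetical schema)."""
    # SHA-1 is an assumption here; the proposal does not fix a hash algorithm.
    digest = hashlib.sha1(path.read_bytes()).hexdigest()
    return {"url": url, "filename": path.name, "sha1": digest}


def save_metadata(rip_dir: Path, entries: list[dict]) -> Path:
    """Write all records to a single JSON file inside the rip directory."""
    meta_path = rip_dir / "ripme-metadata.json"  # assumed file name
    meta_path.write_text(json.dumps({"files": entries}, indent=2))
    return meta_path
```

A rip would call `make_entry` once per downloaded file and `save_metadata` once at the end, so the file stays readable and editable by hand.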

URL

We can ensure that new files are not downloaded from the same URLs (per-rip).
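As a rough sketch of that check (in Python for brevity, with a hypothetical `url` field on each metadata entry), the ripper could filter its candidate list against the metadata before downloading anything:

```python
def urls_to_download(candidate_urls, entries):
    """Return candidate URLs not yet recorded in this rip's metadata.

    `entries` is a list of dicts with a "url" key (an assumed schema).
    Order is preserved and repeated candidates are returned only once.
    """
    known = {entry["url"] for entry in entries}
    seen = set()
    todo = []
    for url in candidate_urls:
        if url not in known and url not in seen:
            seen.add(url)
            todo.append(url)
    return todo
```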

If we are not careful, this functionality might conflict with the algorithm for the global rip history file. When new files are downloaded they can be checked against the existing file hashes to

Filename

Filename & Content Hash

When re-ripping, if there is still a file at a given filename, we do not need to take any action.

If the filename doesn't exist anymore, we can first assume that the file has moved and try to find it: check for filenames that are not known to ripme (via metadata) and compare the hash of their content against the hash of the "missing" file in the metadata. If the file is found (anywhere in the subtree of the rip directory), update the filename in the metadata. If the file is not found, mark it as <deleted> in the metadata so we will not try to re-download it on future re-rips.
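The lookup described above could be sketched roughly like this in Python. The entry fields (`filename`, `sha1`, `isRenamed`) and the boolean `deleted` flag standing in for the <deleted> marker are assumptions for illustration, as is the use of SHA-1.

```python
import hashlib
from pathlib import Path


def sha1_of(path: Path) -> str:
    return hashlib.sha1(path.read_bytes()).hexdigest()


def reconcile(rip_dir: Path, entries: list[dict]) -> None:
    """Update entries in place: relocate moved files, flag deleted ones."""
    # Files that still sit at the path recorded in the metadata need no action.
    known = {e["filename"] for e in entries if (rip_dir / e["filename"]).is_file()}

    # Index every file the metadata does not know about by its content hash.
    unknown = {}
    for p in sorted(rip_dir.rglob("*")):
        rel = p.relative_to(rip_dir).as_posix()
        if p.is_file() and rel not in known:
            unknown.setdefault(sha1_of(p), rel)

    for e in entries:
        if e.get("deleted") or (rip_dir / e["filename"]).is_file():
            continue
        new_name = unknown.pop(e["sha1"], None)
        if new_name is None:
            e["deleted"] = True        # assumed stand-in for <deleted>
        else:
            e["filename"] = new_name   # the file was moved or renamed
            e["isRenamed"] = True
```

Only unknown files get hashed, so a re-rip of an untouched folder does no hashing at all. A real implementation would probably also skip the metadata file itself during the scan.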

We can also use the content hash to de-dupe the output in case the same file was downloaded from multiple different URLs (if, perhaps, multiple URLs on the same site like imgur could lead to the same file, and a user uploaded with a different URL in multiple posts). It is easier to de-dupe after-the-fact than to detect every way in which duplicates could be generated. De-dupe should update the metadata to include multiple source URLs so that we don't re-rip duplicates.
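A minimal after-the-fact de-dupe pass might look like the following Python sketch. It assumes each metadata entry carries a `sha1` content hash, a `filename`, and a list of source `urls`; those field names are hypothetical.

```python
def dedupe(entries):
    """Merge entries that share a content hash.

    Returns (merged_entries, redundant_filenames). The first entry seen for
    each hash is kept and collects the source URLs of all its duplicates.
    """
    merged = {}
    redundant = []
    for e in entries:
        if e["sha1"] in merged:
            redundant.append(e["filename"])
        kept = merged.setdefault(e["sha1"], {**e, "urls": []})
        for url in e["urls"]:
            if url not in kept["urls"]:
                kept["urls"].append(url)
    return list(merged.values()), redundant
```

The function only rewrites metadata. Deleting the redundant files from disk is left to the caller, so a user could review the list first.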

Filename Change Solution

Do not re-rip old content unless explicitly requested. If explicitly requested (there's already an option for this), old content would be updated to the new filename. We would match up the data source (like the title of a post being added to the filename) to the actual file on disk by using the metadata to match up the URL and/or by downloading the file and matching the content, if the file can be found or a metadata entry for it is found (e.g. <deleted>).

Renaming every file could throw away categorization work (moving files into folders or renaming them). If a filename has changed from the original, the manually updated filename should be kept, or the user should be asked whether they would like to rename the file. The options would be: take the new title; take both titles (new first, to preserve name-sorting of the new titles while keeping the manually changed title, without asking the user to write new titles for potentially a lot of content); or see the old and new titles and write a new one.

This requires keeping information in the metadata about whether a filename is original or changed. When a file is detected as a move/rename above, that would be a good time to add this bit of data (JSON boolean "isRenamed": true).
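As an illustration of how the "isRenamed" bit could drive the decision, here is a small Python sketch. The `choice` values ("keep", "new", "both") are hypothetical names for the options discussed above, and the new title is assumed to be already sanitized for use as a filename (no path separators).

```python
from pathlib import PurePosixPath


def resolve_name(current: str, new_title: str, is_renamed: bool,
                 choice: str = "keep") -> str:
    """Pick the filename for an existing file when the title format changes."""
    path = PurePosixPath(current)
    if not is_renamed or choice == "new":
        stem = new_title                   # untouched file, or user took the new title
    elif choice == "both":
        stem = f"{new_title} {path.stem}"  # new title first keeps name-sorting
    else:
        return current                     # default: keep the manual name
    # Stay in the same folder and keep the extension.
    return str(path.with_name(stem + path.suffix))
```

For example, `resolve_name("sorted/mine.jpg", "New", True, "both")` yields `sorted/New mine.jpg`, which keeps the file in the folder the user moved it to.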

A user can also explicitly force renames of all the content, as if a fresh rip had been done with the new titles. This throws away all renames and categorization work (probably preserving file deletes, or with a separate option for that as well) and applies the new titles as renames.

Potential Problems

Related Issues

https://github.com/RipMeApp/ripme/issues/293

Example

TODO

Tagging @cyian-1756 for feedback

cyian-1756 commented 5 years ago

Looks pretty well thought out to me.

suggest JSON for flexibility, ease of use, and user-readability

If we're looking for user-readability, we might want to check out YAML as well.

I propose a per-rip-directory metadata file (suggest JSON for flexibility, ease of use, and user-readability) which includes the URL (of the actual image file that was downloaded), the filename it was ripped to, and a hash of the file's contents.

What are you thinking for the hash? Considering that ripme supports sites that host large video files, I think we should use CRC-32 for the first check and, if two files have the same CRC-32 hash, then check using SHA-1.
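A minimal sketch of that two-stage check, in Python using only the standard library (`zlib.crc32` and `hashlib.sha1`). The function names and the 1 MiB read size are assumptions for illustration. Files are read in blocks so large videos are never loaded into memory at once.

```python
import hashlib
import zlib

BLOCK = 1 << 20  # read 1 MiB at a time (arbitrary choice)


def crc32_of(path) -> int:
    crc = 0
    with open(path, "rb") as f:
        while block := f.read(BLOCK):
            crc = zlib.crc32(block, crc)  # continue the running checksum
    return crc


def sha1_of(path) -> str:
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while block := f.read(BLOCK):
            h.update(block)
    return h.hexdigest()


def same_content(a, b) -> bool:
    """Cheap CRC-32 comparison first; SHA-1 only when the CRCs match."""
    if crc32_of(a) != crc32_of(b):
        return False
    return sha1_of(a) == sha1_of(b)
```

In practice the CRC-32 and SHA-1 of already-ripped files would come from the metadata, so only the newly downloaded file would need hashing.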

I also think storing only the hash of the entire file is a mistake. It would be better to store hashes of the chunks of the file (where a chunk is X bytes of the file). This would let us speed up checking for dupes: instead of hashing the entire file, in most cases we would only need to compare the first chunk of the two files.
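Something like the following Python sketch shows the idea. The chunk size and function names are assumptions, and the stored list is assumed to come from the metadata file.

```python
import zlib


def chunk_crcs(path, chunk_size=1 << 20):
    """Return the CRC-32 of each chunk_size-byte chunk of the file, in order."""
    crcs = []
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crcs.append(zlib.crc32(chunk))
    return crcs


def could_be_duplicate(path, stored_crcs, chunk_size=1 << 20):
    """Quick rejection test: compare only the first chunk against metadata."""
    with open(path, "rb") as f:
        first = f.read(chunk_size)
    if not stored_crcs:  # the stored file was empty
        return first == b""
    return zlib.crc32(first) == stored_crcs[0]
```

A `True` result only means the files share a first chunk, so the remaining chunks (or a full-file hash) still have to be compared before treating them as dupes.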

Another way to speed up hashing and checking for dupes would be to store the file size in the metadata. This would allow us to skip comparing any files whose sizes differ, since they can't be dupes of one another (ignoring partially downloaded dupes).
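A rough Python sketch of the size pre-filter, with a hypothetical function name. Here sizes are read from disk, but they could just as well come from the metadata.

```python
import os
from collections import defaultdict


def candidate_groups(paths):
    """Group files by size; only groups with 2+ files can contain dupes."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    return [group for group in by_size.values() if len(group) > 1]
```

Only the files in the returned groups need to be hashed at all, which should cut most of the work for a typical rip folder.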

Potential Problems

I think we have to watch out for any long run times or high CPU usage that this might cause. No one is going to want to use this feature if it's slow or slows down the entire computer.

Example

So the example JSON file should look something like:

{
    "url": "https://somesite.tld/images",
    "files": {
        "https://somesite.tld/images/image1.png": {
            "is_renamed": false,
            "size": 0,
            "crc32": "hash",
            "sha1": "hash",
            "chunks": {
                "1": "CRC-32 hash",
                "2": "CRC-32 hash"
            }
        }
    }
}