MalloyDelacroix / DownloaderForReddit

The Downloader for Reddit is a GUI application with some advanced features to extract and download submitted content from reddit.

Support for Migrating 2.x DB to 3.x #150

Closed · redpoptarts closed this 3 years ago

redpoptarts commented 3 years ago

The only way I could find to move settings from the old app to the new one was to export lists via TXT file. (The v2 JSON / CSV / etc. formats are not supported for v3 import.) However, this is only a partial solution. When I download my old lists in v3, all of the old files are downloaded again as duplicates, because the new database has no record of what was already downloaded.

In its current state, I cannot use DownloaderForReddit v3 until this is resolved. If there were a way to retain the old download list, limit posts by date, or properly skip duplicates that already exist on the hard drive (the preferred solution, this would be awesome!), that would let me start using v3 of your amazing program. <3

(Thanks for all the hard work, this new v3 looks great and has lots of cool features I've wanted for a while!)

SomeRandomDude870 commented 3 years ago

Same here. Since I accidentally deleted my database, it would have to download everything anew. I would recommend adding a function that reads a folder and adds all of its contents to the database. Since everything was saved in subreddit and username folders, the first values are already there. When downloading something new, it could check whether an old file and a "new" one are equal and then update the old record. Although, does reddit even send a checksum before something gets downloaded? I also have content from deleted users, so it would be nice to be able to view those entries directly in the database.

redpoptarts commented 3 years ago

I agree, there are a lot of potential feature requests in here, but the most helpful and straightforward one would be to avoid creating a duplicate file and database entry when a file with the same name and filesize/checksum already exists.

This logic could be applied conditionally based on the Avoid Duplicates flag, or a new option could be added that asks how to handle a conflicting filename: [Skip, Overwrite, Rename].
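A minimal sketch of how that check might work (the function, option names, and the idea of comparing against an expected size from a Content-Length header are all illustrative, not the app's actual API):

```python
import os

def resolve_conflict(dest_path, expected_size, policy="skip"):
    """Decide what to do when a download target already exists on disk.

    policy is one of "skip", "overwrite", "rename" (hypothetical option names).
    Returns the path to write to, or None to skip the download entirely.
    """
    if not os.path.exists(dest_path):
        return dest_path
    # Same name and same size: treat it as an existing duplicate and skip.
    if expected_size is not None and os.path.getsize(dest_path) == expected_size:
        return None
    if policy == "skip":
        return None
    if policy == "overwrite":
        return dest_path
    # "rename": append an incrementing suffix until the name is free.
    base, ext = os.path.splitext(dest_path)
    counter = 1
    while os.path.exists(f"{base} ({counter}){ext}"):
        counter += 1
    return f"{base} ({counter}){ext}"
```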

That would get me up and running. There is still a gap for deleted users/posts, which is SomeRandomDude870's concern. It would be nice to be able to scan a folder and populate the database, but stronger database management tools sound like a separate concern, with more on the way in the future.

MalloyDelacroix commented 3 years ago

The base issue has been mostly fixed, at least to the extent possible. You can now import a JSON file that was exported from version 2.X.X. The imported users/subreddits will have the correct date limit, as well as most of the other settings that can be set in version 2. A lot of settings have changed incompatibly (the way downloads are named, for instance) or are completely new, and those cannot be imported from old export files.

The additional comments about duplicate downloads are really separate issues, but I will address them here for the time being.

Part of the problem is solved by being able to import from a JSON file: the import sets each date limit to the last date downloaded with the old version, which keeps files from before that date from being downloaded again and avoids most of the duplicate problem.
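For reference, the import step amounts to something like the following sketch (the v2 export field names "name" and "date_limit", and the assumption that the date is stored as a Unix timestamp, are guesses and may not match the real export format):

```python
import json
from datetime import datetime

def load_v2_export(path):
    """Read a version 2 JSON export and pull out the fields that map
    cleanly onto version 3: the object name and its last download date."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)  # assumed to be a list of user/subreddit objects
    imported = []
    for entry in data:
        date_limit = entry.get("date_limit")
        if date_limit is not None:
            # Carrying the old date limit forward stops previously
            # downloaded posts from being fetched again after migration.
            date_limit = datetime.fromtimestamp(date_limit)
        imported.append({"name": entry.get("name"), "date_limit": date_limit})
    return imported
```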

It is not feasible to import existing files into the database. The file path and extension would be the only information available by reading those files, and they are only two of the many columns in the content download table. Because of the adjustable way downloads can be stored and named, there is no way to reconstruct usable information from the file alone; it is not guaranteed that the title, author, subreddit, creation date, or associated post are contained in the folder structure or file name.
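To illustrate the point, this is roughly everything a folder scan can recover (a hypothetical sketch, not code from the project), which falls well short of the title, author, subreddit, post, and creation-date columns the content table needs:

```python
import os

def scan_download_folder(root):
    """Walk an existing download folder and collect everything that can be
    known about each file without the original post: just the path, the
    extension, and the on-disk size and modification time."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            _, extension = os.path.splitext(filename)
            stat = os.stat(full_path)
            yield {
                "path": full_path,
                "extension": extension.lstrip("."),
                "size": stat.st_size,
                "modified": stat.st_mtime,
            }
```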

I have considered adding an option to skip downloads whose post title is already in the database, but depending on the user or subreddit being downloaded, this is not very feasible either. Some subreddits dictate that all posts have the same title, for instance, which would lead to a lot of missed posts.

The final issue, avoiding duplicates based on checksum, is a feature I have been trying to work out and have been unsuccessful in doing so. The problem is that none of the container sites that are downloaded from provide a checksum at download time (that I'm aware of, anyway). So to make this feature work, a file must be downloaded, hashed, and then compared to existing entries stored in the database. That means the file has to be downloaded and stored on the hard drive, then deleted if a duplicate is found, which adds a lot of complexity and time to the download process and is a very inefficient way to avoid a duplicate. I am still considering making this a feature (or at least some version of it) if I can find a way to make it work reliably.
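The download-then-hash flow described above would look roughly like this (a sketch only, not the project's code: the use of requests, MD5, a temp file, and an in-memory set of known hashes are all assumptions for illustration):

```python
import hashlib
import os
import shutil
import tempfile

import requests

def download_unless_duplicate(url, dest_path, known_hashes):
    """Download url to a temp file, hash it, and only keep it if the hash
    is not already in known_hashes (a set loaded from the database).
    Returns the hash if the file was kept, or None if it was a duplicate."""
    digest = hashlib.md5()
    with requests.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()
        with tempfile.NamedTemporaryFile(delete=False) as tmp:
            for chunk in response.iter_content(chunk_size=64 * 1024):
                digest.update(chunk)
                tmp.write(chunk)
            tmp_path = tmp.name
    file_hash = digest.hexdigest()
    if file_hash in known_hashes:
        # The bytes had to be downloaded before they could be hashed,
        # so all that can be done with a duplicate is throw it away.
        os.remove(tmp_path)
        return None
    shutil.move(tmp_path, dest_path)
    known_hashes.add(file_hash)
    return file_hash
```

This is exactly the inefficiency described: every candidate file costs a full download before the duplicate check can even run.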

redpoptarts commented 3 years ago

Awesome, thanks for the detailed response. I'll see what I can do for a workaround in the meantime.