brokkr / poca

A fast, multithreaded and highly customizable command line podcast client, written in Python 3

Save entry uid and channel identification to metadata #139

Closed: brokkr closed this issue 3 years ago

brokkr commented 3 years ago

An issue with keeping a database of files is that the two can get out of sync: the user manipulates the files, downloads break off, or the program crashes.

There are only two things we really need to know about already downloaded files: 1) that they are already downloaded, so that they may be deleted and we can clean up after ourselves; and 2) how to recognise them in the feed, so that we know which are the same file (really: to recognise that a file in Wanted is already here and we don't need to download it again).

In all cases where we need any more info than that - and I can't think of any - we can get it from the feed.

All that is needed for 1 is that the file is present within the scope of the media folder.

All that is needed for 2 is that we can extract a UID for the entry (and ideally for the channel/subscription so that we don't rely on folder names) from the file itself. I don't trust file names. I don't trust the comment field. But with mutagen I can make custom (comment) frames, right? One for the UID, one for the channel. As long as the user doesn't rip up the metadata, we're golden.
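
A minimal sketch of what that could look like with mutagen, assuming an MP3 file that already has an ID3 tag; the frame descriptions 'uid' and 'channel' are placeholders, not settled names:

```python
# Sketch: stamp an episode file with custom TXXX frames via mutagen.
# 'uid' and 'channel' are illustrative descriptions, not final names.
from mutagen.id3 import ID3, TXXX

def stamp(filepath, uid, channel):
    """Write entry uid and channel id into custom TXXX frames."""
    tags = ID3(filepath)                # assumes an existing ID3 tag
    # encoding=3 means UTF-8; desc becomes part of the frame key
    tags.add(TXXX(encoding=3, desc='uid', text=uid))
    tags.add(TXXX(encoding=3, desc='channel', text=channel))
    tags.save()
```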

Advantages

Drawbacks

Possible issues

brokkr commented 3 years ago

Idea for preserving knowledge of user_deleted:

We keep a simple list of UIDs for every file we successfully download. If a file is removed automatically, the uid gets deleted from the list. In effect: every subupgrade will drop a tuple (removed uids, added uids) into a queue. When all subupgrade threads have been joined, the queue is used to update the list. The list is then written out to a simple text file. When the script starts again, the text file is read back in as the list. If a uid is 1) in the list, 2) not found on disk and 3) otherwise in Wanted, we conclude that it was there and is now gone. We then need a second list (dammit) of user_deleted uids....
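
A rough sketch of that bookkeeping, with illustrative names (the actual subupgrade thread code is not shown):

```python
# Sketch of the bookkeeping described above; names are illustrative.
import queue

uid_queue = queue.Queue()
# Each subupgrade thread drops one tuple before it is joined:
# uid_queue.put((removed_uids, added_uids))

def update_uid_list(known_uids, uid_queue):
    """Apply all queued (removed, added) tuples after the threads join."""
    while not uid_queue.empty():
        removed, added = uid_queue.get()
        known_uids = [uid for uid in known_uids if uid not in removed]
        known_uids.extend(added)
    return known_uids

def save_uid_list(known_uids, path='poca_uids.txt'):
    """Write the list out as a simple text file, one uid per line."""
    with open(path, 'w') as state_file:
        state_file.write('\n'.join(known_uids) + '\n')

def load_uid_list(path='poca_uids.txt'):
    """Read the list back in at startup; empty if the file is gone."""
    try:
        with open(path) as state_file:
            return state_file.read().splitlines()
    except FileNotFoundError:
        return []
```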

brokkr commented 3 years ago

Alternatively: add an attribute to from_the_top called "batch". By default it's 1, so you get the first batch of max_number episodes. When you're done with those, you don't even need to delete the files; you just change batch to 2. Poca then moves the needle to the second batch of max_number.

Of course, you could get the same effect using the after_date filter. Or "batch" could just be a filter that could technically be used with either direction.

It would need to be specified whether the batch calculation takes its starting point at the start/end of the feed or at your last batch (these would differ if the feed is gradually dropping old episodes, e.g. Savage Love, No Such Thing, etc.)
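
For the simpler interpretation (counting from the start of the feed), the calculation is just a slice; counting from the last batch would require persisting where the previous batch ended. A sketch, with batch and max_number as in the comment above:

```python
# Sketch: batch as a window of max_number entries, counted from the
# start of the feed (the simpler of the two interpretations).
def batch_slice(entries, max_number, batch=1):
    """Return the batch'th window of max_number entries (1-indexed)."""
    start = (batch - 1) * max_number
    return entries[start:start + max_number]

# With max_number=5: batch=1 -> entries[0:5], batch=2 -> entries[5:10].
# If the feed drops old episodes, these indices shift, which is the
# start-of-feed vs. last-batch ambiguity noted above.
```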

brokkr commented 3 years ago

Problem no. 2: Etag and Modified.

Where do we store those without a database?
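
One non-database option, sketched under the assumption that a small JSON state file is acceptable (the file name and dict layout are made up here); feedparser accepts etag and modified directly:

```python
# Sketch: persist etag/modified per feed url in a small JSON file.
# STATE_FILE and the dict layout are illustrative choices.
import json
import feedparser

STATE_FILE = 'feed_state.json'

def load_state():
    try:
        with open(STATE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def fetch(url):
    state = load_state()
    cached = state.get(url, {})
    doc = feedparser.parse(url, etag=cached.get('etag'),
                           modified=cached.get('modified'))
    if getattr(doc, 'status', None) == 304:
        return None    # server says: unchanged since last fetch
    state[url] = {'etag': getattr(doc, 'etag', None),
                  'modified': getattr(doc, 'modified', None)}
    with open(STATE_FILE, 'w') as f:
        json.dump(state, f, indent=2)
    return doc
```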

brokkr commented 3 years ago

Even if the idea isn't as genius as it may first have appeared, there is a more general point:

Save specific data in a specific way for a specific purpose. Don't just jam everything down a pickle. We could end up with a text file, some metadata and an sqlite file... if that is what best suits each individual purpose.

brokkr commented 3 years ago

Also note #128: YAML or JSON may be a simple way to store structured data as well.

Which means that #139 should only be decided after #128.

brokkr commented 3 years ago

Supposing something like YAML would work, a file could play host to a relatively flat structure consisting of

Note that a flat list is actually to be preferred over a sub-dictionary, as it evades the problem of channel ids, and the risk of UIDs overlapping between channels is exceedingly small. The added computing power needed to test against a few more UIDs is tiny.
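
A sketch of reading and writing such a flat list with PyYAML; the file name and the plain-list layout are one possible interpretation:

```python
# Sketch: a flat uid list persisted as YAML. File name is illustrative.
import yaml

def save_uids(uids, path='poca_state.yaml'):
    with open(path, 'w') as f:
        yaml.safe_dump(uids, f, default_flow_style=False)

def load_uids(path='poca_state.yaml'):
    try:
        with open(path) as f:
            return yaml.safe_load(f) or []
    except FileNotFoundError:
        return []    # the non-vital case discussed below
```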

None of these data are in any way vital to operations. Say the yaml file got deleted and we ran poca...

We would then create a new file. If the user really meant for the files to be deleted, we would quickly get up to speed on that. (OK, if you're forty episodes into Welcome to Night Vale and have to redownload and delete 40 episodes, it might be a pain... but that would also be the case with the current arrangement.)

As for current files, their data would still reside in the metadata. The dividing line is between single-entry data and more general data. Single-entry data reside with the file and are unlikely to get separated from it. The YAML data are really all subscription-level data (even if we don't care to separate UIDs by subscription).

That subscription-level data are not essential is coincidental, but it is a coincidence we can use to avoid investing in industrial-strength database machinery for that purpose. File-level data are important: without them we would keep downloading the same file over and over. The fact that there is an obvious non-database solution with some good arguments in its favour again makes it possible to avoid databases.

brokkr commented 3 years ago

Regarding using a custom frame for the job, TXXX frames would seem ideally suited.

They allow for a 'desc' attribute that combines with 'TXXX:' to make up the frame key, e.g. desc='uid' -> 'TXXX:uid'. I can save any string I want in it. Both eyeD3 and Mp3Tag recognised it and printed it clearly (Mp3Tag as a 'UID' frame on par with standard frames, eyeD3 as UserTextFrame: [Description: uuid]). VLC included it on the (additional) Metadata tab as uneditable info. Mp3Tag was also able to 'resave' it when adding further information, while eyeD3 threw an eyed3.id3.tag.TagException: Unable to convert the following frames to version v2.4: UID (might be a bug - there is no real conversion involved, as the existing tag was 2.4 and the frame encoded as UTF-8).
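
Reading the frame back with mutagen follows the same key format; a sketch, with the file name made up:

```python
# Sketch: read the custom frame back via its 'TXXX:uid' key.
from mutagen.id3 import ID3

tags = ID3('episode.mp3')              # hypothetical file
frames = tags.getall('TXXX:uid')       # list of matching frames
uid = str(frames[0].text[0]) if frames else None
```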

While it proves the viability of the idea, it doesn't:

Next step:

brokkr commented 3 years ago

> In all cases where we need any more info than that - and I can't think of any - we can get it from the feed.

Therein lies the crucial flaw in the argument: we need to know the order of entries, and we cannot rely on the feed for that because feeds get trimmed. Because feeds get trimmed, we also risk losing the link between ids and channel/feed, since this solution would dissolve it.

brokkr commented 3 years ago

Basically, we would need to load not one but three values onto the metadata: ID, date/order and feed/channel. And we would still need to save feed state (etag/modified and active/inactive) somewhere else.

It might be tempting to implement the ID solution by itself - as a way to brand them cows - but the value of it is negligible. It would allow a user to rename or move files afterward and still have them recognised. And it would mean we could dispense with printing filepaths to yaml, opting instead to just use a list of IDs and a treewalk. But it would introduce a whole new way to fail.