brokkr / poca

A fast, multithreaded and highly customizable command line podcast client, written in Python 3

Save entry uid and channel identification to metadata #139

Closed: brokkr closed this issue 3 years ago

brokkr commented 3 years ago

An issue with keeping a database of files is that the two can get out of sync: the user manipulates the files, downloads break off, or the program crashes.

There are only two things we really need to know about already downloaded files: 1) that they are already downloaded, so that they may be deleted and we can clean up after ourselves; and 2) how to recognise them in the feed, so that we know which are the same file (really: to recognise that a file in Wanted is already here and we don't need to download it again).

In all cases where we need any more info than that - and I can't think of any - we can get it from the feed.

All that is needed for 1 is that the file is present within the scope of the media folder.

All that is needed for 2 is that we can extract a UID for the entry (and ideally for the channel/subscription so that we don't rely on folder names) from the file itself. I don't trust file names. I don't trust the comment field. But with mutagen I can make custom (comment) frames, right? One for the UID, one for the channel. As long as the user doesn't rip up the metadata, we're golden.
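
A minimal sketch of what that could look like with mutagen, assuming an MP3 file that already has an ID3 tag; the frame descriptions 'uid' and 'channel' are placeholders, not settled names:

```python
# Sketch: stamp an episode file with custom TXXX frames via mutagen.
# 'uid' and 'channel' are illustrative descriptions, not final names.
from mutagen.id3 import ID3, TXXX

def stamp(filepath, uid, channel):
    """Write entry uid and channel id into custom TXXX frames."""
    tags = ID3(filepath)                # assumes an existing ID3 tag
    # encoding=3 means UTF-8; desc becomes part of the frame key
    tags.add(TXXX(encoding=3, desc='uid', text=uid))
    tags.add(TXXX(encoding=3, desc='channel', text=channel))
    tags.save()
```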

Advantages

Drawbacks

Possible issues

brokkr commented 3 years ago

Idea for preserving knowledge of user_deleted:

We keep a simple list of UIDs for every file we successfully download. If a file is removed automatically, the uid gets deleted from the list. In effect: every subupgrade will drop a tuple (removed uids, added uids) into a queue. When all subupgrade threads have been joined, the queue is used to update the list. The list is then written out to a simple text file. When the script starts again, the text file is read back in as the list. If a uid is 1) in the list, 2) not found on disk and 3) otherwise in Wanted, we conclude that it was there and is now gone. We then need a second list (dammit) of user_deleted uids....
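
A rough sketch of that bookkeeping, with illustrative names (the actual subupgrade thread code is not shown):

```python
# Sketch of the bookkeeping described above; names are illustrative.
import queue

uid_queue = queue.Queue()
# Each subupgrade thread drops one tuple before it is joined:
# uid_queue.put((removed_uids, added_uids))

def update_uid_list(known_uids, uid_queue):
    """Apply all queued (removed, added) tuples after the threads join."""
    while not uid_queue.empty():
        removed, added = uid_queue.get()
        known_uids = [uid for uid in known_uids if uid not in removed]
        known_uids.extend(added)
    return known_uids

def save_uid_list(known_uids, path='poca_uids.txt'):
    """Write the list out as a simple text file, one uid per line."""
    with open(path, 'w') as state_file:
        state_file.write('\n'.join(known_uids) + '\n')

def load_uid_list(path='poca_uids.txt'):
    """Read the list back in at startup; empty if the file is gone."""
    try:
        with open(path) as state_file:
            return state_file.read().splitlines()
    except FileNotFoundError:
        return []
```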

brokkr commented 3 years ago

Alternatively: add an attribute to from_the_top called "batch". By default it's 1, so you get the first batch of max_number episodes. When you're done with those, you don't even need to delete the files; you just change batch to 2. Poca then moves the needle to the second batch of max_number.

Of course, you could get the same effect using the after_date filter. Or "batch" could just be a filter that could technically be used with either direction.

It would need to be specified whether the batch calculation takes its starting point at the start/end of the feed or at your last batch (these would differ if the feed is gradually dropping old episodes, e.g. Savage Love, No Such Thing, etc.)
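
For the simpler interpretation (counting from the start of the feed), the calculation is just a slice; counting from the last batch would require persisting where the previous batch ended. A sketch, with batch and max_number as in the comment above:

```python
# Sketch: batch as a window of max_number entries, counted from the
# start of the feed (the simpler of the two interpretations).
def batch_slice(entries, max_number, batch=1):
    """Return the batch'th window of max_number entries (1-indexed)."""
    start = (batch - 1) * max_number
    return entries[start:start + max_number]

# With max_number=5: batch=1 -> entries[0:5], batch=2 -> entries[5:10].
# If the feed drops old episodes, these indices shift, which is the
# start-of-feed vs. last-batch ambiguity noted above.
```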

brokkr commented 3 years ago

Problem no. 2: Etag and Modified.

Where do we store those without a database?
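
One non-database option, sketched under the assumption that a small JSON state file is acceptable (the file name and dict layout are made up here); feedparser accepts etag and modified directly:

```python
# Sketch: persist etag/modified per feed url in a small JSON file.
# STATE_FILE and the dict layout are illustrative choices.
import json
import feedparser

STATE_FILE = 'feed_state.json'

def load_state():
    try:
        with open(STATE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def fetch(url):
    state = load_state()
    cached = state.get(url, {})
    doc = feedparser.parse(url, etag=cached.get('etag'),
                           modified=cached.get('modified'))
    if getattr(doc, 'status', None) == 304:
        return None    # server says: unchanged since last fetch
    state[url] = {'etag': getattr(doc, 'etag', None),
                  'modified': getattr(doc, 'modified', None)}
    with open(STATE_FILE, 'w') as f:
        json.dump(state, f, indent=2)
    return doc
```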

brokkr commented 3 years ago

Even if the idea isn't as genius as it may first have appeared, there is a more general point:

Save specific data in a specific way for a specific purpose. Don't just jam everything down a pickle. We could end up with a text file, some metadata and an sqlite file... if that is what best suits each individual purpose.

brokkr commented 3 years ago

Also note #128: YAML or JSON may be a simple way to store structured data as well.

Which means that #139 should only be decided after #128.

brokkr commented 3 years ago

Supposing something like YAML would work, a file could play host to a relatively flat structure consisting of

Note that a flat list is actually to be preferred over a sub-dictionary, as it evades the problem of channel ids, and the risk of UIDs overlapping between channels is exceedingly small. The added computing power needed to test against a few more UIDs is tiny.
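
A sketch of reading and writing such a flat list with PyYAML; the file name and the plain-list layout are one possible interpretation:

```python
# Sketch: a flat uid list persisted as YAML. File name is illustrative.
import yaml

def save_uids(uids, path='poca_state.yaml'):
    with open(path, 'w') as f:
        yaml.safe_dump(uids, f, default_flow_style=False)

def load_uids(path='poca_state.yaml'):
    try:
        with open(path) as f:
            return yaml.safe_load(f) or []
    except FileNotFoundError:
        return []    # the non-vital case discussed below
```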

None of these data are in any way vital to operations. Say the yaml file got deleted and we ran poca...

We would then create a new file. If the user really meant for the files to be deleted, we would quickly get up to speed on that. (OK, if you're forty episodes into Welcome to Night Vale and have to redownload and delete 40 episodes, it might be a pain... but that would also be the case with the current arrangement.)

As for current files, their data would still reside in the metadata. The dividing line is between single-entry data and more general data. Single-entry data reside with the file and are unlikely to get separated from it. The YAML data are really all subscription-level data (even if we don't care to separate UIDs by subscription).

That subscription-level data are not essential is coincidental, but it is a coincidence we can use to avoid investing in industrial-strength database machinery for that purpose. File-level data are important: without them we would keep downloading the same file over and over. The fact that there is an obvious non-database solution with some good arguments in its favour again makes it possible to avoid databases.

brokkr commented 3 years ago

Regarding using a custom frame for the job, TXXX frames would seem ideally suited.

They allow for a 'desc' attribute that combines with 'TXXX:' to make up the frame key, e.g. desc='uid' -> 'TXXX:uid'. I can save any string I want in it. Both eyeD3 and Mp3Tag recognised it and printed it clearly (Mp3Tag as a 'UID' frame on par with standard frames, eyeD3 as UserTextFrame: [Description: uuid]). VLC included it on the (additional) Metadata tab as uneditable info. Mp3Tag was also able to 'resave' it when adding further information, while eyeD3 threw an eyed3.id3.tag.TagException: Unable to convert the following frames to version v2.4: UID (might be a bug - there is no real conversion involved, as the existing tag was 2.4 and the frame encoded as UTF-8).
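
Reading the frame back with mutagen follows the same key format; a sketch, with the file name made up:

```python
# Sketch: read the custom frame back via its 'TXXX:uid' key.
from mutagen.id3 import ID3

tags = ID3('episode.mp3')              # hypothetical file
frames = tags.getall('TXXX:uid')       # list of matching frames
uid = str(frames[0].text[0]) if frames else None
```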

While it proves the viability of the idea, it doesn't:

Next step:

brokkr commented 3 years ago

> In all cases where we need any more info than that - and I can't think of any - we can get it from the feed.

Therein lies the crucial flaw in the argument: we need to know the order of entries, and we cannot rely on the feed for that because feeds get trimmed. Because feeds get trimmed, we also risk losing the link between ids and channel/feed, since this solution would dissolve it.

brokkr commented 3 years ago

Basically, we would need to load not one but three values onto the metadata: ID, date/order and feed/channel. And we would still need to save feed state (etag/modified and active/inactive) somewhere else.

It might be tempting to implement the ID solution by itself - as a way to brand them cows - but the value of it is negligible. It would allow a user to rename or move files afterward and still have them recognised. And it would mean we could dispense with printing filepaths to yaml, opting instead to just use a list of IDs and a treewalk. But it would introduce a whole new way to fail.