Quality-based Trumping during import

ghost commented 11 years ago

This issue was automatically migrated from Google Code. Original author: telejes...@gmail.com (February 16, 2012 01:16:19) Original issue: https://github.com/google-code-export/beets/issues/341

hchapman commented 11 years ago

This would be superb. In my spare time I was going to try to write something like this.

sampsyo commented 11 years ago

Awesome! Now that we have a central place in the importer workflow where replacement decisions are made, this should be relatively straightforward to implement -- it's just a matter of coming up with the right design for how it should work.

On Wed, Apr 3, 2013 at 9:47 AM, Harrison Chapman notifications@github.com wrote:

This would be superb. In my spare time I was going to try to write something like this.

Reply to this email directly or view it on GitHub: https://github.com/sampsyo/beets/issues/116#issuecomment-15848567

hchapman commented 11 years ago

This is my first working with beets, so mind how naive I am.

I've written a really silly (and still crufty) plugin which tries to use eyed3 to read the Xing header of the mp3 to determine the encoding preset (e.g. v0, cbr 320) and adds a $quality name template variable. It'd be nice to know the presets because if I were quality trumping, I'd personally take v0 over a 320, but a 320 over a v2, and I do not believe the ABR is enough information to do this reliably always.

I'd like to make this into a plugin, since right now it depends on eyed3, which I do not believe beets presently does. Any chance that the plugin docs are out of date, and that the database is now extensible by plugins? It would be nice to be able to beets ls quality:v0 or similar to help sort through the library without having to inspect every single mp3, and I don't if there is another way to implement this type of behavior.

I haven't given any look into modifying the replacement decisions via a plugin yet: my brief scan of the plugin docs didn't seem to include much information on this. I'll have to scan the source.

mister-walter commented 11 years ago

Check out this thread on the Google Groups forum for information on storing arbitrary data in the beets library. Sampsyo indicated that such a feature (flexattrs) might be coming in beets version 1.2.

sampsyo commented 11 years ago

Very cool! Nice work, and not silly at all. :baby_chick: :star2: I really like this approach.

Regarding the dependency on eyed3: yes, we'll need to replace that with an equivalent parser for Mutagen/MediaFile. Do you know how eyed3 determines the LAME preset? Is it by parsing the "encoder" string? If so, then we can re-implement that in a fairly straightforward way.

As @mister-walter mentioned, yes, we're currently in the middle of building support for arbitrary metadata fields in the database. See also #101 for the current status of this feature. I'm planning to make it the first big feature in 1.2, which will start development in earnest soon when 1.1 is out the door (very nearly ready). This is yet another good use case for that feature.

hchapman commented 11 years ago

I think eyed3 uses the xing header which is in the blank first frame of the mp3. Encoders like LAME use it, and possibly some others as well.

I don't know how eyed3 uses this data to determine/guess the encoding preset! It's open source though...

sampsyo commented 11 years ago

Cool. That's worth investigating to see how complex the logic involved is.

hchapman commented 11 years ago

By lifting the LameHeader class and three utility functions from eyed3, I've added super rudimentary LAME header support to mutagen in this branch (super rudimentary means that it just tries to find the LAME header, and spits it out while it opens an mp3 file, with no actual functionality). It's understandable that this code isn't in mutagen, which doesn't seem to be concerned with this extra data (especially since it may or not have an equivalent in any other type of media file), but it was interesting to see how it's certainly not impossible to add it.

To test:

import mutagen.mp3
mutagen.mp3.Open(MP3_FILE_NAME)

sampsyo commented 11 years ago

Cool; thanks. Looks like this header is actually substantially more complex than I originally imagined—the Xing header is yet another arcane binary format. FWIW, there's discussion (years old now) on the Mutagen bug tracker for including similar functionality: http://code.google.com/p/mutagen/issues/detail?id=66

There's a patch but it needs some cleaning up. Maybe the best course of action here would be to pick up that patch, which seems to have been abandoned by the original author, clean it up, add tests, and land it in Mutagen for the next release. (The Mutagen maintainers have been pretty receptive lately.)

Who knew this little bit of data would be so onerous to recover?

hchapman commented 11 years ago

That explains a lot; I didn't even consider mutagen wouldn't be hosted on github, and I didn't think to look at googlecode/bitbucket. I'll give that a look over.

hchapman commented 11 years ago

The patch applies to present master branch! It works for one of my songs, but none of the test files included with mutagen at present have any preset data :(.

hchapman commented 11 years ago

I've put the functionality merged into today's master/default (whatever it's called with hg!) with testing on my Bitbucket fork of mutagen. I've also told the mutagen issues tracker. I haven't heard anything on the issue yet, though.

sampsyo commented 11 years ago

Awesome, thanks for taking action! Let's see what the authors think; if necessary, we can email the Quod Libet mailing list in a few days.

hchapman commented 11 years ago

Well- it does not seem like there is any "way" for a plugin to, at present, influence the importer in checking for duplicates, or even to modify the manual duplicate resolver in the UI (e.g. make it say "we have this album as MP3 in the database, but you're importing a FLAC"). Correct me if I'm wrong! I'm going to think about how to add this broad functionality in.

sampsyo commented 11 years ago

That's correct; the hook doesn't exist yet. We'll need to add it in the resolve_duplicate method of ImportSession in importer.py.

hchapman commented 11 years ago

Ideally the plugin (at least one which knows the user's preference to replace MP3 with FLAC) would be UI-agnostic, and as of now resolve_duplicate just throws a NotImplementedError (and the implementation in ui doesn't call the super method as of yet). My initial thought was to add the hook right before resolve_duplicate is called.

I'm not used to thinking this much about software infrastructure. It's fun! Please let me know if the issues thread is not an appropriate place for my musing...

sampsyo commented 11 years ago

Yes, absolutely—you're right that invoking the plugin before the call to resolve_duplicate makes more sense. That way, subclasses can override that method without replacing this new functionality. Good call.

And yes, absolutely, this issue tracker is the perfect place for this kind of discussion! To me, this is more organized and therefore useful than the mailing list or ad-hoc emails... there's some hope of coming back to the discussion later when we need to remember the decisions we made.

hchapman commented 11 years ago

Moral conundrum: I'm not used to parallel programming so... It'd be nice to add task duplicate data into Importer task, so that plugins could add or remove duplicates automatically by some score calculation... It's easy to me to just carry along duplicate data as, say, a list, rather than calling _check_duplicates twice. But this crumbles under the completely possible scenario where a user has e.g. FLAC and MP3 in the import query for the same album, and the user would like the FLAC to auto trump the MP3. But it's not clear to me that we can just reliably hold on to any ad-hoc album data added to recents in user_query and expect it to still be correct by the time the better quality comes through...

I originally hoped this wouldn't need any sweeping infrastructure changes. Just saying where I'm stumped, and how the coroutine solution is pretty brilliant and still magic to me..

sampsyo commented 11 years ago

Yes, the recent check is a complicated wrinkle in all of this.

One simplification that may help: all tasks hit the database in order. That means that, in the apply_choices coroutine, duplicate albums are already guaranteed to be in the database, even if they were just imported. That means that if the plugin can make all of its decisions there, without user input, then its job is simplified.

So a plugin could prompt the user (or not) based on the limited data available in recent during user_query and then do the heavyweight comparisons later, in apply_choices. Would that make sense?

hchapman commented 11 years ago

Here's a branch with a preliminary solution. I've moved everything dealing with duplicates (including the user query...) to the apply_changes. It's pretty much entirely untested, and I'm presently not really happy with the implementation.

This beetsplug uses it and it also uses my branch of mutagen (about which I've still heard nothing, although I've still only just posted on the googlecode issue) to get preset data. It's a rudimentary quality trumper which prefers FLAC > V0 > 320 > [etc...]

hchapman commented 11 years ago

I've updated my branch so that all tests pass, but I haven't written any new tests. Curious question: there was an issue in _duplicate_check where cur_artist threw an AttributeError... But the same never happened in the item counterpart...

hchapman commented 11 years ago

It just hit me why moving a user query for the duplicate check is not a good solution. I'm also trying to work in a score system for duplicate replacement calculation instead of a binary take-it-or-leave-it

sampsyo commented 11 years ago

Great; I'm glad that makes sense. Yes, all communication with the user has to be done in the same pipeline stage.

hchapman commented 11 years ago

I've changed my strategy to a "trump score" system paralleling the distance that's in now for MB matching. Progress on this branch is preliminary, but works for a really rough manual test I've run myself. Automated tests don't fail, but there are no new ones to test my code.

sampsyo commented 11 years ago

Interesting approach! I like the change of recent from a set to a dict so we keep around the album itself rather than just its identifier. We should definitely use this (but ensure that it doesn't leak memory...).

And the scoring system is interesting too, but I can also see a more specific interface (which might ask "given these albums, which is the best?" instead of "what's the score for this album?") being somewhat simpler. It would push some of the intelligence into the plugins rather than burdening the import flow more, which is already frighteningly complex. Does that make sense?

On Apr 24, 2013, at 1:33 PM, Harrison Chapman notifications@github.com wrote:

I've changed my strategy to a "trump score" system paralleling the distance that's in now for MB matching. Progress on this branch is preliminary, but works for a really rough manual test I've run myself. Automated tests don't fail, but there are no new ones to test my code.

— Reply to this email directly or view it on GitHub.

hchapman commented 11 years ago

The primary reason behind the score would be an (immutable, i.e. not changing after the tracks are imported into the library) number which wouldn't require holding on to any further data of items being imported. Perhaps plugins would be able to declare which properties about recent imports to hold on to? This doesn't really parallel anything in beets that I can think of at present however... Comparing relatively would involve holding on to some of this recent import information specifically, and since the changes haven't been applied yet, I'm not sure how messy this could get.

An example reason for this would be quality trumping-- right now quality trumping involves having the mutagen file info somewhere and I don't know for sure if the actual import/moving into library could cause some sort of rare race conditions when trying to grab/parse this information from tasks long finished.

Certainly with a score system the idea would be to present the user with a choice if the new import score doesn't exceed some threshold (which would ideally only happen if the new import completely eclipses an old; better quality, all the same tracks, same length, same release year...), perhaps with some delta information to explain to the user why we're scoring duplicate choices as such.

sampsyo commented 11 years ago

Scoring seems just fine as a general approach to this problem; I'm just questioning whether the logic that handles scores should be done in importer.py or in qualitytrump.py. If at all possible, it would be great to move as much as possible (e.g., finding the max, totaling up track_scores and album_scores, etc.) to plugins to simplify the importer itself as much as possible. Of course, it's possible there's no clean way to do that.

To avoid races, one approach you could take in the plugin is to calculate scores when import tasks are first created (handling the import_task_start event) so you have all scores available for later comparisons. You can keep an internal map from tasks to scores or just assign your scores right on the task objects (thanks, Python!).

telejester commented 11 years ago

Trump attachment processing is a bit less open-ended by adding the semantic framework of #111. In my original ticket for Trumping on gcode, I mentioned image handling and cue and log files.

In that vein, I just added some more specific ideas to the attachment ticket (gcode 109). In the name of redundancy prevention, please see https://code.google.com/p/beets/issues/detail?id=109#c21

Trump processing needs to consider the attachments of each candidate album and item as they are moved. Images ought to be trumpable as well, with higher-res images replacing lower-res versions of the same image. I pointed at a resolution and scale-independent image comparison library which ought to help a lot in determining "sameness" although I can't vouch for how well it works in real life.

Cue and log files are interesting in that they are only valid if an album consists of the complete set of items they refer to. Any item or album trumping means the cue and log have to go, even if the new version doesn't have replacements.

telejester commented 11 years ago

I also have no idea what to do with assets which have been trumped. I suppose the options are "delete" (which I find terrifying) and "mark and move", but to where? A "discard" path?

hchapman commented 11 years ago

beets already seems to be content skipping/removing albums at user query, so perhaps the trumping could (possibly unless the user informs otherwise) provide a summary of/query for the destructive actions it will take at the end of an import? I am not entirely clear if this fits into the structure/morals of the project however.

telejester commented 11 years ago

I wouldn't want to change the ui assumptions that much.

The more I think this through, the more I like adding a 'discard' path which has the same organization as the library tree except items and attachments which are moved there are removed from the beets db.

There ought to be "delete" config option for the brave (or disk-space-limited) but I think a nondestructive default is wise.

I like this option because the discarded assets would be kept organized and tagged, but outside the library tree and database. That way, if there is a bug or some error is made, the affected assets could be found and re-imported easily but as far as beets is concerned, once discarded they are out of the db, don't exist any more, and are no longer beets' responsibility.

sampsyo commented 11 years ago

Interesting! I like the "discard pile" idea. We could even use the system trash for this purpose, as in #149.

sampsyo commented 11 years ago

A couple of notes on this:

The project called Dnuos has code that detects LAME presets in MP3 headers.
See also #287, which proposes using album-level fields in path formats—specifically quality/bitrate information.

hchapman commented 11 years ago

Dnuos is, I believe, the source of the original mutagen patch proposed years ago, which I tried to make admissible (but have since heard nothing). On May 24, 2013 12:14 PM, "Adrian Sampson" notifications@github.com wrote:

A couple of notes on this:

The project called Dnuos has code that detects LAME presetshttps://bitbucket.org/brodie/dnuos/src/01f6eb7e49bf3bf8fce964e3630e202754e21334/dnuos/audiotype.py?at=default#cl-487in MP3 headers.

See also #287 https://github.com/sampsyo/beets/issues/287, which proposes using album-level fields in path formats—specifically quality/bitrate information.

— Reply to this email directly or view it on GitHubhttps://github.com/sampsyo/beets/issues/116#issuecomment-18414680 .

sampsyo commented 11 years ago

Aha, good to know.

Maybe we should ask on the mailing list about merging that patch? Or find a way to do the detection in MediaFile instead?

squisher commented 10 years ago

Ping, how come this patch is collecting dust? :-) I need something like this before I can attempt to actually put beets to use...

sampsyo commented 10 years ago

There are a few outstanding issues awaiting a champion:

[ ] Add LAME preset detection. This is actually orthogonal and should maybe have its own ticket.
[ ] Add logic for short-circuiting the duplicate detection. This should be in core. In its most basic form, a config option makes the decision.
[ ] Either in core or in a plugin, develop a flexible/complex set of rules for making decisions automatically. This can come after the basic work is done.

Please feel free to chip in on any of the steps.

daks commented 10 years ago

This issue (the original one on googlecode) resume exactly what I want. I'll track it.

johtso commented 10 years ago

This would be a very useful feature to have!

Alderian commented 10 years ago

Hi,

This would be very nice, I'm doing this manually with 'duplicates' plugin using "beet dup -F -f '$path $bitrate'"... and then a manual rm... very simple, but painful

It would be useful to have the list of duplicates any time an import finds them. I mean, program will find a file already in library and have the newcomer. What if the program just detail the tags, that are different, of the two of them, this way I could choose wisely. This could be done the same way the program tells whenever its correcting tags. Ie:

/foo/bar/Megadeth/Rust in Peace/06 Lucretia.mp3 Tagging track: Megadeth - Lucretia (Similarity: 100.0%) This item is already in the library! Candidates:

Old (mp3, 128kbps)(encoder, bitrate)
New (mp3, 196Kbps)(encoder, bitrate) 1, 2 to keep only that file, [S]kip new, Keep all, Remove old?

I'm considering that I might want to have lots of versions of any file.

With that information, one can easily decide what to do... also, I guess that adding this would be easier, just to wait for the other options. I don't know how to program in python to add this functionality myself :( I might learn, fork and contribute :)

Just my 2 cents

Thanks for this great program...

franciscolourenco commented 9 years ago

I'm also interested in having V0 V2 320kb available to use in paths

franciscolourenco commented 9 years ago

Would it be possible to specify which identifiers are used to detect duplicates? Just like %aunique identifiers, but for duplicate management, so that beet importdoesn't ask about duplicates if, for example, the formats are different.

sampsyo commented 9 years ago

@aristidesfl Sounds like a reasonable extension, yes.

Schweinepriester commented 9 years ago

push :)

lazka commented 9 years ago

mutagen trunk (next version) will now expose lame version and bitrate mode. Feedback welcome.

See https://bitbucket.org/lazka/mutagen/issues/66

sampsyo commented 9 years ago

Fantastic! Awesome work. I'll look into exposing this in beets.

ekjaker commented 8 years ago

Fantastic idea this addition to have this at import!

"Would it be possible to specify which identifiers are used to detect duplicates? Just like %aunique identifiers, but for duplicate management, so that beet importdoesn't ask about duplicates if, for example, the formats are different."

+1 on that!!

And also, at import the option to set default actions would be very helpful, like always take new. Ofcourse, it this gets implemented to its full extent with a scoring system based on bitrate, availability of log/cue,... would be even better than a basic new/old option.

Joshfindit commented 8 years ago

Sorry to complicate this, but I have something to add:

Joint stereo vs Separate stereo. Personally, I'd much rather have a 192kbps separate stereo copy than a 320kbps joint stereo copy.

lazka commented 8 years ago

Personally, I'd much rather have a 192kbps separate stereo copy than a 320kbps joint stereo copy.

Why?

smlx commented 8 years ago

Personally, I'd much rather have a 192kbps separate stereo copy than a 320kbps joint stereo copy.

The mid-side stereo transform is lossless, and reduces redundant encoding of common information between channels.

beetbox / beets

Quality-based Trumping during import #116

This would be superb. In my spare time I was going to try to write something like this.