brasslock / rubyripper

Automatically exported from code.google.com/p/rubyripper
1 stars 0 forks source link

Use Musicbrainz instead of FreeDB for tag lookup #284

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
This is an enhancement request rather than a bug report.

I love Rubyripper and use it often. However, I often find myself having to
cut and paste tag data from Musicbrainz because the FreeDB data is wrong
and/or poorly formatted. I can cite examples if you'd like, but I think the
openness of Musicbrainz' data collection and editing process is clearly
better on the merits. 

Errors get fixed, for instance.

They have redistributable libraries and lots of sample code, although no
ruby that I can see. 

http://musicbrainz.org/

Original issue reported on code.google.com by uptownto...@gmail.com on 9 Mar 2009 at 11:24

GoogleCodeExporter commented 9 years ago
Last time I looked at the code I didn't understand what the heck they were 
doing. But
perhaps times have changed. If anyone is willing to help, providing some ruby 
sample
code for example, it would be greatly appreciated.

Original comment by rubyripp...@gmail.com on 5 Apr 2009 at 2:13

GoogleCodeExporter commented 9 years ago

Original comment by rubyripp...@gmail.com on 5 Apr 2009 at 2:13

GoogleCodeExporter commented 9 years ago
As I see it, freedb doesn't need to be replaced. But I can add Musicbrainz as a
choice perhaps. Please contribute to the wikipage I've set up.

Original comment by rubyripp...@gmail.com on 15 Apr 2009 at 6:09

GoogleCodeExporter commented 9 years ago
That seems entirely reasonable to me. People should be able use whatever they 
want.

As an aside, I'm not sure how far you've dug in the musicbrainz stack but I 
should
say that the mb site itself allows users to import FreeDB data for releases that
don't exist in mb. 

http://musicbrainz.org/freedb/freedb.html

I'll go look at the wiki

Original comment by uptownto...@gmail.com on 15 Apr 2009 at 6:24

GoogleCodeExporter commented 9 years ago
I've added you as a member, so you can edit the wiki pages.

Original comment by rubyripp...@gmail.com on 15 Apr 2009 at 6:37

GoogleCodeExporter commented 9 years ago
One can use FreeDB gateway to Musicbrainz in Rubyripper :
see http://musicbrainz.org/doc/FreeDBGateway

The only drawback is that you don't get the genre (see "Limitations" in the 
above
page). However it can be a solution while waiting for the full MB support.

Original comment by wizzim@gmail.com on 9 May 2009 at 5:11

GoogleCodeExporter commented 9 years ago
Hi, I work for MusicBrainz and I'd be interested in helping you guys get this 
in or
submitting a patch. What type of functionality would you be looking it in 
RubyRipper?

Original comment by oliver.g...@gmail.com on 10 Dec 2009 at 5:57

GoogleCodeExporter commented 9 years ago
Yes! Please add musicbrainz as an option!

Original comment by chrisg...@gmail.com on 17 Feb 2011 at 11:45

GoogleCodeExporter commented 9 years ago
Just finished a reasonably sized effort at adding MusicBrainz support to 
rubyripper.  In particular:

* It generally tries to mimic the existing freedb code architecture
* It uses the XML Web Service version 2 to handle queries.
* If we fail to settle on a single release (even using the heuristics described 
below), the MusicBrainz metadata will fall back on Freedb instead (although the 
functionality exists to support choosing a release much like it does in the 
freedb code)
* It may be turned off if need be with a preference setting
* It supports the same metadata values as freedb
* There are a few extra dependencies here: rexml, openssl, and base64.  These 
should all be present in the standard libraries with Ruby 1.9, so I don't 
expect there to be too many problems here.

It seems to work fine for me, based on my limited testing, and the rspec files 
for most of the functionality check out, so feel free to try it.  I've attached 
a single squashed commit which contains all of my changes needed to add 
MusicBrainz support.  You can try using "git am" to apply it, although you 
might have a little trouble forcing it to work without the 4 latest patches on 
Issue 485, since the commit IDs will be off.  I certainly hope it helps at any 
rate.

PLEASE NOTE: This patch hasn't had gettext run on it to collect the new 
translations from cliPreferences.rb, and since I'm not running this in a GUI 
environment, none of the preferences currently show up in the GTK+ preferences 
window (they only work from the CLI)!  These should probably be fixed before 
actually doing a real release!

Some notes on design, by the way:

DiscID Calculation:

It currently calculates discids manually (since there's no real standard 
program other than cdrdao which supports calculating MusicBrainz discids).  
Thus, due to not having direct access to the drive, I can't absolutely 
guarantee that the discids are correct.  That said, they should only be 
incorrect in absurdly rare cases where the last session of a multi-session disc 
has more than one track and either the first or last track in that session is a 
data track, but not both.  I've never actually come across such a disc, and 
strongly doubt they exist.

Similarly, because of the special handling of data tracks, like freedb-id 
calculation, if cdinfo (or, now, cdcontrol) is not present, MusicBrainz will be 
unable to identify Enhanced CDs.

Identical Metadata, Different Releases:

Since the Next Generation Schema has separated the concept of a tracklist from 
release and thus caused some explosion in the number of results which will be 
returned for a DiscID query, I've implemented some (admittedly not the most 
efficient) heuristics to try to automatically select a reasonable guess as to 
the correct release and reduce the number of "false" multi-release result-sets. 
 In particular, the preferMusicBrainzCountries and preferMusicBrainzDate 
preferences can be used to guide these heuristics.

preferMusicBrainzCountries is a comma-delimited list of country codes 
(including the MusicBrainz-specific countries XW for world-wide releases and XE 
for European releases) representing the (partial) order in which releases 
should be preferred based on their release country.  The default value 
"US,UK,XW,XE,JP" will prefer US releases to British releases, British releases 
to Worldwide releases, and so on.  Releases which belong to a country which is 
not in the list will be preferred less than any country in the list, and 
releases which LACK a country will neither be more nor less preferred than 
releases with a country (although there is a bias towards releases which have a 
country in the preferred country list).  An empty string will enforce no 
country preferences; all countries will be treated equally.

preferMusicBrainzDate may take one of three values to express the order of 
preference of releases by release date.  'earlier' (the default) will prefer 
earlier release dates to later ones, 'later' will do the opposite, and 'no' 
will treat all release dates equally.  Releases without a release date will be 
neither more nor less preferred than other releases, although there is a bias 
towards releases which have a date (except when 'no' is specified).

There is an additional setting, useEarliestDate, which, if set to true, will 
use the earliest possible release date to set the year, even if it does not 
belong to a CD.  This is useful for people who would prefer to have the 
original release year, rather than the release year of the particular CD (e.g. 
for albums originally released in the 60s and 70s on LP).  This isn't a very 
smart setting though; special editions of albums will ALSO be assigned the 
original release date, and recent albums originally released digitally will be 
given the date of their (earlier) digital release, rather than their physical 
release.

Genres:

MusicBrainz does not inherently support genres like freedb.  It does, however, 
support folksonomy tags of its releases, release groups, and artists.  I try to 
do my best to map from these folksonomy tags to ID3 genres where possible (a 
small number of folksonomy tags which are not letter for letter equal to an ID3 
genre are mapped as well, and this can be extended).  In particular, I try to 
look first at the most popular tags on the release group before falling back on 
the album artists (and for various artist albums, the track artists) if I can 
find no matching tags.  All other things considered equal, I also look at 
alphabetical order and artist order, so there's no guarantee that your David 
Bowie albums won't all be tagged as "Art Rock".  It's not the best solution, 
and no doubt many albums will be missing genres (they'll be set nil) but it 
works okay given what we've got.

Various Artists:

Unlike freedb, MusicBrainz handles Various Artists gracefully.  Compilations 
with many artists are usually tagged with an album artist of "Various Artists", 
while each track can be assigned a different artist.  We don't need to actually 
guess at this, and can just use this guideline to lead us.

It also supports Split Albums as well (albums where two or three artists share 
time on the album), and I apply a few heuristics to separate albums where two 
artists are collaborating on all tracks from true splits (where artists take 
turns with different tracks).  The former is treated as a single artist, while 
the latter is considered a various artists release.

Also, with the support for track-level artist associations, there's a bit of 
code to avoid tagging albums which have a single album artist as being a 
various artists album (otherwise, things like Disc 2 of the 30th Anniversary 
edition of Ziggy Stardust would be tagged as Various Artist albums)

Original comment by comradec...@gmail.com on 31 Oct 2011 at 7:53

Attachments:

GoogleCodeExporter commented 9 years ago
Addendum:

I suppose there's the question of character-set encoding...  I'm just using the 
data I get from REXML.  Since the XML from MusicBrainz is in UTF-8, it seems 
reasonable to assume that's what I'm getting as well, but I'm not sure, and 
haven't tested that (all of my tests have been strictly ASCII characters)

Original comment by comradec...@gmail.com on 31 Oct 2011 at 7:58

GoogleCodeExporter commented 9 years ago
“There is an additional setting, useEarliestDate, which, if set to true, will 
use the earliest possible release date to set the year, even if it does not 
belong to a CD.”

I would suggest instead simply setting the ORIGINALDATE (Vorbis) or TDOR (ID3 
2.4) or TORY (ID3 2.3) tags.

“Unlike freedb, MusicBrainz handles Various Artists gracefully…”

Does this patch use the MBID (89ad4ac3-39f7-470e-963a-56509c546377) of Various 
Artists to recognize a various artists release? or will it get confused by the 
other artists which are actually named Various Artists?

“things like Disc 2 of the 30th Anniversary edition of Ziggy Stardust would 
be tagged as Various Artist albums”

Is this not correct? It is a various artists album, with the album artist set 
to David Bowie.  That’s what shows up in MB anyway: 
http://musicbrainz.org/release/952f4c05-1941-37af-9844-b9f84faf195f

Original comment by ha...@hawkesnest.net on 31 Oct 2011 at 4:59

GoogleCodeExporter commented 9 years ago
Good to see somebody working on this :) Just a few random comments:

1. Regarding the disc ID calculation, have you considered using libdiscid? This 
is a well tested library, and there are wrappers for Ruby, e.g. mb-discid 
(http://rbrainz.rubyforge.org/).

2. The data returned by the web service is indeed in UTF-8.

Original comment by ph.wolfer on 31 Oct 2011 at 5:43

GoogleCodeExporter commented 9 years ago
With respect to Comment 11:

I agree that ORIGINALDATE/TDOR/TDOY would be the "correct" way of doing it, but 
for now I'm assuming that there are no changes elsewhere in the code, so as to 
keep the number of changes needed to add MusicBrainz support to a minimum.  
Even if support for tagging original dates is added, however, people may prefer 
that the year be set to the original year in any case (e.g. if their player of 
choice does not correctly support ORIGINALDATE/TDOR/TDOY)

Yes, this patch explicitly makes use of the Various Artists pseudo-artist (See 
analyzeResult in lib/rubyripper/musicbrainz/musicbrainzReleaseParser.rb, which 
starts at line 843 in the patch above) rather than relying on artist name.  The 
only time it should ever assume that a disc is a Various Artists disc is if 
there are at least two tracks which do not share the same artist(s), and even 
then, this requires that the album artist be the Various Artists pseudo-artist 
OR that the album have more than one name-credit (An example of the latter 
would be http://musicbrainz.org/release/822160f4-17bb-48c5-b62c-ec7e687dc94c 
with an album credit to "Bright Eyes" and to "Son, Ambulance", but tracks 
credited individually to one or the other, but not both.  Incidentally, this 
particular album, which serves as one of the rspec test cases, also 
demonstrates that we use the name-credit ("Son, Ambulance") rather than the 
artist's name ("Son Ambulance") if given)

For the purposes of Rubyripper, I believe that this would result in undesirable 
behavior; Disc 1 would be filed according to the normal naming scheme, while 
Disc 2 would be filed according to the Various Artists scheme, resulting in 
files from different discs of the same album being filed in completely 
different places.  Of course, the way I've implemented it DOES mean that a 
"Various Artists" release with multiple single-artist discs would end up having 
files named according to the normal naming scheme, but this can be fixed by 
treating ALL various artist albums as various artist discs.

With respect to Comment 12:

I did consider using libdiscid, but wanted to keep the implementation simple 
for the time being (and also because I have recently been working on a 
different library which would not only accomplish libdiscid's goals, but also 
address Issue 316 and enable ISRC/MCN and track-index detection without the use 
of any additional software packages like cdrdao or cdrtools: 
https://github.com/pipian/libcueify  Please note that the code there is still 
under heavy development, and doesn't work on anything but FreeBSD at the 
moment.)

Also, when I was talking about UTF-8, I meant that I suspect that REXML is 
giving me Ruby String objects which are 8-bit byte strings encoded as UTF-8, 
rather than (alternatively) wide character strings.

By the by (and this is for Bouke's sake), I do have a batch of single-commit 
changes which, collectively, merge in MusicBrainz support file by file, but 
there are quite a few (read 11 or 12) of these and didn't want to overwhelm him 
with them due to the size of the change (obviously).  I'd be happy to introduce 
them one at a time to get merged in piecemeal, although they (effectively) use 
git revision e7b24547885e as a fork point (although I am trying to keep up with 
the master branch)

Original comment by comradec...@gmail.com on 1 Nov 2011 at 1:52

GoogleCodeExporter commented 9 years ago
I've merged this HUGE thing with the master branch. This really is a lot of new 
code and is easily the biggest patch I've ever received :)

I will clean it up a bit, since there are some things I don't particularly like:
* I don't want a freedb and musicbrainz category. This should be one metadata 
category.
* I don't want to force the disc class to have logic in it which metadata 
subclass it should use. It should only trigger the metadata to be fetched.
* The cucumber tests are broken currently.
* Some of the files should be reorganized for clarity.

Original comment by boukewou...@gmail.com on 7 Nov 2011 at 6:51

GoogleCodeExporter commented 9 years ago
Ok. So I've got to the point that:
* the files are structured much better :)
* the cucumber feature tests are working once again :)
* the disc class still needs some more attention :(
* the unit tests are Sloooow :(. You added six seconds to them (they were 0.8 
seconds).

But only one thing at a time. I'd say this was already quite some work.

Original comment by boukewou...@gmail.com on 7 Nov 2011 at 10:14

GoogleCodeExporter commented 9 years ago
Yeah, the unit tests being slow are largely a result of the release-picking 
heuristic, which uses a rather complicated scoring algorithm in 
GetMusicBrainzRelease.multipleReleasesFound(releases) that's (at least) 
quadratic in the number of releases.  I've tried a couple of low-lying fruit to 
try to push it down (e.g. I use counters where the original code used arrays), 
but thus far it hasn't helped much.

It's also possible that the date-sorting code is slow (most cases will run 
through all four cases and only return based on the x <=> y operator).

In short, the GetMusicBrainzRelease tests could probably afford to be profiled 
to figure out where the slowest parts are causing a hangup.  I'll try to run it 
myself and see what I can get.

Original comment by comradec...@gmail.com on 8 Nov 2011 at 5:31

GoogleCodeExporter commented 9 years ago
After running some profiling, it turns out that, based on 10 trials, fully half 
the time (13 of 26 seconds) was being spent simply on reading the XML files 
with REXML, while a good chunk of the remainder (8 seconds) was in doing the 
XPath queries.  I'll see what I can do to speed it up (e.g. switching strictly 
to a stream-based SAX parser, skipping the complex document-building process.)

Original comment by comradec...@gmail.com on 8 Nov 2011 at 6:02

GoogleCodeExporter commented 9 years ago
I don't know if it's possible, but instead of parsing a complete XML, it might 
be possible to just parse a small part of it and include this as text in the 
spec file, just enough to test the specific case.

Are you sure there isn't a connection to the internet being setup somewhere? 
This is a known slowdown and should be prevented in unit tests.

Original comment by boukewou...@gmail.com on 8 Nov 2011 at 7:02

GoogleCodeExporter commented 9 years ago
No, I'm fairly sure, based on the Ruby-Prof results, that the internet is not 
being used.  REXML has a bit of a history with taking time to do parsing as 
well.  See: 
http://groups.google.com/group/comp.lang.ruby/browse_thread/thread/6b6848e0be089
a3e and 
http://townx.org/blog/elliot/ruby_tuesday_parsing_xml_with_rexml_document_and_re
xml_streamlistener

I could trim the sample XML down a bit, but on the other hand, I would like to 
make sure that the actual output from the server is being tested so that we 
won't be "surprised" by complex XML data returned by the server.  Like I said, 
I'll give a shot at using the SAX parser and skipping the entire 
document-building process, which seems to make up about a quarter of the time 
at the very least.  I suspect that by going with SAX and simply populating a 
custom data structure (since we don't care about most of the XML structure) we 
could cut load-times.

Alternately, rewriting the XML-handling code to use nokogiri 
(http://nokogiri.org/) or libxml-ruby (http://libxml.rubyforge.org/) would not 
only be easier, but could also gain us some performance by pushing the XML 
parsing into C code rather than entirely doing such in ruby.  Either would add 
extra dependencies on libxml2 and the appropriate XML library.

Original comment by comradec...@gmail.com on 8 Nov 2011 at 7:33

GoogleCodeExporter commented 9 years ago
You're right about the internet, the specs still work when the network is 
stopped. I'll have a look if I can improve it some. It would be wrong to add 
dependencies only to improve the unit test performance.

Original comment by boukewou...@gmail.com on 8 Nov 2011 at 5:51

GoogleCodeExporter commented 9 years ago
I think the spec tests should be rewritten with just the needed XML stuff in 
it. This could be achieved based on the information I've seen in:
http://svn.musicbrainz.org/mmd-schema/trunk/schema/musicbrainz_mmd-2.0.rng

Only include the XML that is neccesary to let the test succeed. The rest is 
noise and slows the test down a lot.

I do understand your concerns for the complete scenarios. Well, you might want 
to turn one a few of the more complex XML files into a cucumber feature test to 
be sure. These are ment to be more end-to-end.

Original comment by boukewou...@gmail.com on 8 Nov 2011 at 6:43

GoogleCodeExporter commented 9 years ago
Tried pruning some files, but it didn't really seem to have an impact on the 
rspec time (3.22s to 2.91s for me).  I've also switched the rspec tests which 
read the XML files to cache the data rather than re-read it for each test, but 
since most of the parsing is actually happening when trying to parse the artist 
tag XML (which happens with each query and can't be cached effectively without 
actually stubbing out part of MusicBrainzReleaseParser's implementation which 
is being tested), it hasn't helped much more (2.91s to 2.8s at best).

Still, it should be a little bit of an improvement.

Original comment by comradec...@gmail.com on 15 Nov 2011 at 12:51

GoogleCodeExporter commented 9 years ago
Thanks for your efforts here :) The specs take 4.5 seconds now on my machine, 
which is more than 25% overall improvement (it was 6.2). This is sufficiently 
fast in my opinion.

Original comment by boukewou...@gmail.com on 15 Nov 2011 at 6:43