beetbox / beets

music library manager and MusicBrainz tagger
http://beets.io/
MIT License

Some releases don't get their works fetched #3308

Closed dosoe closed 4 years ago

dosoe commented 5 years ago

Problem

I recently implemented the work, mb_workid and work_disambig tags (https://github.com/beetbox/beets/pull/3272). My problem is that for some recordings, the works just don't get fetched. This mostly affects very big releases (20+ CDs, for example https://musicbrainz.org/release/9c5c043e-bc69-4edb-81a4-1aaf9c81e6dc or https://musicbrainz.org/release/9bcd75dd-995e-482b-8ba7-1ef074d253de ). I tried to trace the error by adding print statements to beets/autotag/mb.py and then running beet mbsync on the problematic releases. What I can see is:

While RELEASE_INCLUDES (in beets/autotag/mb.py) does contain 'work-rels' and 'work-level-rels' for the releases I'm looking at, TRACK_INCLUDES doesn't. Oddly, musicbrainzngs.VALID_INCLUDES['recording'] contains 'work-rels' but not 'work-level-rels'.

At line 494 in beets/autotag/mb.py, the result of musicbrainzngs.get_release_by_id(albumid, RELEASE_INCLUDES) doesn't contain any works, even though the recordings do have works and RELEASE_INCLUDES contains 'work-rels' and 'work-level-rels'.

I first checked musicbrainzngs.get_recording_by_id(recording['id'], TRACK_INCLUDES) for all the recordings, and it doesn't contain any works either, because TRACK_INCLUDES doesn't ask for them (the first point above). If I add 'work-rels' to TRACK_INCLUDES and call musicbrainzngs.get_recording_by_id(recording['id'], TRACK_INCLUDES) again, the result contains the works just fine.

So I'm wondering: why do we get the work relationships with musicbrainzngs.get_recording_by_id(recording['id'], TRACK_INCLUDES) but not with musicbrainzngs.get_release_by_id(albumid, RELEASE_INCLUDES), even though both ask for 'work-rels' (and get_release_by_id also asks for 'work-level-rels')?

A quick and dirty fix would be to call musicbrainzngs.get_recording_by_id(recording['id'], TRACK_INCLUDES) for each track. The problem is that one MusicBrainz query per track, instead of one per release, means a significant performance loss, although I haven't looked at it in much detail. It seems to me that musicbrainzngs doesn't send all the info we ask for when the release is very big. Could that be because the response is too big and there is a cap on the maximum size that can be sent?

I checked, of course: the affected recordings do have works on MB.

Setup

My configuration (output of beet config) is:

asciify_paths: yes
bell: yes
color: yes
alternatives:
    by-work:
        directory: ../beets_by-work

        paths:
            albumartist_sort::'Wise Guys': ZZZ_Non-classical/$albumartist_sort/$album %ifdef{albumdisambig,($albumdisambig),}/$disc-$track $title
            ...
        formats: link
library: /media/soergeld/EEC7-74A2/Musique/beets/Musicdata_sync.db
embedart:
    remove_art_file: yes
    compare_threshold: 0
    auto: yes
    ifempty: no
    maxwidth: 0

plugins: duplicates info mbsync badfiles fromfilename mbsubmit mbcollection absubmit fetchart embedart chroma parentwork alternatives

musicbrainz:
    user: dosoe
    pass: REDACTED
absubmit:
    auto: no
    extractor: /home/soergeld/Personnel/acousticbrainz-client/streaming_extractor_music

paths:
    default: $albumartist_sort/$album/$disc-$track $title
    singleton: ZZZ_noalbum/$artist_sort/$title
parentwork:
    force: no
    auto: no
missing:
    count: yes
chroma:
    auto: no
duplicates:
    tiebreak:
        items: [bitrate]
    keys:
    - mb_releasegroupid
    - track
    - disc
    - title
    count: no
    full: no
    format: ''
    move: ''
    tag: ''
    path: no
    copy: ''
    album: no
    strict: no
    checksum: ''
    merge: no
    delete: no

ui:
    color: yes
    colors:
        text_success: green
        text_warning: yellow
        text_error: red
        text_highlight: red
        text_highlight_minor: lightgray
        action_default: turquoise
        action: blue
directory: /media/soergeld/EEC7-74A2/Musique/beets

import:
    copy: no
    write: no
    incremental: no
    detail: yes
    languages: en
    duplicate_action: keep
    per_disc_numbering: yes
    bell: yes
acoustid:
    apikey: REDACTED
mbsubmit:
    format: $title - $artist ($length)
    threshold: strong
mbcollection:
    auto: no
    remove: no
    collection: ''
fetchart:
    auto: yes
    minwidth: 0
    sources:
    - filesystem
    - coverart
    - itunes
    - amazon
    - albumart
    google_engine: 001442825323518660753:hrh5ch1gjzm
    enforce_ratio: no
    cautious: no
    maxwidth: 0
    store_source: no
    google_key: REDACTED
    fanarttv_key: REDACTED
    cover_names:
    - cover
    - front
    - art
    - album
    - folder

sampsyo commented 5 years ago

Hmm; is it really just big releases that don't include work relationships? Can you link to an example where the works are included and one where they're not?

It can be instructive, when investigating MB behaviors like this, to take these and load the actual API response in a browser. Here's an API lookup for one of those albums, for instance: https://musicbrainz.org/ws/2/release/9c5c043e-bc69-4edb-81a4-1aaf9c81e6dc?inc=recordings+work-rels

This can let you quickly experiment with different releases and inc values to see what appears.
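For quick experiments of this kind, here is a minimal sketch of a helper that builds such lookup URLs. The names `MB_WS_ROOT` and `lookup_url` are made up for illustration; only the URL shape is taken from the link above.

```python
from urllib.parse import quote

# Root of the MusicBrainz web service, as seen in the lookup URL above.
MB_WS_ROOT = "https://musicbrainz.org/ws/2"


def lookup_url(entity, mbid, includes=()):
    """Build a lookup URL for one entity with the given inc values.

    Hypothetical helper: it only assembles the string and sends nothing.
    """
    url = f"{MB_WS_ROOT}/{quote(entity)}/{quote(mbid)}"
    if includes:
        # inc values are joined with "+", exactly as in the URL above.
        url += "?inc=" + "+".join(quote(inc) for inc in includes)
    return url
```

Calling `lookup_url("release", "9c5c043e-bc69-4edb-81a4-1aaf9c81e6dc", ["recordings", "work-rels"])` reproduces the URL above, and swapping the MBID or the inc list gives the other variants to paste into a browser.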

You might also be interested in reading the MusicBrainz API docs, which describe what relationships are legal for which entities: https://musicbrainz.org/doc/Development/XML_Web_Service/Version_2#Lookups

Please see if you can narrow down what's going on with MusicBrainz itself by talking directly to the API! If it seems to be doing something "wrong," we can file a bug with the MB folks.

I don't think we should make separate MB requests for every recording by default.

dosoe commented 5 years ago

Indeed, when I check https://musicbrainz.org/ws/2/release/9c5c043e-bc69-4edb-81a4-1aaf9c81e6dc?inc=media+recordings+release-groups+labels+artist-credits+aliases+recording-level-rels+work-rels+work-level-rels+artist-rels , which is the equivalent of musicbrainzngs.get_release_by_id(albumid, RELEASE_INCLUDES) (with the same RELEASE_INCLUDES), it does not show any work relationships:

<track id="19fbfd3b-92b6-3c1b-b7a6-703048a128a7">
  <position>1</position>
  <number>1</number>
  <title>Goldberg-Variationen, BWV 988: Aria</title>
  <length>113893</length>
  <artist-credit>
    <name-credit>
      <artist id="24f1766e-9635-4d58-a4d4-9413f9f98a4c">
      <name>Johann Sebastian Bach</name>
      <sort-name>Bach, Johann Sebastian</sort-name>
      <disambiguation>German Baroque period composer & musician</disambiguation>
      <alias-list count="40">
        <alias sort-name="Bach" type="Search hint" type-id="1937e404-b981-3cb7-8151-4c86ebfc8d8e">Bach</alias>
        </alias-list>
      </artist>
    </name-credit>
  </artist-credit>
  <recording id="d57d7065-020f-4648-b1ca-12c9ba72f78d">
  <title>Goldberg Variations, BWV 988: Aria</title>
  <length>113786</length>
  <artist-credit>
    <name-credit>
      <artist id="7002bf88-1269-4965-a772-4ba1e7a91eaa">
      <name>Glenn Gould</name>
      <sort-name>Gould, Glenn</sort-name>
      <disambiguation>pianist</disambiguation>
      <alias-list count="3">
        <alias type="Search hint" type-id="1937e404-b981-3cb7-8151-4c86ebfc8d8e" sort-name="1)Glenn Gould">1)Glenn Gould</alias>
        </alias-list>
      </artist>
    </name-credit>
  </artist-credit>
<alias-list count="4">
  <alias sort-name="Aria from Goldberg Variations BWV 988 (1955 recording) - Johann Sebastian Bach">Aria from Goldberg Variations BWV 988 (1955 recording) - Johann Sebastian Bach</alias>      
  </alias-list>
</recording> 
</track>

As can be seen, the artists and their aliases are there, but that's pretty much all. I would also have expected the recording dates and the instruments (Gould as pianist) to be there.

Now compare https://musicbrainz.org/ws/2/release/db49c56b-7e11-4cbc-8fcc-577a031e8cd6?inc=media+recordings+release-groups+labels+artist-credits+aliases+recording-level-rels+work-rels+work-level-rels+artist-rels , a release that contains exactly the same recordings (it's pretty much the first medium of the release above). There, the first track is much more detailed:

<track id="73e69279-d8c2-3a26-89ae-dc67535be2ee">
<position>1</position>
<number>A1</number>
<length>112693</length>
<artist-credit>
    <name-credit>
        <artist id="24f1766e-9635-4d58-a4d4-9413f9f98a4c">
            <name>Johann Sebastian Bach</name>
            <sort-name>Bach, Johann Sebastian</sort-name>
            <disambiguation>German Baroque period composer & musician</disambiguation>
            <alias-list count="40">
                <alias sort-name="Bach" type-id="894afba6-2816-3c24-8072-eadb66bd04bc" type="Artist name">Bach</alias>
            </alias-list>
        </artist>
    </name-credit>
</artist-credit>
<recording id="d57d7065-020f-4648-b1ca-12c9ba72f78d">
<title>Goldberg Variations, BWV 988: Aria</title>
<length>113786</length>
<artist-credit>
    <name-credit>
        <artist id="7002bf88-1269-4965-a772-4ba1e7a91eaa">
            <name>Glenn Gould</name>
            <sort-name>Gould, Glenn</sort-name>
            <disambiguation>pianist</disambiguation>
            <alias-list count="3">
                <alias type-id="1937e404-b981-3cb7-8151-4c86ebfc8d8e" type="Search hint" sort-name="1)Glenn Gould">1)Glenn Gould</alias>
            </alias-list>
        </artist>
    </name-credit>
</artist-credit>
<alias-list count="4">
    <alias sort-name="Aria from Goldberg Variations BWV 988 (1955 recording) - Johann Sebastian Bach">Aria from Goldberg Variations BWV 988 (1955 recording) - Johann Sebastian Bach</alias>
</alias-list>
<relation-list target-type="artist">
    <relation type-id="5c0ceac3-feb4-41f0-868d-dc06f6e27fc0" type="producer">
        <target>64078387-5ff3-43d1-b203-38f98ef74c24</target>
        <direction>backward</direction>
        <artist id="64078387-5ff3-43d1-b203-38f98ef74c24">
            <name>Howard H. Scott</name>
            <sort-name>Scott, Howard H.</sort-name>
            <disambiguation>classical music producer</disambiguation>
        </artist>
    </relation>
    <relation type="instrument" type-id="59054b12-01ac-43ee-a618-285fd397e461">
        <target>7002bf88-1269-4965-a772-4ba1e7a91eaa</target>
        <direction>backward</direction>
        <begin>1955-06-10</begin>
        <end>1955-06-16</end>
        <ended>true</ended>
        <attribute-list>
            <attribute type-id="b3eac5f9-7859-4416-ac39-7154e2e8d348">piano</attribute>
        </attribute-list>
        <artist id="7002bf88-1269-4965-a772-4ba1e7a91eaa">
            <name>Glenn Gould</name>
            <sort-name>Gould, Glenn</sort-name>
            <disambiguation>pianist</disambiguation>
        </artist>
    </relation>
</relation-list>
<relation-list target-type="work">
    <relation type-id="a3005666-a872-32c3-ad06-98af558e99b0" type="performance">
        <target>6934e59b-e82c-3050-b0cf-70907db1f1a3</target>
        <begin>1955-06-10</begin>
        <end>1955-06-16</end>
        <ended>true</ended>
        <work id="6934e59b-e82c-3050-b0cf-70907db1f1a3">
            <title>Goldberg-Variationen, BWV 988: Aria</title>
            <language>zxx</language>
            <language-list>
                <language>zxx</language>
            </language-list>
            <relation-list target-type="artist">
                <relation type-id="d59d99ea-23d4-4a80-b066-edca32ee158f" type="composer">
                <target>24f1766e-9635-4d58-a4d4-9413f9f98a4c</target>
                <direction>backward</direction>
                <artist id="24f1766e-9635-4d58-a4d4-9413f9f98a4c">
                    <name>Johann Sebastian Bach</name>
                    <sort-name>Bach, Johann Sebastian</sort-name>
                    <disambiguation>German Baroque period composer & musician</disambiguation>
                </artist>
                </relation>
            </relation-list>
            <relation-list target-type="work">
                <relation type-id="ca8d3642-ce5f-49f8-91f2-125d72524e6a" type="parts">
                    <target>1d51e560-2a59-4e97-8943-13052b6adc03</target>
                    <ordering-key>1</ordering-key>
                    <direction>backward</direction>
                    <work id="1d51e560-2a59-4e97-8943-13052b6adc03">
                    <title>Goldberg-Variationen, BWV 988</title>
                    </work>
                </relation>
                <relation type-id="51975ed8-bbfa-486b-9f28-5947f4370299" type="arrangement">
                    <target>8b683c5c-74d7-4be4-9157-0b706f2f904a</target>
                    <work id="8b683c5c-74d7-4be4-9157-0b706f2f904a">
                    <title>Goldberg-Variationen, BWV 988: Aria</title>
                    <disambiguation>catch-all for arrangements</disambiguation>
                    </work>
                </relation>
            </relation-list>
        </work>
    </relation>
</relation-list>
</recording>
</track>

As we can see, it contains the work title and other relations: composer, arrangements, performers and their instruments, producer, recording date, etc. It is the same recording and we asked for the same information, but we get much more of it for the smaller release.
So it seems that the problem is with musicbrainzngs. @Freso, do you know where I could submit the corresponding bug report?
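
To make this comparison less manual, here is a rough sketch that counts how many recordings in a saved API response carry a work relation list. The function names are made up for illustration. It assumes the XML has the shape of the excerpts above and that the response was saved raw (the excerpts above were copied from a browser, so characters like & are not escaped there).

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace, e.g. '{ns}recording' -> 'recording'."""
    return tag.rsplit("}", 1)[-1]


def work_rel_coverage(xml_text):
    """Return (recordings_with_work_rels, total_recordings)."""
    root = ET.fromstring(xml_text)
    total = with_works = 0
    for elem in root.iter():
        if local(elem.tag) != "recording":
            continue
        total += 1
        # Only direct children count: a <relation-list target-type="work">
        # nested inside a <work> describes work-to-work relations instead.
        if any(
            local(child.tag) == "relation-list"
            and child.get("target-type") == "work"
            for child in elem
        ):
            with_works += 1
    return with_works, total
```

Judging by the excerpts above, the big release should show no recordings with work relations, while the single-medium release should show them for its first track.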

sampsyo commented 5 years ago

Wait, I'm not sure this is a problem in musicbrainzngs, which is the name of the Python library—perhaps you mean the MusicBrainz server? There are details about the MB bug tracker on the wiki: https://musicbrainz.org/doc/Bug_Tracker

dosoe commented 5 years ago

https://tickets.metabrainz.org/browse/MBS-10230

dosoe commented 5 years ago

Answer from MB:

We don't return relationships for releases with more than 500 recordings, because otherwise they would just time out and not return anything at all. The best alternative for this is probably to browse recordings by release in this case.

dosoe commented 5 years ago

Should we implement a check, so that if the release has more than 500 tracks we get the data track by track?
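
A minimal sketch of what such a check could look like. The helper names are hypothetical, and it assumes the release is the dict returned by musicbrainzngs, with a 'medium-list' whose entries each have a 'track-list'. Note that it counts tracks, while the limit quoted above is about recordings, so the two numbers may differ slightly.

```python
# Limit quoted in the MusicBrainz answer above: releases with more than
# 500 recordings are returned without relationships.
MAX_RECORDINGS_WITH_RELS = 500


def count_tracks(release):
    """Count tracks across all media of a release dict."""
    return sum(
        len(medium.get("track-list", []))
        for medium in release.get("medium-list", [])
    )


def needs_separate_fetch(release, limit=MAX_RECORDINGS_WITH_RELS):
    """True if the release is too big for relationships to be included."""
    return count_tracks(release) > limit
```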

sampsyo commented 5 years ago

Makes sense!

I don't think we can do that by default—fetching every recording for a 500-track album will take a very long time, and it will be wasted if the user doesn't need work information. Maybe it should be behind a configuration option? Or maybe it could be part of the responsibility of the parentwork plugin, so the process comes off the "critical path" of the import process?

Also, I'm intrigued by this suggestion:

The best alternative for this is probably to browse recordings by release in this case.

Because this person didn't say "you have to fetch every recording individually," it suggests there may still be some way to fetch them all in bulk, by "browsing." Maybe that's worth looking into?
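
As a starting point, here is a sketch of paging through browse results. It takes the browse function as a parameter; the intent is that `musicbrainzngs.browse_recordings` could be passed in, which accepts `release`, `includes`, `limit` and `offset` and returns a dict with a 'recording-list' key. Whether the installed musicbrainzngs version accepts 'work-rels' as a browse include is something to verify, and `browse_all_recordings` itself is a made-up name.

```python
def browse_all_recordings(browse, release_id, includes, page_size=100):
    """Collect every recording of a release by paging through browse results.

    `browse` is any callable that takes the keyword arguments used below
    and returns a dict with a 'recording-list' key. The default page size
    of 100 is the largest the MusicBrainz API allows per browse request.
    """
    recordings = []
    offset = 0
    while True:
        page = browse(
            release=release_id,
            includes=includes,
            limit=page_size,
            offset=offset,
        )["recording-list"]
        recordings.extend(page)
        if len(page) < page_size:
            # A short (or empty) page means there is nothing left to fetch.
            return recordings
        offset += page_size
```

With a page size of 100, a 500-track release would need around 6 requests instead of 500, which is why browsing looks more attractive than fetching every recording individually.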