I'd like to improve the logic around mapping URLs to entities (e.g. https://foo.bandcamp.com/ to an artist or label) when seeding from Bandcamp or Tidal.
Right now, if a single entity has a relationship with the URL, then that entity is used when seeding. If multiple entities have relationships with the URL, then the one whose database name has the smallest edit distance to the name on Bandcamp/Tidal is used. The "credited as" field is never set, so the seeded field just shows the name as it appears in the database.
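A rough sketch of the current behavior as described above, in Python for illustration only. `Entity`, `edit_distance`, and `pick_entity_current` are hypothetical names, not the actual code, and I'm assuming "edit distance" means plain Levenshtein distance:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Entity:
    """Hypothetical stand-in for an artist or label linked to a URL."""
    mbid: str
    name: str


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the usual two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def pick_entity_current(linked: list[Entity], page_name: str) -> Entity | None:
    """Pick the linked entity whose DB name is closest to the page name.

    Note that a lone linked entity is always returned, however different
    its name is from the name on the page.
    """
    if not linked:
        return None
    return min(linked, key=lambda e: edit_distance(e.name, page_name))
```

With a single linked artist named "A", `pick_entity_current` returns that artist even when the page credits "A & B" or "B".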
This behavior is annoying in cases like this one:
Artist A has a relationship with https://a.bandcamp.com/.
Artist A releases an album at https://a.bandcamp.com/album/title credited to A & B.
The seeded page credits the album to A, and the editor needs to manually replace the credit with a pair of credits referencing both A and B (with the appropriate join phrase).
Or this one:
Artist A has a relationship with https://a.bandcamp.com/.
Artist A releases an album at https://a.bandcamp.com/album/title credited to B (not uncommon for artist-run labels).
The seeded page credits the album to A, and the editor needs to manually replace the credit with one referencing B (and possibly create B if they aren't already in the DB).
Setting the "credited as" field would probably make things worse in both of these cases, as I believe that it will hide the incorrect credits from the editor -- in both cases, the page would have a green field showing the name as credited on Bandcamp but actually linking to A's MBID.
I think it'd be better to only seed the MBID when the names on the page and in the database are very similar (edit distance of 1 or 2?). Any bigger differences probably need a human's attention, e.g. to split the credits into multiple artists or to consider creating a new artist. I should also require an exact match for very short names to handle cases like the made-up A and B one above (since the edit distance between those two strings is just 1).
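A minimal sketch of what that check could look like, again in Python for illustration. The function and constant names are hypothetical; `MAX_DISTANCE = 2` comes from the "1 or 2?" guess above, and `MIN_FUZZY_LENGTH = 4` is purely an assumption about what counts as a "very short" name:

```python
MAX_DISTANCE = 2      # largest edit distance still treated as "very similar"
MIN_FUZZY_LENGTH = 4  # assumed cutoff: shorter names must match exactly


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the usual two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def should_seed_mbid(db_name: str, page_name: str) -> bool:
    """Return True only if the two names are close enough to seed the MBID."""
    if db_name == page_name:
        return True
    # Very short names: a distance of 1 can mean a different artist entirely.
    if min(len(db_name), len(page_name)) < MIN_FUZZY_LENGTH:
        return False
    return edit_distance(db_name, page_name) <= MAX_DISTANCE
```

Under these assumptions, `should_seed_mbid("A", "B")` and `should_seed_mbid("A", "A & B")` are both false, so the field would be left for the editor to fill in, while a small difference such as "Foo Bar" vs. "Foo Bár" would still seed the MBID. Case or Unicode normalization before comparing is left out here and might be worth considering separately.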
I think it's probably okay to still leave the "credited as" field blank; online sources are a mess and it's probably safest to stick with the DB name.
I suspect that there will be some cases where this change would result in a match not being made where it actually should be (e.g. an artist name is stylized in a weird manner on Bandcamp, which I've seen often), but it's not the end of the world if the editor needs to click the search button and manually select the appropriate entity.