Normalize the database - Githubissues

lGuillaume124 commented 5 years ago

@gs11 @MightyCreak you contributed a lot to Sonerezh and I would like to have your opinion about this important new step.

As you know by now the database has a big table songs in which all the Sonerezh's stuff is stored. This lead to performance issues on views, duplicate information in the database, no flexibility, etc.

I started to think about the new schema, and to normalize it. Here is what I have, instead of one single table:

A table called bands with 2 columns: id and name
A table called albums with 4 columns: id, name, cover and band_id
A tabel called tracks with all the stuff related to the metadata plus a column album_id

We have relationships between these tables:

A band can have multiple albums
An album belongs to one band
A track belongs to one album

Let's illustrate it:

schema

With this schema, I'll be able to create very good new views like one dedicated to a band or an album with custom images / banner from fanart.tv or whatever, just like Spotify and others does. The interface should be more reactive and the database shouldn't be overloaded anymore when you visit the /albums page.

But, the import process should be far much longer because for each file imported we must:

Check if the band already exists in the band table, or create it otherwise
Then check if the album already exists in the album table, or create it otherwise
Then save the track metadata

Any help from anyone will be very appreciated :)

gs11 commented 5 years ago

While the import process might take a little longer the processed import batch will most probably contain multiple tracks with the same band & album. Thus, you only need to check once per batch, store the band/album id in memory to speed up the process.

Might I also suggest adding a boolean to album to be able to distinguish compilations from ordinary albums?

MightyCreak commented 5 years ago

Thanks @lGuillaume124 for letting us in the loop, that's really thoughtful!

I like this new database proposition. There are clear flaws in the current one that this one addresses. I'm pretty sure that the first design that we'll make won't be perfect, but I think it's needed to go forward and fix what we know is already bad, we'll see later if we can do better :wink:

Onto the review now:

I don't think the import performance is really an issue. I don't really care if it takes 5 minutes instead of 3 to import all my library, but I do care that the main interface is fast and responsive. Moreover, I'm not sure it is going to have a huge impact on the import: of course, it was simpler before, but is was also a lot more data to store. And now, it's a bit more complex indeed, but there are much less redundant data to store in the DB, so all in all, it might not have the impact you think it could have. I would not be surprised actually if it were faster, since CPU is thousands of times faster than storage. So while I agree with @gs11 that optimizations are possible here, I think the best is to implement it in its simple design, and see how it goes, and only then starting to optimize (always profile before optimizing :slightly_smiling_face: ).
I think albums should have another column for the year. A track could also have a year as well, in case of a compilation album for instance, but what is shown in the UI is the year of the album, not the year of each song.
I guess it was implied, but I'd rather write it, to be sure: every metadata that is stored in bands and albums must be removed from tracks.

And that's pretty much it! I can't wait to see the new DB design, as you said, it will allows us to do very cool new features!

@gs11 :

Might I also suggest adding a boolean to album to be able to distinguish compilations from ordinary albums?

I don't see what issue it addresses :confused: How will it be presented in the UI? And how will you be able to know, at import time, which album is a compilation and which is not?

gs11 commented 5 years ago

@MightyCreak

I think albums should have another column for the year. A track could also have a year as well, in case of a compilation album for instance, but what is shown in the UI is the year of the album, not the year of each song.

I think the genre column should be moved to album as well as you'll probably want to browse albums by genre. An issue is of course the quality of the tags. This would require the first track to be imported from an album to actually contain year/genre for them to be stored for the album.

@MightyCreak

I don't see what issue it addresses 😕 How will it be presented in the UI? And how will you be able to know, at import time, which album is a compilation and which is not?

Yeah, I realize this is may be less useful and more complex to implement. It's a nice feature in Logitech Media Server (LMS) to browse for compilations but I think LMS actually looks at albums where the artists differ for tracks and not the (kind of ugly) tag ITUNESCOMPILATION . In short - nevermind 🙂

lGuillaume124 commented 5 years ago

I think albums should have another column for the year. A track could also have a year as well, in case of a compilation album for instance, but what is shown in the UI is the year of the album, not the year of each song.

So the year field of album will correspond to the year of the first track imported for this album, right?

MightyCreak commented 5 years ago

Yep, that's it. And the same goes for the genre, as @gs11 mentioned.

I don't think it's too problematic to have some heuristic like that (e.g. considering that some metadata in the first track in valid for all the tracks in the album). In a later milestone, it could even be possible to display when there are some descrepancies in the metadata of tracks of the same album, so that the admin would be able to fix them and re-upload the tracks, and thus helping him or her to keep a nice and clean library.

csdt commented 5 years ago

Hello everyone, I have some suggestions about the new database scheme.

The first thing I want to propose is a table for artists. And a track/song might have multiple artists (featuring for instance).

Also, the genre is not only a property of a track, but also of an album and even an artist. And the genre might not be unique either.

I would propose the following scheme (ignoring for now how to populate it from tags):

CREATE TABLE `artists` (
    `gid` INT NOT NULL AUTO_INCREMENT,
    `name` varchar NOT NULL UNIQUE,
    PRIMARY KEY (`gid`)
);

CREATE TABLE `genres` (
    `gid` INT NOT NULL AUTO_INCREMENT,
    `name` varchar NOT NULL,
    `description` varchar,
    PRIMARY KEY (`gid`)
);

CREATE TABLE `bands` (
    `gid` INT NOT NULL AUTO_INCREMENT,
    `name` varchar NOT NULL,
    `inherits_genres` BOOLEAN NOT NULL DEFAULT '0',
    PRIMARY KEY (`gid`)
);

CREATE TABLE `albums` (
    `gid` INT NOT NULL AUTO_INCREMENT,
    `name` varchar NOT NULL UNIQUE,
    `year` INT,
    `band_id` INT,
    `ndiscs` INT NOT NULL DEFAULT '0',
    `inherits_genres` BOOLEAN NOT NULL DEFAULT '0',
    `inherits_artists` BOOLEAN NOT NULL DEFAULT '1',
    PRIMARY KEY (`gid`)
);

CREATE TABLE `discs` (
    `gid` INT NOT NULL AUTO_INCREMENT,
    `disc_number` INT,
    `album_id` INT NOT NULL,
    `ntracks` INT,
    PRIMARY KEY (`gid`)
);

CREATE TABLE `tracks` (
    `gid` INT NOT NULL AUTO_INCREMENT,
    `title` varchar,
    `source_path` varchar NOT NULL UNIQUE,
    `track_number` INT,
    `year` INT,
    `updated` TIMESTAMP NOT NULL,
    `disc_id` INT NOT NULL,
    `inherits_genres` BOOLEAN NOT NULL DEFAULT '0',
    `inherits_artists` BOOLEAN NOT NULL DEFAULT '1',
    `inherits_year` BOOLEAN NOT NULL DEFAULT '1',
    PRIMARY KEY (`gid`)
);

CREATE TABLE `artist_band` (`artist_id` INT NOT NULL, `band_id` INT NOT NULL);
CREATE TABLE `artist_album` (`artist_id` INT NOT NULL, `album_id` INT NOT NULL);
CREATE TABLE `artist_track` (`artist_id` INT NOT NULL, `track_id` INT NOT NULL);

CREATE TABLE `genre_artist` (`genre_id` INT NOT NULL, `artist_id` INT NOT NULL);
CREATE TABLE `genre_band` (`genre_id` INT NOT NULL, `band_id` INT NOT NULL);
CREATE TABLE `genre_album` (`genre_id` INT NOT NULL, `album_id` INT NOT NULL);
CREATE TABLE `genre_track` (`genre_id` INT NOT NULL, `track_id` INT NOT NULL);

ALTER TABLE `albums` ADD CONSTRAINT `albums_fk0` FOREIGN KEY (`band_id`) REFERENCES `bands`(`gid`);
ALTER TABLE `discs` ADD CONSTRAINT `discs_fk0` FOREIGN KEY (`album_id`) REFERENCES `albums`(`gid`);
ALTER TABLE `tracks` ADD CONSTRAINT `tracks_fk0` FOREIGN KEY (`disc_id`) REFERENCES `discs`(`gid`);

ALTER TABLE `artist_band` ADD CONSTRAINT `artist_band_fk0` FOREIGN KEY (`artist_id`) REFERENCES `artists`(`gid`);
ALTER TABLE `artist_band` ADD CONSTRAINT `artist_band_fk1` FOREIGN KEY (`band_id`) REFERENCES `bands`(`gid`);
ALTER TABLE `artist_album` ADD CONSTRAINT `artist_album_fk0` FOREIGN KEY (`artist_id`) REFERENCES `artists`(`gid`);
ALTER TABLE `artist_album` ADD CONSTRAINT `artist_album_fk1` FOREIGN KEY (`album_id`) REFERENCES `albums`(`gid`);
ALTER TABLE `artist_track` ADD CONSTRAINT `artist_track_fk0` FOREIGN KEY (`artist_id`) REFERENCES `artists`(`gid`);
ALTER TABLE `artist_track` ADD CONSTRAINT `artist_track_fk1` FOREIGN KEY (`track_id`) REFERENCES `tracks`(`gid`);

ALTER TABLE `genre_artist` ADD CONSTRAINT `genre_artist_fk0` FOREIGN KEY (`genre_id`) REFERENCES `genres`(`gid`);
ALTER TABLE `genre_artist` ADD CONSTRAINT `genre_artist_fk1` FOREIGN KEY (`artist_id`) REFERENCES `artists`(`gid`);
ALTER TABLE `genre_band` ADD CONSTRAINT `genre_band_fk0` FOREIGN KEY (`genre_id`) REFERENCES `genres`(`gid`);
ALTER TABLE `genre_band` ADD CONSTRAINT `genre_band_fk1` FOREIGN KEY (`band_id`) REFERENCES `bands`(`gid`);
ALTER TABLE `genre_album` ADD CONSTRAINT `genre_album_fk0` FOREIGN KEY (`genre_id`) REFERENCES `genres`(`gid`);
ALTER TABLE `genre_album` ADD CONSTRAINT `genre_album_fk1` FOREIGN KEY (`album_id`) REFERENCES `albums`(`gid`);
ALTER TABLE `genre_track` ADD CONSTRAINT `genre_track_fk0` FOREIGN KEY (`genre_id`) REFERENCES `genres`(`gid`);
ALTER TABLE `genre_track` ADD CONSTRAINT `genre_track_fk1` FOREIGN KEY (`track_id`) REFERENCES `tracks`(`gid`);

To explain a bit:

An artist can have multiple genres (their "default" so to speak).
A band can have multiple artists (or none if they are unknown).
- A band may inherits the genres from its artists, and can have more genres.
An album has a single band (might be null?)
- An album may inherits artists from its band, and can have more artists
- An album may inherits genres from its band, and can have more genres
A disc has a single album (cannot be null)
A track has a single disc (cannot be null) and inherits a single album from the disc.
- A track may inherits artists from its album, and can have more artists
- A track may inherits genres from its album, and can have more genres
- A track may inherits the year of the album, and can override this year

This "recursive" design allows powerful combination of genres and artists and is very versatile. But it is also more complicated to populate in a useful way (it is always possible to populate only the last level of the hierarchy, but this breaks the usefulness of such hierarchy).

I hope this can inspire you in some way ;)

MightyCreak commented 5 years ago

I don't know how difficult it would be to implement this model, but I like it.

I like it especially because it helps to have a nice search algorithm (we could filter by genre for instance).

That being said, I'd like to be a bit pragmatic in this case. If it's too much hassle to do this design, I'd rather have the simpler design than nothing. So it's up to @IGuillaume124 to decide in the end 😉

csdt commented 5 years ago

I don't think it is difficult to implement, except for the automatic population from file tags. This automatic population can be made easy if you populate only the last level of the hierarchy, and let the user/admin populate the other levels.

But I agree, that's up to @lGuillaume124 in the end ^^

lGuillaume124 commented 5 years ago

Thank you @csdt I didn't expect such contribution!

Your proposition is very interesting, but I'm afraid it take too much time to me to implement it. I don't want to start something and abort in a few months because I'm stuck. I'm already struggling with the new simple schema I proposed earlier :confused:

As @MightyCreak said:

we'll see later if we can do better

I prefer starting to work on something smaller, and, why not, go to something like your proposition after several more iteration. Your work stay here, the issue will remain open :) I hope you understand.

So, does everyone agree with the following schema?

CREATE TABLE `albums` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) NOT NULL,
  `cover` varchar(37) DEFAULT NULL,
  `band_id` int(11) DEFAULT NULL,
  `year` int(4) DEFAULT NULL,
  `created` timestamp NOT NULL DEFAULT current_timestamp(),
  `updated` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `albums_bands_id_fk` (`band_id`),
  KEY `albums_name_index` (`name`),
  CONSTRAINT `albums_bands_id_fk` FOREIGN KEY (`band_id`) REFERENCES `bands` (`id`)
)

CREATE TABLE `bands` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) NOT NULL,
  `created` timestamp NOT NULL DEFAULT current_timestamp(),
  `updated` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `bands_name_index` (`name`)
)

CREATE TABLE `tracks` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title` varchar(255) DEFAULT NULL,
  `source_path` varchar(1024) NOT NULL,
  `playtime` varchar(9) DEFAULT NULL,
  `track_number` int(5) DEFAULT NULL,
  `disc` varchar(7) DEFAULT NULL,
  `year` int(4) DEFAULT NULL,
  `genre` varchar(255) DEFAULT NULL,
  `artist` varchar(255) DEFAULT NULL,
  `created` timestamp NOT NULL DEFAULT current_timestamp(),
  `album_id` int(11) DEFAULT NULL,
  `imported` tinyint(1) NOT NULL DEFAULT 1,
  `updated` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `tracks_albums_id_fk` (`album_id`),
  KEY `tracks_title_index` (`title`),
  KEY `tracks_artist_index` (`artist`),
  CONSTRAINT `tracks_albums_id_fk` FOREIGN KEY (`album_id`) REFERENCES `albums` (`id`)
)

csdt commented 5 years ago

Thank you @csdt I didn't expect such contribution!

Your proposition is very interesting, but I'm afraid it take too much time to me to implement it. I don't want to start something and abort in a few months because I'm stuck. I'm already struggling with the new simple schema I proposed earlier

I understand, and I'm completely fine with this.

Just a quick question about your scheme: Why track_number is an int while disc is a varchar? AFAIU, those two mp3 tags are formatted the same: either n/N or n. I would recommend to have:

CREATE TABLE `tracks` (
  ...
  `track_number` int(5) DEFAULT NULL,
  `max_track_number` int(5) DEFAULT NULL,
  `disc_number` int(5) DEFAULT NULL,
  `max_disc_number` int(5) DEFAULT NULL,
  ...
)

Also, I would recommend a varchar(4096) for the source_path to match PATH_MAX in linux.

MightyCreak commented 5 years ago

Completely agree with @csdt's comment.

I'm also wondering what is the purpose of imported in tracks?

And finally, I would really like to have the total time of an album displayed in the UI. It could be done in CPU (SELECT SUM(playtime) IN albums, tracks WHERE album_id = album.id or something like that) or it could be computed at import time and cached in the DB. What do you think?

csdt commented 5 years ago

And finally, I would really like to have the total time of an album displayed in the UI. It could be done in CPU (SELECT SUM(playtime) IN albums, tracks WHERE album_id = album.id or something like that) or it could be computed at import time and cached in the DB. What do you think?

Personally, I prefer to compute the sum every single time: like that, you are always sure your database is in a consistent state. Otherwise, you would have to fix the sum every time you modify the tracks of an album (adding/deleting tracks for instance).

Performance wise, if you have an index on tracks.album_id, it should be really fast to compute such a sum.

MightyCreak commented 5 years ago

I completely trust you on that one 😉

lGuillaume124 commented 5 years ago

I've added your proposition @csdt

I'm also wondering what is the purpose of imported in tracks?

Used during my tests, it shouldn't be here sorry.

So here we are!

CREATE TABLE `albums` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) NOT NULL,
  `cover` varchar(37) DEFAULT NULL,
  `band_id` int(11) DEFAULT NULL,
  `year` int(4) DEFAULT NULL,
  `created` timestamp NOT NULL DEFAULT current_timestamp(),
  `updated` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `albums_bands_id_fk` (`band_id`),
  KEY `albums_name_index` (`name`),
  CONSTRAINT `albums_bands_id_fk` FOREIGN KEY (`band_id`) REFERENCES `bands` (`id`)
)

CREATE TABLE `bands` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) NOT NULL,
  `created` timestamp NOT NULL DEFAULT current_timestamp(),
  `updated` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `bands_name_index` (`name`)
)

CREATE TABLE `tracks` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title` varchar(255) DEFAULT NULL,
  `source_path` varchar(4096) NOT NULL,
  `playtime` varchar(9) DEFAULT NULL,
  `track_number` int(5) DEFAULT NULL,
  `max_track_number` int(5) DEFAULT NULL,
  `disc_number` int(5) DEFAULT NULL,
  `max_disc_number` int(5) DEFAULT NULL,
  `year` int(4) DEFAULT NULL,
  `genre` varchar(255) DEFAULT NULL,
  `artist` varchar(255) DEFAULT NULL,
  `created` timestamp NOT NULL DEFAULT current_timestamp(),
  `album_id` int(11) DEFAULT NULL,
  `updated` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `tracks_albums_id_fk` (`album_id`),
  KEY `tracks_title_index` (`title`),
  KEY `tracks_artist_index` (`artist`),
  CONSTRAINT `tracks_albums_id_fk` FOREIGN KEY (`album_id`) REFERENCES `albums` (`id`)
)

Thank you everyone!

csdt commented 5 years ago

You seem to have forgotten to change source_path to varchar(4096). As MAX_PATH on linux is 4096, I think that important to change it.

Otherwise, I'm fine with it.

lGuillaume124 commented 5 years ago

Hi,

Some news of the project. I almost finished to rewrite the synchronization system (i.e the code to import / updated and clean the database) (#348). It's been a long and hard way, but I learned a lot of new things and the most difficult part of the refactoring is now done.

The new import page looks like this (sorry for the bad quality, click to enlarge): peek 20-01-2019 19-01

Errors and bad files are properly handled.

Now I focus on the CLI to allow basic, but reliable automation, then I will write the new controllers and views to use the new database schema, and Sonerezh 2.0.0 will be ready \o/

Nizhile commented 5 years ago

I take a quick look on the current database and on the new proposition. I do not know how are your files on disks, but mine are usually one directory/one album. Several albums contain tracks recorded by different artists (compilation/collaboration/...) Current release makes these tracks belongs to different album, even trigerring thumbnails, as it seems that albums are referenced by the key artist . album In next release, if the database schema uses a band table, what is its content? What is it good for? If two albums share the same title, the files are probably in different directory. Why not using path . album as key album? Currently, Greatest Hits are correctly recognized for the different artists. But songs with artist feat. somebody else trigger several albums. @lGuillaume124 , keep the database simple, no need to get a self-hosted MusicBrainz 😁

MightyCreak commented 5 years ago

@Nizhile For albums with various artists, setting only the artist tag on each track is not enough, you also need to set the band (a.k.a. album artist).

I don't think using string values as DB keys is a good practice. All in all, since the folder tree of a music library may vary from user to user, I think it's better to rely solely on the songs ID3 tags. It's more explicit and easier to fix: if there's a problem, look at the tags.

Nizhile commented 5 years ago

@lGuillaume124 I just realized that most of my songs metadata need review. MusicBrainz provides covers that trigger timeout on import due to image too large, and most of my old compilations do not provide any album artist (Various or whatever). Thanks for the work anyway.

MightyCreak commented 5 years ago

I encourage you to import your library progressively so you can detect which album is problematic (and remove it for the time being). In the end you'll only have (hopefully) a small list of music to re-tag.

lGuillaume124 commented 5 years ago

MusicBrainz provides covers that trigger timeout on import due to image too large

Yes, I noticed that, but unfortunately I can't do anything on this third-party library…

most of my old compilations do not provide any album artist

I'm using http://beets.io/ to manage my collection. It's the best tool I know to do such a work!

lGuillaume124 commented 5 years ago

@Nizhile here's what a properly tagged compilation should look like:

capture d ecran_2019-01-30_20-09-30

band tag is "Artistes Divers"
artist tag is displayed under each title, like "The Who" or "Cat Stevens"

Nizhile commented 5 years ago

@MightyCreak , I'll write some script to review the files 😁

@lGuillaume124, I usually use mp3tag for all tags update. Original files created using CDDB information 😕. Anyway, for more recent files, the current database schema is sufficient and rendering is great.

Nizhile commented 5 years ago

The digest for thumbnails should probably be changed in a next release. The example in https://github.com/Sonerezh/sonerezh/issues/345#issuecomment-459282747 shows an album with correct albumartist set. Anyway, thumbnails are computed using $metadata['artist'].$metadata['album']: there is one thumbnail for each track. This is probably useless and make browsers handle multiple copies in cache for the same cover. I'm using $metadata['band'].$metadata['album'] on my install instead, but my collection is currently small.

Nizhile commented 5 years ago

I'm using $metadata['band'].$metadata['album'] on my install instead, but my collection is currently small.

I changed once more this to hash directly image content as the thumbnail name. My collection is still a test one (8 albums) but even using very big covers, import does not trigger problem using php -S.

I take a look on 2.0 branch and I notice that there is no general table. Usually, you can use such a table to store the database schema version. Then application can trigger database migrations according to supported schema version.

lGuillaume124 commented 5 years ago

I take a look on 2.0 branch and I notice that there is no general table. Usually, you can use such a table to store the database schema version.

I didn't open any issue about this yet, but I was thinking to include such information in a new settings table. The current table isn't convenient to add new options and it could be refactored like:

create table settings
(
    id int auto_increment,
    name varchar(255) not null,
    value tinytext null,
    constraint settings_pk
        primary key (id)
);

create unique index settings_name_uindex
    on settings (name);

Sonerezh / sonerezh

Normalize the database #345