Lyrics / lyrics-api

Dynamic website for the lyrics database
GNU Affero General Public License v3.0
2 stars 0 forks source link

Search refinement #6

Closed defanor closed 5 years ago

defanor commented 5 years ago

So far we've got full-text search, but it may be too fuzzy, and it's hard to avoid false positives with it. Perhaps it should be (partially) replaced (or combined) with preprocessing: apparently lyrics.wikia.org does just that, as mentioned in issue #5.

defanor commented 5 years ago

A bit more on lyrics.wikia.org: it seems to just drop everything after "(", "feat.", "ft.", in both song title and artist. Matching isn't case-sensitive, but it doesn't seem to handle misplaced diacritics. And there's a separate full-text search on the website.

defanor commented 5 years ago

Something like Text::Unidecode could be used for both database population and further querying, as a part of preprocessing without FTS5.

defanor commented 5 years ago

Pushed it into the exact-matching branch, to try out and see how it works.

defanor commented 5 years ago

Perhaps preprocessing should remove punctuation too, but seems to work fine, and can be improved later. I'll merge it tomorrow, unless something will come up.

FTS was one of the SQLite advantages, which isn't as relevant if we're doing preprocessing instead, although could as well continue with it.

defanor commented 5 years ago

Merged. Once we'll add aliases, these search fields should probably just become aliases, but this should be enough to proceed with issues #4 and #5.

defanor commented 5 years ago

In addition to handling different spellings and alternative band names, aliases are also needed to make the database usable by other software: Perl's unidecode is handy, but only usable from Perl programs, so it would be nice to provide alternative (and easier to compose) aliases as well. Maybe also marking their types (e.g., "alternative name", "a name processed with unidecode", "MusicBrainz ID", etc).

defanor commented 5 years ago

I'm thinking of using the following schema, though maybe should split aliases into 3 tables to avoid using alias_column (but then there will be duplication in schema):

CREATE TABLE lyrics (
       lyrics_id INTEGER PRIMARY KEY,
       artist TEXT NOT NULL,
       album TEXT NOT NULL,
       title TEXT NOT NULL,
       text TEXT NOT NULL
);

CREATE TABLE aliases (
       lyrics_id INTEGER NOT NULL,
       alias TEXT NOT NULL,
       alias_column TEXT NOT NULL,
       alias_type TEXT NOT NULL,
       FOREIGN KEY (lyrics_id) REFERENCES lyrics(lyrics_id)
);
CREATE INDEX idx_aliases_column_and_type ON
       aliases (alias_column, alias_type, alias);
CREATE INDEX idx_aliases_column ON aliases (alias_column, alias);

Test data:

insert into lyrics (artist, album, title, text) values ('foo', 'bar', 'baz', 'qux');
insert into aliases (lyrics_id, alias, alias_column, alias_type) values (1, 'BAZ', 'title', 'uppercased');
insert into aliases (lyrics_id, alias, alias_column, alias_type) values (1, 'FOO', 'artist', 'uppercased');
insert into aliases (lyrics_id, alias, alias_column, alias_type) values (1, 'foo', 'artist', 'original');
insert into aliases (lyrics_id, alias, alias_column, alias_type) values (1, 'baz', 'title', 'original');

The queries would then be like the following two:

select * from lyrics where
lyrics_id in (select lyrics_id from aliases where
  alias_column='artist' and alias_type='uppercased' and alias='FOO')
and
lyrics_id in (select lyrics_id from aliases where
  alias_column='title' and alias_type='uppercased' and alias='BAZ');

select * from lyrics where
lyrics_id in (select lyrics_id from aliases where
  alias_column='artist' and alias='FOO')
and
lyrics_id in (select lyrics_id from aliases where
  alias_column='title' and alias='BAZ');

Those indexes are sufficient for SQLite to use them for search in both cases (by aliases first, and then fetching the lyrics). With original names being among aliases, it's a bit redundant (and potentially inconsistent), but easier to fetch names for output directly from the result.

Those potential inconsistencies and redundancies seem rather awkward, though maybe not too bad for a database that gets created and then only read from. Yet maybe it's worth trying a nicer structure.

defanor commented 5 years ago

It makes me to wonder whether librdf would be more suitable for such a data model. Though probably not worth switching now, and as long it's small and simple, it could easily be changed later.

Regarding types of aliases (alias_type): I think most of those should be processing types that are easy to do in any common language, and they should be specified too. If there won't be too many, we could provide combinations of those, but otherwise they should probably be ordered by complexity -- e.g., first applying the ones that don't require dealing with unicode at all (e.g., cutting out everything after the first opening paren, whitespace removal, ascii punctuation removal), then applying to the results the ones that require dealing with it (such as lowercasing, unicode punctuation removal), then Unidecode.

defanor commented 5 years ago

Introduced the aliases. So far just for different kinds of preprocessing, but aiming alternative band names and such.

Search improvement is likely to be an ongoing task for this project, but it seems to be sufficiently sorted out for now.