Search needs improvements and an API

C0rn3j commented 5 years ago

1. Search is lacking

At the moment I can't even search by title or album, much less both via some API.

There's Oleander https://lyrics.github.io/db/o/oleander/

with album Joyride https://lyrics.github.io/db/o/oleander/joyride/

with a song Runaway train https://lyrics.github.io/db/o/oleander/joyride/runaway-train/

Yet the only working search from these is the last one. https://lyrics.github.io/search.html?q=oleander https://lyrics.github.io/search.html?q=joyride https://lyrics.github.io/search.html?q=runaway%20train

2. Search API

For the API part, I'd for example like to have the lyrics show up in Deadbeef. For that there'd need to be a plugin that'd work via an API. There are plugins for sites like lyrics.wikia.com, and I can see they use an API that searches for both artist and the song.

https://github.com/loskutov/deadbeef-lyricbar/blob/master/src/utils.cpp#L25

No API is a show stopper IMO, this project would be infinitely more useful if things like music players could rely on it.

3. Translations

Looks like the /Lyrics repo has a translations folder but it doesn't seem to be supported at all by the site?

snshn commented 5 years ago

This merge https://github.com/Lyrics/lyrics/pull/217 has fixed the issue with Oleander's song, please take a look. Now those links to search.html that you listed above work.

Here's what's going on: the lyrics.github.io website is essentially a collection of static .html files, since GitHub only allows to host non-dynamic (no execution on the server) files. The search is implemented by querying the GitHub's API with the lyrics repo being the target. It has more flaws than the fact that it doesn't include file names into search results by default (e.g. there's a limit on the amount of requests users who aren't currently logged in into their github.com can make, like 10 search requests per minute or so). GitHub's search engine is optimized for searching through large amounts of code, which works for finding lyrics, but in order to find something by a file/album/artist name, that information needs to be within the file (which metadata takes care of).

Regarding Deadbeef and other players, the original idea that I've had was to download the whole copy of lyrics.git to be stored locally, and then search for lyrics not via a remote API, but rather using metadata within the locally stored database. Aside from privacy and speed, this approach is more reliable (e.g. any smartphone today could search through hundreds of megabytes of lyrics stored locally, not needing any Internet connection). But valid point about the API, perhaps it could be added if this projects gets its own server and a domain name, however it's not exactly there yet... lyrics.git needs work and become popular enough in order to justify that transition from the free GitHub pages hosting that it's currently using.

C0rn3j commented 5 years ago

Nice that the Oleander song is fixed, but as you pointed out, practically any other song has the same issue right now.

If you add the functionality, I'll gladly host it if my VPS can manage (I don't see why it wouldn't at this point with the DB being tiny 3.5K files).

Regarding Deadbeef and other players, the original idea that I've had was to download the whole copy of lyrics.git to be stored locally, and then search for lyrics not via a remote API, but rather using metadata within the locally stored database.

That's an interesting approach. One issue I see with this is the number of files. At this point it's 3.5K~, but if this project gets bigger, having XXXk number of small files could become an issue (just try copying such amount of files to a microSD card for example, it's going to take a long while, and compiling it into a single big database might prove better.

snshn commented 5 years ago

As a temporary solution, I could write a shell script to crawl the database, add metadata to every file that is necessary to make the search by album/artist/title possible. As another solution, the search logic on the website itself could be improved to parse sitemap.xml in addition to hitting the GitHub search API to seek lyrics by albums/artists/titles using that approach.

Since the database is open and available in full, please feel free to create an API server and host it on your server, we could host the source code repo within the Lyrics organization and share the link to your server's API endpoint for others to use. Currently the database lacks enough metadata, but at least searching by artists/albums/titles using file names could be implemented this way.

It could also be possible to leverage the GitHub's code search engine a little more, e.g. looking for path:database/A/ARTIST when searched for artist:ARTIST, and filename:SONGNAME when searched for title:SONGNAME on the website. Here's a reference to the full documentation: https://help.github.com/articles/searching-code/

Regarding using lyrics.git as a local database, I'd think by the time it becomes a popular source of lyrics, it's likely to contain hundreds of thousands, or even millions of compositions. At that point it'll be only possible to work with that amount of data latency-free if converted into a sqlite (or any other) database, which I believe many apps/programs already do when it comes to working with large amounts of data (since most filesystems have a high overhead and many other limitations when it comes to working with thousands of files or more).

defanor commented 5 years ago

since most filesystems have a high overhead and many other limitations when it comes to working with thousands of files or more

It's not that bad with modern filesystems (see https://unix.stackexchange.com/questions/28756/what-is-the-most-high-performance-linux-filesystem-for-storing-a-lot-of-small-fi), though a database should indeed be more suitable for search.

I like the idea of using the database locally, but here's a few things to consider regarding an online service's API:

Some music players rely on more or less similar lyrics scraping across all databases (websites), such as https://github.com/clementine-player/Clementine/blob/master/data/lyrics/ultimate_providers.xml, so it would be nice to make this API compatible with such an approach. Since we're aiming a web interface too, could just make it (or at least xpaths to the important parts) stable and specified, and perhaps aim the XHTML concrete HTML syntax for ease of use with regular XML tools.
Some basic/minimalistic players and plugins (such as the ncmpc front end for mpd) tend to rely on lyrics plugins that have much heavier dependencies (Python, Ruby): would be nice to make the API easy enough to use even from plain C and without additional dependencies (that is, plain TCP and minimal parsing). This can be done with an HTTP-based API too, if non-TLS connections will be allowed, no redirects will be used, plain texts will be served.
While the database is small, it's not likely to be a welcome addition into music players that support lyrics databases (and provide lists of those, without a way to tweak them without tweaking the sources). So, it might be useful to mimic APIs of some of the larger databases in order to make it easily usable.
For the search, full-text search probably isn't necessary (and may just introduce ambiguity if not used carefully), but to deal with different spellings, there would have to be aliases/synonyms. Also will have to handle (ignore?) "(live)" and similar marks in client-provided titles, perhaps. Maybe full-text search with a custom dictionary would be handy for that, actually.
To avoid dealing with the zoo of data models, there's the music ontology, which includes lyrics (http://musicontology.com/specification/#term-Lyrics) and some metadata. Since it's not widely used (AFAICT), it's tricky to rely only on that, but at least is should be easy to just embed RDFa into served HTML documents. Actually could also use SPARQL to provide a somewhat standard querying, without inventing a custom one (though standard and straightforward queries won't be able to drop "(live)", for instance).
Music player queries usually include artist name and song title, but album names may be unavailable, or other parameters can be available. The API should handle optional parameters (or, again, just be a SPARQL endpoint).
Some search queries will be ambiguous (e.g., if there are same songs on different albums, and it's not known which album the song in question came from). I think a more or less generic approach to handle it would be to serve the best guess lyrics along with a remark about other options, though could just provide a parameter that would allow to get best guess lyrics only, a list of options only, or even all the matching lyrics at once. This can be combined with the regular browsable web interface too, using ambiguous (artist-only, artist+album) requests/queries for listing.
It might be desirable to use the same DBMS/schema/dumps for local access and the online service, so that we won't have to maintain and deal with the same issues on multiple ones. SQLite would probably be nice to distribute/update as a local database, but not sure if its functionality would be sufficient: there's no stored procedures, for instance, so dealing with those "(live)" marks would have to happen in music players or their plugins (and that'd be dealing with the same issues in multiple places, too). We could provide a library, of course, but that'd be an additional and uncommon dependency. Although we could also write a generic library for lyrics retrieval from multiple sources (if there's no such library around yet, but I'm not aware of one), including this database -- potentially more useful, perhaps with a higher chance of adoption.

Putting all those together, the API may be all HTTP-based, serve as a web interface (and maybe a SPARQL endpoint) at once, and only consist of search with optional parameters. But with a few different representations (alternate versions) in order to be easily usable for different music players. And not sure which DBMS to use.

defanor commented 5 years ago

A bit more on DBMSes: querying with SPARQL or SQL would also let clients to easily switch between using a remote HTTP server and a local database (using the same queries). While librdf supports file-based storage modules (including an SQLite database, http://librdf.org/docs/api/redland-storage-modules.html), and can be used to read RDFa data from web pages. Combined with the music ontology, this may let us to avoid duplication and reinvention of a few things, while still let to distribute the database as a single portable file.

C0rn3j commented 5 years ago

@defanor do you want to take a stab at creating the API? While I could do it, it'd take me an eternity since I'd be learning almost everything from scratch.

defanor commented 5 years ago

I can at least implement a prototype, but first we need to figure what the interface itself should be like. For instance, while I like RDF, it may be rather obscure, and while using SPARQL directly for querying has its advantages, we'll be stuck with it then (which is perhaps worse than being stuck with SQL or with arbitrary URL query parameters). Another issue with SPARQL is the lack of quick text search: it should be fine if we'll match on artist name first (aliases can be used easily, and then there will be just a few song titles to check with filter), but for a web service/website it would be an issue. Though neither does stock SQLite without plugins have suitable indexes, AFAIK. String searching even without indexes wouldn't take long enough to notice on the current database, I think, and has no perceivable delay on an SQLite table with 15 thousand records here, but it wouldn't be nice to get stuck with a public API that is hard to reimplement and attached to a slow database.

I've looked up Debian package installation statistics (https://popcon.debian.org/main/by_inst) now, to get a rough estimate of different DBMS libraries being already available on client machines (for the local database access), here they are:

#rank name                            inst  vote   old recent no-files (maintainer)
66    libsqlite3-0                   197493 126373 41500 15909 13711 (Laszlo Boszormenyi)
468   libxapian30                    122098    89  1006    67 120936 (Olly Betts)
620   libpq5                         108101 12279 13521  2198 80103 (Debian Postgresql Maintainers)
722   librdf0                        101575    62   563    28 100922 (Dave Beckett)
3686  postgresql                     14866    45    35     7 14779 (Debian Postgresql Maintainers)
6252  virtuoso-opensource-6.1-common  4885   385  4433    61     6 (Debian Qa Group)

So, while libsqlite3 is installed almost everywhere (and wouldn't be a burden as a dependency), librdf seems to be already present on more than a half of the systems (similar to make, gcc, etc), while postgresql server is rather rare. I've included xapian too, because it could take care of indexing/search, and then we could consider still using files and git, or DBMSes without nice search (but that'd be harder to use from plugins, would require a custom API, etc).

Apparently we'd have to choose which of the listed interface and implementation properties to go after.

defanor commented 5 years ago

I forgot: xapian has query parser that will probably be usable on its own. So could achieve the same as with SPARQL or SQL: same queries for local lookups and requests to an online service. Though then it'd be full-text search, and while it's customizable, not sure if it can be told to just ignore (or assign low weight to) any "(live in <place>)" marks.

Update: though using the same queries isn't that important, especially if it won't be SPARQL: xapian's querying language is rather far from standard for data querying (neither will the field names be standard, while with the music ontology and SPARQL the whole API could be described just as "a SPARQL endpoint serving the music ontology data"), and querying from this database doesn't have to be flexible anyway.

C0rn3j commented 5 years ago

This is the first time I heard about SPARQL/RDF and I had to look it up. It could be a better idea to base on something more common like SQL, but I don't really have much of an idea about intricacies of either.

What are we actually solving here local-wise? Are there music players that are capable, without plugins/modifications to query local files/DB for lyrics and other metadata?

You're saying "same queries for local lookups and requests to an online service." which I imagine leads to a 'no'. Sorry for basic questions, I just have no idea about the state of music players.

Btw http://musicontology.com is down right now, and Google cache doesn't seem to load, might be a good idea to post a copy of it in here when it's back up.

defanor commented 5 years ago

This is the first time I heard about SPARQL/RDF and I had to look it up.

Yes, as mentioned above, it's quite obscure. Though librdf is installed surprisingly commonly, it seems, and clients don't really have to master it in order to issue typical queries. But there are other reasons why it may not be suitable.

What are we actually solving here local-wise? Are there music players that are capable, without plugins/modifications to query local files/DB for lyrics and other metadata?

I'm not aware of such, but what I think would be nice for plugins/modifications is to be easy to write, and to require minimal additional dependencies (preferably nothing large, and something that's usually already installed). That should increase chances of local database adoption (not just the API). Maybe they shouldn't be mixed together at all, but it would be nice (easier) to maintain a single database/schema for both, and the same functionality is needed for both.

defanor commented 5 years ago

To summarize DBMS/storage/search options:

PostgreSQL: Fast and flexible search (FTS, trigrams), but a potentially undesirable dependency for plugins using local database.
SQLite: Not much of a dependency, but doesn't have a flexible text search without plugins.
Files: Similar to SQLite for this purpose, maybe slower.
librdf: Quite commonly available, and can be used with the music ontology, but only strict matching is fast, as with the previous two.
xapian: Common and provides full-text search, not intended to be used for storage itself. But no C API in sight, which is rather bad for plugins using a local database.

So, there are distinct groups: the ones with good (flexible and indexed) search (which we don't really need yet, but hopefully will grow enough to need it someday), and the ones that are relatively easily embeddable into arbitrary software (which we don't need at once either, but also aim).

Full-text search is only needed for search by song name alone, which is only required on the website. Otherwise strict matching on artist name/aliases should narrow the records sufficiently to just go through them (somehow, not sure how exactly) without a fancy index, although FTS could still be useful for finding the closest match among available records.

As an alternative to full-text search, we could consider some basic custom preprocessing (e.g., dropping everything after the first opening paren, lower-casing the titles, etc), but it's likely to be quite unreliable. FTS wouldn't be perfect either, and I guess it's not realistic to set aliases for every song title variation.

defanor commented 5 years ago

Turns out FTS5 (the full-text search extension) is included into SQLite since version 3.9. It's not in CentOS yet, but already in Debian stable, and likely in other common distributions. So, SQLite looks like a pretty good option.

defanor commented 5 years ago

Trying it now, and apparently a bit of query preparation (and/or preprocessing) will be needed anyway: we can't just require all the terms (because of those marks), and simply using OR between every term would be quite ambiguous. But trying to get the longest prefix should work fine (since the marks are usually in the end): e.g., turning "foo bar baz" title into a ... MATCH '("foo bar baz") OR ("foo bar") OR ("foo")' ORDER BY rank query.

Going to fill a database with lyrics from files tomorrow, then we could poke it more, see how it works, and perhaps proceed to a website/API then.

defanor commented 5 years ago

Test schema (just search fields, no lyrics): CREATE VIRTUAL TABLE lyrics USING fts5(artist, album, title);, awk script:

{
    FS = "/";
    if ($4 != "") {
        gsub(/'/, "''");
        printf "INSERT INTO lyrics VALUES ('%s', '%s', '%s');\n", $2, $3, $4
    }
}

Can be invoked with find [A-Z] -print | awk -f query.awk | sqlite3 -batch lyrics.db from the database directory, to fill the database.

We'd probably need much more data and check it with actual playlists to be sure, but so far seems to work fine with queries akin to the one above (e.g., select artist,title,rank from lyrics where lyrics match 'title : "foo bar" OR title : "foo"' order by rank;). Feel free to try it too, and if it's fine, we could proceed more or less confidently. Here is FTS5 documentation, by the way: https://www.sqlite.org/fts5.html.

defanor commented 5 years ago

The "initial token queries" should be useful for longest prefix search, but they don't seem to be supported in sqlite 3.16.2 (the one in Debian stable).

Regarding further steps: we'll need to decide which language(s) to use for the website/API, as well as for the database filling tool. I'm thinking of Perl (perhaps with CGI, and with XSLT for templating): not that I like it or used in the past few years, but it's commonly available, and should be easily editable by everyone. And still less awkward/error-prone than longer shell scripts. Though I'm open to suggestions.

C0rn3j commented 5 years ago

@defanor if you'd like I can spin you up a container with Arch (sqlite 3.26.0) with SSH and HTTP/HTTPS access.

defanor commented 5 years ago

Thanks, but no need (at least not until we'll have something to deploy): I've just built 3.26 for development here, and apart from "disconnect", it seems to work with the old Perl module. And I could always run a local VM, but it's more comfortable without those.

I'm going to poke relevant Perl modules today and tomorrow (sqlite and libxml/libxslt bindings, CGI), and see how it goes.

defanor commented 5 years ago

Wrote a rough script to fill an sqlite database with lyrics, it's 4 MB with 2488 texts (with text UNINDEXED column in addition to the ones mentioned above). But this doesn't account for different spellings/aliases. Perhaps will postpone dealing with those for later, since the database itself doesn't keep track of them (and not sure if it should).

Also need to decide whether to put those into the "lyrics" repository, or a separate one. @vflyson, what do you think?

C0rn3j commented 5 years ago

For different forms themselves, I think a lot of it would get covered by converting everything down to the base character like so, so songs with weird unicode(ish) characters would have two entries, one with the accents, one without.

However at least in PHP this was extremely inefficient to do (not that I tried to optimize it)

#   $a = array('À', 'Á', 'Â', 'Ã', 'Ä', 'Å',  'Æ', 'Ç', 'È', 'É', 'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ï', 'Ð', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', 'Ø', 'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å',  'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'ÿ', 'Ā', 'ā', 'Ă', 'ă', 'Ą', 'ą', 'Ć', 'ć', 'Ĉ', 'ĉ', 'Ċ', 'ċ', 'Č', 'č', 'Ď', 'ď', 'Đ', 'đ', 'Ē', 'ē', 'Ĕ', 'ĕ', 'Ė', 'ė', 'Ę', 'ę', 'Ě', 'ě', 'Ĝ', 'ĝ', 'Ğ', 'ğ', 'Ġ', 'ġ', 'Ģ', 'ģ', 'Ĥ', 'ĥ', 'Ħ', 'ħ', 'Ĩ', 'ĩ', 'Ī', 'ī', 'Ĭ', 'ĭ', 'Į', 'į', 'İ', 'ı',  'Ĳ',  'ĳ', 'Ĵ', 'ĵ', 'Ķ', 'ķ', 'Ĺ', 'ĺ', 'Ļ', 'ļ', 'Ľ', 'ľ', 'Ŀ', 'ŀ', 'Ł', 'ł', 'Ń', 'ń', 'Ņ', 'ņ', 'Ň', 'ň', 'ŉ', 'Ō', 'ō', 'Ŏ', 'ŏ', 'Ő', 'ő',  'Œ',  'œ', 'Ŕ', 'ŕ', 'Ŗ', 'ŗ', 'Ř', 'ř', 'Ś', 'ś', 'Ŝ', 'ŝ', 'Ş', 'ş', 'Š', 'š', 'Ţ', 'ţ', 'Ť', 'ť', 'Ŧ', 'ŧ', 'Ũ', 'ũ', 'Ū', 'ū', 'Ŭ', 'ŭ', 'Ů', 'ů', 'Ű', 'ű', 'Ų', 'ų', 'Ŵ', 'ŵ', 'Ŷ', 'ŷ', 'Ÿ', 'Ź', 'ź', 'Ż', 'ż', 'Ž', 'ž', 'ſ', 'ƒ', 'Ơ', 'ơ', 'Ư', 'ư', 'Ǎ', 'ǎ', 'Ǐ', 'ǐ', 'Ǒ', 'ǒ', 'Ǔ', 'ǔ', 'Ǖ', 'ǖ', 'Ǘ', 'ǘ', 'Ǚ', 'ǚ', 'Ǜ', 'ǜ', 'Ǻ', 'ǻ',  'Ǽ',  'ǽ', 'Ǿ', 'ǿ', 'Ά', 'ά', 'Έ', 'έ', 'Ό', 'ό', 'Ώ', 'ώ', 'Ί', 'ί', 'ϊ', 'ΐ', 'Ύ', 'ύ', 'ϋ', 'ΰ', 'Ή', 'ή');
#   $b = array('A', 'A', 'A', 'A', 'A', 'A', 'AE', 'C', 'E', 'E', 'E', 'E', 'I', 'I', 'I', 'I', 'D', 'N', 'O', 'O', 'O', 'O', 'O', 'O', 'U', 'U', 'U', 'U', 'Y', 's', 'a', 'a', 'a', 'a', 'a', 'a', 'ae', 'c', 'e', 'e', 'e', 'e', 'i', 'i', 'i', 'i', 'n', 'o', 'o', 'o', 'o', 'o', 'o', 'u', 'u', 'u', 'u', 'y', 'y', 'A', 'a', 'A', 'a', 'A', 'a', 'C', 'c', 'C', 'c', 'C', 'c', 'C', 'c', 'D', 'd', 'D', 'd', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'G', 'g', 'G', 'g', 'G', 'g', 'G', 'g', 'H', 'h', 'H', 'h', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'IJ', 'ij', 'J', 'j', 'K', 'k', 'L', 'l', 'L', 'l', 'L', 'l', 'L', 'l', 'l', 'l', 'N', 'n', 'N', 'n', 'N', 'n', 'n', 'O', 'o', 'O', 'o', 'O', 'o', 'OE', 'oe', 'R', 'r', 'R', 'r', 'R', 'r', 'S', 's', 'S', 's', 'S', 's', 'S', 's', 'T', 't', 'T', 't', 'T', 't', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'W', 'w', 'Y', 'y', 'Y', 'Z', 'z', 'Z', 'z', 'Z', 'z', 's', 'f', 'O', 'o', 'U', 'u', 'A', 'a', 'I', 'i', 'O', 'o', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'A', 'a', 'AE', 'ae', 'O', 'o', 'Α', 'α', 'Ε', 'ε', 'Ο', 'ο', 'Ω', 'ω', 'Ι', 'ι', 'ι', 'ι', 'Υ', 'υ', 'υ', 'υ', 'Η', 'η');

defanor commented 5 years ago

FTS5 takes care of that (tried it with Abwärts/Abwarts, Dödelhaie/Dodelhaie), but I'm more worried about the kinds of different spelling that can't be dealt with using just character substitution: for instance, some bands change names, but tags don't necessarily reflect those properly. Some may also be commonly abbreviated. And transliteration between some languages is more tricky than that.

Meantime, I've composed a basic search script that expands queries to look for the longest prefix, and perhaps will proceed to the web bits next. Most likely will have to get back to search later (tweak the queries themselves and/or preprocessing), but it would be nice to have a working API prototype.

defanor commented 5 years ago

Created the https://github.com/Lyrics/lyrics-api repository, perhaps we could continue the API/DB discussion in issues there. The initial CGI script is ready (though didn't push yet, trying to figure which license to use: AGPL seems appropriate for web-facing software, LGPL/BSD/MIT would be more suitable for libraries if we'll have any, or if any would be based on the CGI script), currently it just serves up to 10 best matches. But probably should serve listings if no song title at all is provided, and/or if there's a bunch of results with roughly the same rankings. Well, there's many small things like that, will leave them for issues in that repository.

C0rn3j commented 5 years ago

Can just release with the more 'draconic' licenses and always relicense to different ones later if needed.

defanor commented 5 years ago

Indeed. Licensed under AGPL and pushed.

snshn commented 5 years ago

The database has been updated to contain basic metadata for all lyrics. An automated test to ensure this doesn't become an issue again in the future is in the making.

snshn commented 5 years ago

The aforementioned automated test is in place and lyrics.git now has Travis CI hooked up to it to check every commit to ensure the database retains its consistency.

Lyrics / lyrics.github.io