I have a plan for where I could take the Spotify integration side project. I've decided to play to my strengths! Time for a stream of consciousness :D
Things on my side:
I've got a pretty decent grasp on the objects and the overall structure of the Spotify REST API.
I've also got good practice in enforcing schemas on objects parsed from LLM outputs (there's a quick sketch of what I mean below).
For my personal use cases, I know how to flexibly run local LLM and embedding models.
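Since the schema-enforcement bit is load-bearing for everything below, here's a minimal sketch of what I mean, using pydantic (the model and field names are just placeholders, not a final design):

```python
import json
from pydantic import BaseModel, ValidationError

class Intent(BaseModel):
    action: str  # e.g. "play", "make_playlist"
    query: str   # free-text payload, e.g. "Espresso"

def parse_llm_output(raw: str) -> Intent | None:
    """Validate raw LLM text against the schema; None means re-prompt."""
    try:
        return Intent(**json.loads(raw))
    except (json.JSONDecodeError, ValidationError):
        return None

# e.g. parse_llm_output('{"action": "play", "query": "Espresso"}')
```

The nice part is that when the model's output doesn't validate, you get a structured error you can feed back into the prompt for a retry.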
What a cool endgame could look like:
A lightweight, AI-powered music player that does everything you'd want, with a 100% offline option.
This has a lot of parts to it, some of which are more in my domain than others. Here's what I think I could tackle:
User audio -> (locally-sourced) string -> (local) LLM output matching a schema -> API call to my (local) server -> Run action*
* = An action could be, in theory, anything that my level of Python wizardry can squeeze into a function.
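To make that footnote concrete, here's the kind of dispatch I'm picturing on the server side: a registry that maps validated action names to plain Python functions (every name here is made up):

```python
from typing import Callable

ACTIONS: dict[str, Callable[[str], None]] = {}

def action(name: str):
    """Decorator: register a function under an action name."""
    def register(fn: Callable[[str], None]):
        ACTIONS[name] = fn
        return fn
    return register

@action("play")
def play(query: str) -> None:
    print(f"pretend we're playing: {query}")

@action("make_playlist")
def make_playlist(query: str) -> None:
    print(f"pretend we're assembling a playlist from: {query}")

def run(intent_action: str, query: str) -> None:
    ACTIONS[intent_action](query)  # the server endpoint would call this
```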
The hypothetical: Let's imagine I'm living in a rural town with my laptop, and there's no internet connection! What's the best music system I could build for myself given the constraints? Well, I can assume I'm limited by the capabilities of my laptop. I'm used to doing a lot of stuff manually in times like those, but it's 2024, AI is still sexy, and we have a lot of tricks up our sleeve! Here are some voice-powered examples I think would be so absurdly dope if they worked offline. Also, I want it to be called Moxie!!
"Hey Moxie, play Espresso"
"Hey Moxie, make a playlist with all my Eminem and Melanie Martinez songs"
For a more complicated example, like "Hey Moxie, request songs I might like", here are some thoughts on the implementation:
In offline mode, this would create a "request": a list of jobs that get cached, then executed once we're connected to the internet (there's a sketch of the cache after this list).
A job might include:
API requests to providers like Spotify, YouTube, lyrics sites, and search engines.
API requests to run inference on models that are text-to-text (LLM), audio-to-text (like Whisper), text-to-audio (this is bigger for demos than core functionality, but perhaps as the tech improves, it'll keep getting cooler, so it's game), or text-to-embedding (Hell yeah, of course we'll RAG it!)
Functions that access and cache data from the user's library^*
^* = This library information is implemented as a (local) database (things like song metadata, play count, last played, etc.). In theory, we can work with more complex info being stored, like the details of each listening session. This could be a good bedrock for a tool that can "learn from your habits!"
Functions that read/write to your music DB.
steve jobs (just to see if u fell asleep or not)
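Here's a rough sketch of how that job cache could work, since it's really just a table of pending work (SQLite, with made-up table and field names):

```python
import json
import sqlite3

db = sqlite3.connect("moxie.db")
db.execute("""CREATE TABLE IF NOT EXISTS jobs (
    id INTEGER PRIMARY KEY,
    kind TEXT NOT NULL,     -- e.g. 'spotify_api', 'llm_inference'
    payload TEXT NOT NULL,  -- JSON blob describing the request
    done INTEGER DEFAULT 0
)""")

def enqueue(kind: str, payload: dict) -> None:
    """Cache a job while offline."""
    db.execute("INSERT INTO jobs (kind, payload) VALUES (?, ?)",
               (kind, json.dumps(payload)))
    db.commit()

def flush(run_job) -> None:
    """Once online: run_job(kind, payload) does the real work."""
    rows = db.execute(
        "SELECT id, kind, payload FROM jobs WHERE done = 0").fetchall()
    for job_id, kind, payload in rows:
        run_job(kind, json.loads(payload))
        db.execute("UPDATE jobs SET done = 1 WHERE id = ?", (job_id,))
    db.commit()
```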
A flow for "Hey Moxie, request songs I might like":
API call from the device to a speech-to-text model (audio -> text inference)
Handle output (which server receives this? Am I agnostic to this, or can it only be the local device?)
Run actions (including LLM inference)
Request content from a provider, like Spotify or YouTube (e.g. relevant playlists)
Read/write to the user's database.
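And since a couple of those steps touch the library DB from the footnote above, here's a guess at a starting schema (fields pulled straight from what I listed: metadata, play count, last played, listening sessions; none of this is final):

```python
import sqlite3

lib = sqlite3.connect("library.db")
lib.executescript("""
CREATE TABLE IF NOT EXISTS songs (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    artist TEXT NOT NULL,
    play_count INTEGER DEFAULT 0,
    last_played TEXT              -- ISO-8601 timestamp
);
CREATE TABLE IF NOT EXISTS sessions (
    id INTEGER PRIMARY KEY,
    song_id INTEGER REFERENCES songs(id),
    started_at TEXT,
    seconds_listened INTEGER      -- the "learn from your habits" fodder
);
""")
```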
Back to that original flow, the segment with: an output matching a schema -> API call to my (local) server -> Run action**. I wonder what my perspective will be on this in a couple of months!
** = These two steps will likely have the trickiest learning curve for me. On the positive side, they involve many of the things I'd like to practice! I want to create my own REST API for this, which I've never done! I've played around with GCP's cloud functions, but I'll still have to expand my understanding of the specifics for a use case with database access and real-time playback on a device, fine-grained down to the level of volume control. Playback is handled as API calls to my network, with the potential to play on any device connected to me (locally). I'll also get to practice collecting good-quality, well-defined metadata and working with a DB in a very practical app. I also want to keep in mind how scale might look as I tackle something like this. For instance, let's say I had 5 other members in my house, and they can connect to the network running on my laptop. Would the DB change? What if this goes full-blown and I were to handle the DB and requests for a huge number of users, with no consideration of offline processing?
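For the REST API itself, something like FastAPI feels like a natural first stab: a couple of endpoints for playback and volume, served on the local network (endpoint shapes and names here are hypothetical, obviously):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PlayRequest(BaseModel):
    query: str
    device: str = "laptop"  # any device on the local network

@app.post("/play")
def play(req: PlayRequest):
    # look the track up locally, then hand off to the audio backend
    return {"status": "playing", "query": req.query, "device": req.device}

@app.post("/volume/{level}")
def set_volume(level: int):
    level = max(0, min(100, level))  # fine-grained, but clamped sane
    return {"status": "ok", "volume": level}

# run with something like: uvicorn server:app --host 0.0.0.0
```

Serving on 0.0.0.0 is also what would let those 5 hypothetical housemates hit the laptop from their own devices.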
On that note, perhaps I should eat some lunch and take a nap. I'm not sure all of that made sense, but I think it's a fantastic leaping-off point. I'll try not to get too lost in the sauce of code and will work on sketching this out over time!