adambechtold / taste-explorer

1 stars 1 forks source link

Optimization - Create Listen from Lastfm Listen / Separate Queue for Track Names Not Yet Researched #16

Open adambechtold opened 8 months ago

adambechtold commented 8 months ago

Background

Every time a user listens to a song, we receive a Last.fm Listen which is not linked to a Spotify Track.

Friction - Backfilling a new listener's history can take over 1 day

The system can keep up with the daily listens for the 50 listeners its tracking right now, but backfilling a new listener takes a very long time.

When a new listener is added, we ingest their entire listening history. This could be up to 100,000 Listens.

Right now, we can link roughly 720 unique tracks per hour. Depending on how much overlap is in a listener's history (or overlap with other new listeners being backfilled), that translates to 1k-40k events per hour. At that rate, backfilling a single listener can take over 1 day.

Bottleneck - Spotify API Calls

Our music service picks these listening events up one-by-one and performs the following steps:

1) Identify Matching Track a) Search for Exact Match by Track Name and Artist Name b) Search for a Previous Fuzzy Match* c) Search Spotify using Track Name and Artist Name** 2) Link All not-yet-researched Last.fm Listens with the same Track Name and Artist Name to This Track

*(i.e. have we previously linked a Last.fm Listen with this same track name and artist name) **(this is where fuzzy matches from from, Spotify's judgement)

Step 1.c (Search Spotify) is the bottleneck

Spotify's rate limit is low and sometimes unpredictable (see Appendix). When we get rate limited, we get rate limited very hard (e.g. wait >16 hours). To avoid this, we run the entire system very conservatively (i.e. the 720 req/hour).

Approaches

Approach - Identify Listen Events that Require Spotify and Process Them in a Separate Queue

Right now, we throttle the whole system based on Spotify's rate limit. So, even if most of the listening events are getting matched in our own database, we still keep the system running slowly. This avoids the risk of getting throttled by Spotify, but many listening events that could be process immediately must wait.

Variant - Put Listen Events that Require Spotify in a queue

Identify Matching Track Internally 1) Search Exact Match 2) Search Previous Fuzzy Match 3) Link or Defer

Search Spotify 1) Pull from Search Spotify Queue 2) Search Spotify 3) Link Track

Storage Options

Approach - Profile Spotify's Rate Limit and Increase Throughput to that Limit

The 720 requests per hour is a conservative limit. We could likely increase the speed of the entire system up to the actual limit, but we've found Spotify's limit to be unpredictable and inconsistent with its documentation (see appendix).

We could profile Spotify to find the actual limit.

➡️ Proposal - Process Spotify Events in a Separate Queue; Then determine if increasing Spotify processing speed is required

Isolating listening events that require Spotify...

Appendix

Spotify's Rate Limit doesn't seem to match its documentation

Resource - Spotify's Documentation - Rate Limits

Questions

adambechtold commented 8 months ago

Whiteboard Link: https://link.excalidraw.com/l/ATbTqvQ19d9/2v0lsO9nZUt

🙅‍♂️ Friction - The current logic is confusing and muddy

The current approach is weird.

image

The side effects of the function are unclear

The function getTrackFromLastfmListenId doesn't just get the Track, it also stores the track and links it to the last.fm listen.

An argument could be made that this is a kind of "transparent caching", but the scope of the function is just so large and the caching has a kind of permanent affect on the data.

It feels different than the kind of caching we do for playlist generation. That cache is an in-memory data store, and it just stores the response. It's not modifying persistent data.

The scope of the side effects is buried

The input is a single last.fm listen id, but it affects all last.fm listens with the same trackName and artistName. That's not clear at all from the function name getTrackFromLastfmListenId

There are no guarantees against race conditions

We select one last.fm listen at a time, but the call may change many last.fm listens that have the same trackName and artistName. There's nothing stopping us from operating over two last.fm listens at the same time the same name. It's not clear that this would cause issues, but it doesn't seem safe at first glance.

adambechtold commented 8 months ago

📈 Projected Benefits - 90% of listens will no longer be blocked; 25% reduction in queries to spotify

It's not clear what this will translate to in terms of events processed per hour, but it seems reasonable that it would a worthwhile chunk larger. Determine the exact figure would require more effort than seems worthwhile (especially given the non-performance related code-cleanliness that will come with refactoring this to make the logic more clear)

image
adambechtold commented 8 months ago

Step 1 - Simply Logic

Without making any performance enhancements, can we make the system more clear, so that it's easier to optimize?

Here's an approach—

image
adambechtold commented 8 months ago

Step 2 - Optimize with an Async "Queue"

Next, let's optimize.

Separate analysis into two cron jobs

🤖 Cron Job - linkLastfmListensToTracks

1) Get and Lock the next set of last.fm listen events

🤖 Cron job - Search Spotify for Track and Link to Listens

1) Get the next spotify query from the queue 2) Search spotify for the track 3) If found, add Track to the database 4) Store a reference to the track we created on the queue 5) Get all last.fm listens that have this trackName and artistName, and are not linked to a track 6) Link these to the track

📍 Position - Adam: This seems like it has a little too much going on. An alternative approach is to simply mark those last.fm listens as not analyzed, so that they get analyzed again by the previous cron job, but this time, the Track is there so they can be linked. I just don't like that we lose the history :/

Table - Spotify Track Search Queue

Field Type Contraints
id bigint autoincrement, unique
trackName string
artistName string
createdAt datetime
searchedAt datetime
trackId id foreign key to track
image

Appendix

❓ Question - How to represent the queue?

Approach - two datetimes

searchedLocalDbAt searchedSpotifyAt

Approach - enum

A single field like "analysisStatus"

▼ Con - I can't see the performance of each system ▲ Pro - Seems simpler

✅ Decision - Have a separate table where things get processed

table - last.fm listens

table - spotify search queue

▲ Pro - Feels very explicit

Approach - SQS or similar

▼ Con - Harder to run locally ▼ Con - Harder to observe ▼ Con - More expensive (like $0.40/mo...) ▲ Pro - Less stress on the database

⚖️ Consideration - We could change the logic of the previous search function to use the Spotify Track Search Queue table.

▲ Pro - A more explicit implementation ▼ Con - Ties us to using the the history search queue for match lookups, which might change (e.g. maybe one day the queue is more ephemeral, so we can't rely on it) ▼ Con - More work ▼ Con - Doesn't have a clear enough benefit; This doesn't deliver any top-line benefit. It's just a code-quality code-logic change. It's a good idea to record, but doesn't offer much tangible benefit right now ✅ Decision - Store this as a consideration in a comment in the code for future readers

adambechtold commented 8 months ago

Development Plan

1) ✅ Simplify Logic (see above)

2) 🔄 Create async queue

3) Adjust linkLastfmListensToTracks to increase speed of non-spotify lookups

4) Measure

Add a few users and see how quickly they get processed

5) Deliver to Prod