adambechtold commented 8 months ago

Background

Every time a user listens to a song, we receive a Last.fm Listen which is not linked to a Spotify Track.

Friction - Backfilling a new listener's history can take over 1 day

The system can keep up with the daily listens for the 50 listeners its tracking right now, but backfilling a new listener takes a very long time.

When a new listener is added, we ingest their entire listening history. This could be up to 100,000 Listens.

Right now, we can link roughly 720 unique tracks per hour. Depending on how much overlap is in a listener's history (or overlap with other new listeners being backfilled), that translates to 1k-40k events per hour. At that rate, backfilling a single listener can take over 1 day.

Bottleneck - Spotify API Calls

Our music service picks these listening events up one-by-one and performs the following steps:

1) Identify Matching Track a) Search for Exact Match by Track Name and Artist Name b) Search for a Previous Fuzzy Match* c) Search Spotify using Track Name and Artist Name** 2) Link All not-yet-researched Last.fm Listens with the same Track Name and Artist Name to This Track

*(i.e. have we previously linked a Last.fm Listen with this same track name and artist name) **(this is where fuzzy matches from from, Spotify's judgement)

Step 1.c (Search Spotify) is the bottleneck

Spotify's rate limit is low and sometimes unpredictable (see Appendix). When we get rate limited, we get rate limited very hard (e.g. wait >16 hours). To avoid this, we run the entire system very conservatively (i.e. the 720 req/hour).

Approaches

Approach - Identify Listen Events that Require Spotify and Process Them in a Separate Queue

Right now, we throttle the whole system based on Spotify's rate limit. So, even if most of the listening events are getting matched in our own database, we still keep the system running slowly. This avoids the risk of getting throttled by Spotify, but many listening events that could be process immediately must wait.

Variant - Put Listen Events that Require Spotify in a queue

Identify Matching Track Internally 1) Search Exact Match 2) Search Previous Fuzzy Match 3) Link or Defer

Condition - Track Found
- Link all Pending, Matching Listening Events
Condition - Track Not Found
- Put in Search Spotify Queue

Search Spotify 1) Pull from Search Spotify Queue 2) Search Spotify 3) Link Track

Condition - Track Found
- Link all Pending, Matching Listening Events
Condition - Track Not Found
- Mark All Pending, Matching Listening Events as Analyzed

Storage Options

AWS SQS
In-memory Queue
On the Listens Table (e.g. add a status)

Approach - Profile Spotify's Rate Limit and Increase Throughput to that Limit

The 720 requests per hour is a conservative limit. We could likely increase the speed of the entire system up to the actual limit, but we've found Spotify's limit to be unpredictable and inconsistent with its documentation (see appendix).

We could profile Spotify to find the actual limit.

▲ Pro - Simple, does not increase the complexity of the system
▼ Con - Not resilient to a change (documented or not documented) in Spotify's rate limit policy
▼ Con - Does not speed up the ability to link tracks with information in the database

➡️ Proposal - Process Spotify Events in a Separate Queue; Then determine if increasing Spotify processing speed is required

Isolating listening events that require Spotify...

...completely removes the restriction of Spotify api calls on events that can be linked with our own data
- 📍 Position - Adam: The value of this will grow significantly as our own dataset grows to cover more and more of Spotify
...isolates Spotify events into a queue that can be further optimized

Appendix

Spotify's Rate Limit doesn't seem to match its documentation

Resource - Spotify's Documentation - Rate Limits

Claim - Spotify considers a rolling 30 second window when considering the rate limit
- 👀 Observation - Spotify will let us make many, many requests for a long time before throttling us. When it does throttle us, it's retry_after header is very large (e.g. 16 hours)

Questions

❓ Question - Is fixing this worthwhile?
❓ Question - During a backfills, what portion of listen events do not require Spotify?

adambechtold commented 8 months ago

Whiteboard Link: https://link.excalidraw.com/l/ATbTqvQ19d9/2v0lsO9nZUt

🙅‍♂️ Friction - The current logic is confusing and muddy

The current approach is weird.

The side effects of the function are unclear

The function getTrackFromLastfmListenId doesn't just get the Track, it also stores the track and links it to the last.fm listen.

An argument could be made that this is a kind of "transparent caching", but the scope of the function is just so large and the caching has a kind of permanent affect on the data.

It feels different than the kind of caching we do for playlist generation. That cache is an in-memory data store, and it just stores the response. It's not modifying persistent data.

The scope of the side effects is buried

The input is a single last.fm listen id, but it affects all last.fm listens with the same trackName and artistName. That's not clear at all from the function name getTrackFromLastfmListenId

There are no guarantees against race conditions

We select one last.fm listen at a time, but the call may change many last.fm listens that have the same trackName and artistName. There's nothing stopping us from operating over two last.fm listens at the same time the same name. It's not clear that this would cause issues, but it doesn't seem safe at first glance.

adambechtold commented 8 months ago

📈 Projected Benefits - 90% of listens will no longer be blocked; 25% reduction in queries to spotify

It's not clear what this will translate to in terms of events processed per hour, but it seems reasonable that it would a worthwhile chunk larger. Determine the exact figure would require more effort than seems worthwhile (especially given the non-performance related code-cleanliness that will come with refactoring this to make the logic more clear)

adambechtold commented 8 months ago

Step 1 - Simply Logic

Without making any performance enhancements, can we make the system more clear, so that it's easier to optimize?

Here's an approach—

🔒 The scope is more clear - All last.fm listen events that will be affected are found and locked up front
🙅‍♂️🤫 - No more secret side effects - each function does what it says it will do. "getter" functions don't modify underlying data

adambechtold commented 8 months ago

Step 2 - Optimize with an Async "Queue"

Next, let's optimize.

Separate analysis into two cron jobs

🤖 Cron Job - `linkLastfmListensToTracks`

1) Get and Lock the next set of last.fm listen events

find the next last.fm listen even that has not been analyzed
get all other last.fm listen events that have not been analyzed and are equivalent (same trackName and artistName)
lock these 2) Search the database for a matching Track 3) if found, link track to all the last.fm listens 4) if not found, upsert trackName and artistName into the SpotifySearchQueue table
⚖️ Consideration - If we have already made this spotify search before, this will have no effect on the table 5) Mark the last.fm listens as analyzed 6) Unlock all the last.fm listen events

🤖 Cron job - Search Spotify for Track and Link to Listens

1) Get the next spotify query from the queue 2) Search spotify for the track 3) If found, add Track to the database 4) Store a reference to the track we created on the queue 5) Get all last.fm listens that have this trackName and artistName, and are not linked to a track 6) Link these to the track

📍 Position - Adam: This seems like it has a little too much going on. An alternative approach is to simply mark those last.fm listens as not analyzed, so that they get analyzed again by the previous cron job, but this time, the Track is there so they can be linked. I just don't like that we lose the history :/

Table - Spotify Track Search Queue

Field	Type	Contraints
id	bigint	autoincrement, unique
trackName	string
artistName	string
createdAt	datetime
searchedAt	datetime
trackId	id	foreign key to `track`

UNIQUE KEY query_combination (trackName, artistName) (this creates a composite index, so enforcing the constraint is fast)

Appendix

❓ Question - How to represent the queue?

Approach - two datetimes

searchedLocalDbAt searchedSpotifyAt

▲ Pro - I can see when that analysis is happening

Approach - enum

A single field like "analysisStatus"

▼ Con - I can't see the performance of each system ▲ Pro - Seems simpler

✅ Decision - Have a separate table where things get processed

table - last.fm listens

table - spotify search queue

trackName
artistName
createdAt
searchedAt
trackId

▲ Pro - Feels very explicit

we are queuing up track search queries, this makes that explicit
the query we'll make isn't specific to the last.fm listen
- we avoid a kind of false specificity ▲ Pro - The search history on spotify is very clear
- search spotify, for a previous match, we could use this table instead of the last.fm listen table

Approach - SQS or similar

▼ Con - Harder to run locally ▼ Con - Harder to observe ▼ Con - More expensive (like $0.40/mo...) ▲ Pro - Less stress on the database

⚖️ Consideration - We could change the logic of the previous search function to use the `Spotify Track Search Queue` table.

▲ Pro - A more explicit implementation ▼ Con - Ties us to using the the history search queue for match lookups, which might change (e.g. maybe one day the queue is more ephemeral, so we can't rely on it) ▼ Con - More work ▼ Con - Doesn't have a clear enough benefit; This doesn't deliver any top-line benefit. It's just a code-quality code-logic change. It's a good idea to record, but doesn't offer much tangible benefit right now ✅ Decision - Store this as a consideration in a comment in the code for future readers

adambechtold commented 8 months ago

Development Plan

1) ✅ Simplify Logic (see above)

2) 🔄 Create async queue

[x] create table
[x] adjustgetTrackByName... logic to exclude a spotify search
[ ] create search spotify cron job

3) Adjust `linkLastfmListensToTracks` to increase speed of non-spotify lookups

4) Measure

Add a few users and see how quickly they get processed

adambechtold / taste-explorer

Optimization - Create Listen from Lastfm Listen / Separate Queue for Track Names Not Yet Researched #16

Background

Friction - Backfilling a new listener's history can take over 1 day

Bottleneck - Spotify API Calls

Step 1.c (Search Spotify) is the bottleneck

Approaches

Approach - Identify Listen Events that Require Spotify and Process Them in a Separate Queue