Open adambechtold opened 8 months ago
Whiteboard Link: https://link.excalidraw.com/l/ATbTqvQ19d9/2v0lsO9nZUt
The current approach is weird.
The function getTrackFromLastfmListenId
doesn't just get the Track
, it also stores the track and links it to the last.fm listen.
An argument could be made that this is a kind of "transparent caching", but the scope of the function is just so large and the caching has a kind of permanent affect on the data.
It feels different than the kind of caching we do for playlist generation. That cache is an in-memory data store, and it just stores the response. It's not modifying persistent data.
The input is a single last.fm listen id
, but it affects all last.fm listens
with the same trackName
and artistName
. That's not clear at all from the function name getTrackFromLastfmListenId
We select one last.fm listen
at a time, but the call may change many last.fm listens
that have the same trackName
and artistName
. There's nothing stopping us from operating over two last.fm listens
at the same time the same name. It's not clear that this would cause issues, but it doesn't seem safe at first glance.
It's not clear what this will translate to in terms of events processed per hour
, but it seems reasonable that it would a worthwhile chunk larger. Determine the exact figure would require more effort than seems worthwhile (especially given the non-performance related code-cleanliness that will come with refactoring this to make the logic more clear)
Without making any performance enhancements, can we make the system more clear, so that it's easier to optimize?
Here's an approach—
Next, let's optimize.
linkLastfmListensToTracks
1) Get and Lock the next set of last.fm listen events
trackName
and artistName
)Track
3) if found, link track to all the last.fm listens
4) if not found, upsert trackName
and artistName
into the SpotifySearchQueue
tablelast.fm listens
as analyzed
6) Unlock all the last.fm listen
events1) Get the next spotify query from the queue 2) Search spotify for the track 3) If found, add Track to the database 4) Store a reference to the track we created on the queue 5) Get all last.fm listens that have this trackName and artistName, and are not linked to a track 6) Link these to the track
📍 Position - Adam: This seems like it has a little too much going on. An alternative approach is to simply mark those last.fm listens as not analyzed, so that they get analyzed again by the previous cron job, but this time, the Track is there so they can be linked. I just don't like that we lose the history :/
Field | Type | Contraints |
---|---|---|
id | bigint | autoincrement, unique |
trackName | string | |
artistName | string | |
createdAt | datetime | |
searchedAt | datetime | |
trackId | id | foreign key to track |
UNIQUE KEY query_combination (trackName, artistName)
(this creates a composite index, so enforcing the constraint is fast)searchedLocalDbAt searchedSpotifyAt
A single field like "analysisStatus"
▼ Con - I can't see the performance of each system ▲ Pro - Seems simpler
table - last.fm listens
table - spotify search queue
▲ Pro - Feels very explicit
▼ Con - Harder to run locally ▼ Con - Harder to observe ▼ Con - More expensive (like $0.40/mo...) ▲ Pro - Less stress on the database
Spotify Track Search Queue
table.▲ Pro - A more explicit implementation ▼ Con - Ties us to using the the history search queue for match lookups, which might change (e.g. maybe one day the queue is more ephemeral, so we can't rely on it) ▼ Con - More work ▼ Con - Doesn't have a clear enough benefit; This doesn't deliver any top-line benefit. It's just a code-quality code-logic change. It's a good idea to record, but doesn't offer much tangible benefit right now ✅ Decision - Store this as a consideration in a comment in the code for future readers
getTrackByName...
logic to exclude a spotify searchlinkLastfmListensToTracks
to increase speed of non-spotify lookupsAdd a few users and see how quickly they get processed
Background
Every time a user listens to a song, we receive a
Last.fm Listen
which is not linked to a Spotify Track.Friction - Backfilling a new listener's history can take over 1 day
The system can keep up with the daily listens for the 50 listeners its tracking right now, but backfilling a new listener takes a very long time.
When a new listener is added, we ingest their entire listening history. This could be up to 100,000 Listens.
Right now, we can link roughly 720 unique tracks per hour. Depending on how much overlap is in a listener's history (or overlap with other new listeners being backfilled), that translates to 1k-40k events per hour. At that rate, backfilling a single listener can take over 1 day.
Bottleneck - Spotify API Calls
Our music service picks these listening events up one-by-one and performs the following steps:
1) Identify Matching Track a) Search for Exact Match by Track Name and Artist Name b) Search for a Previous Fuzzy Match* c) Search Spotify using Track Name and Artist Name** 2) Link All not-yet-researched Last.fm Listens with the same Track Name and Artist Name to This Track
*(i.e. have we previously linked a
Last.fm Listen
with this same track name and artist name) **(this is where fuzzy matches from from, Spotify's judgement)Step 1.c (Search Spotify) is the bottleneck
Spotify's rate limit is low and sometimes unpredictable (see Appendix). When we get rate limited, we get rate limited very hard (e.g. wait >16 hours). To avoid this, we run the entire system very conservatively (i.e. the 720 req/hour).
Approaches
Approach - Identify Listen Events that Require Spotify and Process Them in a Separate Queue
Right now, we throttle the whole system based on Spotify's rate limit. So, even if most of the listening events are getting matched in our own database, we still keep the system running slowly. This avoids the risk of getting throttled by Spotify, but many listening events that could be process immediately must wait.
Variant - Put Listen Events that Require Spotify in a queue
Identify Matching Track Internally 1) Search Exact Match 2) Search Previous Fuzzy Match 3) Link or Defer
Search Spotify 1) Pull from Search Spotify Queue 2) Search Spotify 3) Link Track
Storage Options
Approach - Profile Spotify's Rate Limit and Increase Throughput to that Limit
The 720 requests per hour is a conservative limit. We could likely increase the speed of the entire system up to the actual limit, but we've found Spotify's limit to be unpredictable and inconsistent with its documentation (see appendix).
We could profile Spotify to find the actual limit.
➡️ Proposal - Process Spotify Events in a Separate Queue; Then determine if increasing Spotify processing speed is required
Isolating listening events that require Spotify...
Appendix
Spotify's Rate Limit doesn't seem to match its documentation
Resource - Spotify's Documentation - Rate Limits
retry_after
header is very large (e.g. 16 hours)Questions