Harrison97 / spotipy-plamere

A light weight Python library for the Spotify Web API
http://spotipy.readthedocs.org/
MIT License
23 stars · 3 forks

The Spotify object and memory usage #6

Open SHxKM opened 4 years ago

SHxKM commented 4 years ago

Given that this is now the maintained version, I’m posting this question here. Thanks @Harrison97 for taking the initiative.

I’m fairly certain that I am misusing this library in some way. The reason I think so is that when I process especially large Spotify libraries, my memory usage spikes dramatically. Note that I process Apple Music libraries of the same size and don’t see the same issue.

So, basically, I authenticate the user with a refresh token so I can run background refresh tasks on their behalf. After successfully refreshing the tokens, I use spotipy to get the Spotify client object. I then pass this object through multiple functions that wrap spotipy functionality: scanning artists, libraries, pagination, etc.

Is this the correct way to do things? Again, I’m doing essentially the same thing with (my own) Apple Music parser, and the difference in memory usage is dramatic. Am I missing something?

deeplusplus commented 4 years ago

In general it makes sense to create a single Spotipy client object and pass it around for use by various functions/collaborators. But it's hard to say without knowing exactly what your current implementation is, what behavior you are observing, and what your desired/expected behavior is.
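To illustrate the pattern (a minimal sketch; `scan_library` and the token are illustrative placeholders, not part of spotipy):

```python
# Sketch of the single-client pattern: create one Spotify client per
# user/task and pass that same instance to every function that needs it.
def scan_library(sp):
    # Both calls reuse the same client (and its underlying session).
    followed = sp.current_user_followed_artists(limit=50)
    saved = sp.current_user_saved_tracks(limit=50)
    return followed, saved

# sp = spotipy.Spotify(auth=access_token)   # built once per user/task
# followed, saved = scan_library(sp)        # same instance reused
```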

SHxKM commented 4 years ago

@deeplusplus Thanks for your help. I'll try to be more specific.

what your current implementation is

  1. Authorize the user

def authorize_user(user):
    ...
    spotify = spotipy.Spotify(auth=new_user_token)
    return spotify

  2. Then, using the spotify object we got, get all artists in their library (including artists appearing on tracks), for example:

def get_spotify_followed(spotify_object):
    d = spotify_object.current_user_followed_artists(limit=50)
    all_res = paginate_spotify(spotify_object, d, chosen_key="artists")
    final_list = []
    for item in all_res["items"]:
        final_list.append({"name": item["name"], ....})
    return final_list
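(For context, paginate_spotify is a helper along these lines; a rough sketch, not the exact code, assuming spotipy's next() method:)

```python
def paginate_spotify(sp, first_response, chosen_key=None):
    # chosen_key unwraps nested paging objects such as response["artists"].
    page = first_response[chosen_key] if chosen_key else first_response
    all_items = list(page["items"])
    while page.get("next"):
        page = sp.next(page)              # fetches the URL in page["next"]
        if chosen_key:
            page = page[chosen_key]
        all_items.extend(page["items"])
    return {"items": all_items}           # every page's items kept in memory
```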

what behavior you are observing

For libraries with 1,000 artists and above, I'm seeing a steep increase in memory for my Celery process: close to a 100 MB increase per single library scan. I'm aware that we are traversing 20 (1000/50) JSON responses and possibly over 1,000 dicts, but my own implementation against the Apple Music API chews through libraries as large as 9,000 artists without such a huge memory footprint.

and what your desired/expected behavior is.

With the current memory usage pattern, I won't be able to scale efficiently, which is why I posted here, hoping for advice.

deeplusplus commented 4 years ago

Hmmm... It's still hard to say specifically what's going wrong. From a high level, I wonder about two things.
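One thing worth trying: accumulating every page's items into a single list keeps all ~20 JSON responses' worth of data alive at once. A generator that yields items page by page, keeping only the fields you need, bounds memory to roughly one page at a time. A sketch (hypothetical helper name; assumes spotipy's current_user_followed_artists and next methods):

```python
def iter_followed_artists(sp, limit=50):
    # Stream artists one page at a time; only the current page's JSON
    # needs to stay alive while iterating.
    page = sp.current_user_followed_artists(limit=limit)["artists"]
    while page is not None:
        for item in page["items"]:
            # Keep only the fields you need before moving on.
            yield {"name": item["name"], "id": item["id"]}
        nxt = sp.next(page)               # None when there are no more pages
        page = nxt["artists"] if nxt else None
```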

stephanebruckert commented 4 years ago

Possible fix https://github.com/plamere/spotipy/pull/269