quentinguidee commented 4 months ago

While requests are performant for many use cases, I have around 225000 listened songs (23000 different) with Your Spotify hosted on a Raspberry PI, so navigate on the client takes a lot of time, especially with the "Last year" or "All" timeframe.

This PR improves performances of getBestSongsNbOffseted, getBestArtistsNbOffseted and getBestAlbumsNbOffseted by a factor of ~10. My approach was to group all Infos of the same track at start. This avoids to compute many times the same Info, which helps a lot for the lightTrackLookupPipeline.

Benchmark

done on mac m1, not on the Raspberry

A request for the top songs with date range "All" before this PR takes in my case 11,8s:

server  | GET /spotify/top/songs?start=2017-11-18T12:18:39.000Z&end=2024-02-26T14:07:27.570Z&nb=20&offset=0 200 11764.380 ms - 25715

After this PR, this time drops to 1,4s, which makes the page loads almost instantly for the biggest request.

server  | GET /spotify/top/songs?start=2017-11-18T12:18:39.000Z&end=2024-02-26T14:07:27.570Z&nb=20&offset=0 200 1363.197 ms - 25095

Update: See below for another performance boost (https://github.com/Yooooomi/your_spotify/pull/348#issuecomment-1967481756)

For reviewer

I removed the $project: { ...getGroupByDateProjection(user.settings.timezone), id: 1 } step. I'm not sure what it does exactly. However, we can add it back if necessary.
I never used MongoDB so feel free to check if some steps can be simplified. I didn't test on previous versions of Mongo, so I'm not sure if I break compatibility with Mongo 4.4 required for Raspberry Pi 4 (even if I don't see where it could happen).

232, #162

quentinguidee commented 4 months ago

Also, I think a lot of requests can beneficiate from this improvement. If everything looks fine I might try to improve the others.

quentinguidee commented 4 months ago

I just realised that the request is sorted by listening count, not total duration, so we can directly limit/offset at start, which would make the request instantaneous

Edit it seems to work (0,2s), but the total_count is not calculated correctly, so I reverted to the 1,2s version

Yooooomi commented 4 months ago

Hello! Thanks a lot for what you've done. It seems good to me.

The facet is a bit weird since I'm not used to it. Would be interesting to test if it outputs the exact same information. This performance boost is insane thanks a lot! I don't want to abuse or anything but yeah it would be very good to have a brand new look on other queries too. Maybe (surely) getBestArtistsNbOffseted and getBestAlbumsNbOffseted would benefit from this too.

I've tried working on #162 at some point but did not find any better implementation back then.

Again, thanks a lot. Don't hesitate if you have any questions, and keep in touch when you feel like you're done with this :)

quentinguidee commented 4 months ago

Hi!

Would be interesting to test if it outputs the exact same information.

They seem to have the exact same result in my case! However, it may be good to test again before merging with other data!

I'll try to improve the PR in the next days to cover more improvements 👍

Yooooomi commented 4 months ago

I'll try to improve the PR in the next days to cover more improvements 👍

Thanks a lot

quentinguidee commented 4 months ago

With the same technique, https://github.com/Yooooomi/your_spotify/pull/348/commits/5669865099a6981c83c03c2d88bc878b0b56f6f6 makes the /top/artists request faster, from 12.1s to 1.3s

For this request however:

I removed track and album from the result. I don't think it makes sense since an artists doesn't have only one track/one album, so it seems safe to remove;
~~I also removed the differents field, it seems to come from other requests like differentArtistsPer.~~ I found the usage, I added it back

quentinguidee commented 4 months ago

And to conclude, /top/albums from 12s to 1.4s 🚀

quentinguidee commented 4 months ago

Should be ready for review! 🚀

RomainNeup commented 4 months ago

Hello @quentinguidee, congrats for the optimization it's pretty impressive 👏 If you have time, could you please have a look at the new mongoDb requests added in my PR https://github.com/Yooooomi/your_spotify/pull/346 to see if it can also be optimized ?

quentinguidee commented 4 months ago

@Yooooomi I have a MongoDB question. I think that almost every requests in this project could be instantaneous if we could have duration_ms, track_id, album_id, artists_id directly indexed in the InfosModel. I don't see another way to do this than adding these fields to all InfosModel. Do you know any way to achieve this automatically with MongoDB?

Yooooomi commented 4 months ago

I can write a migration that adds these fields in the info model. I'll be working on it later today if this can help you.

quentinguidee commented 4 months ago

Thanks! That would be helpful. Also this might open the path for more precise total duration when storing each listening duration separately

Yooooomi commented 4 months ago

Also this might open the path for more precise total duration when storing each listening duration separately

The Spotify API does allow to get the specific listening duration for the recently played tracks.

quentinguidee commented 4 months ago

Yeah unfortunately, but this might be approximated with the time elapsed between two entries in the history? And by default set it to the track duration?

Also, iirc the spotify data when downloaded include the duration of each listening

Yooooomi commented 4 months ago

Spotify history is sometimes not precise, relying on timestamps to establish listening duration might be dangerous. I'm pushing the migration to release/1.8.0 branch. You will be able to rebase/merge from there to have access to albumId, artistId and durationMs in the Infos model.

Yooooomi commented 4 months ago

It's pushed to release/1.8.0

quentinguidee commented 4 months ago

@Yooooomi It seems that the migration does not trigger, is that because the timestamp of the migration is wrong? (probably the duplication of vscode which incremented the timestamp by one). I tried to npm migrate manually but it says its migrated (but the infos don't have the new fields)

Yooooomi commented 4 months ago

It did trigger on my instance. Are you sure you have the latest code built?

quentinguidee commented 4 months ago

Oh the migration was marked as done in the collection "migrations", but was not done for a reason I don't understand. I did a mongorestore and restarted, and it worked

quentinguidee commented 4 months ago

Yeah that works really well, I achieve a request of ~300ms instead of 12.1s with the new fields 🚀 And the requests are also simpler in the code (we can extract almost everything to a new function)

  InfosModel.aggregate([
    { $match: basicMatch(user._id, start, end) },
    {
      $group: {
        _id: "$id",
        duration_ms: { $sum: "$durationMs" },
        count: { $sum: 1 },
      },
    },
    {
      $facet: {
        infos: [
          { $sort: { count: -1, _id: 1 } },
          { $skip: offset },
          { $limit: nb },
        ],
        computations: [
          {
            $group: {
              _id: null,
              total_duration_ms: { $sum: "$duration_ms" },
              total_count: { $sum: "$count" },
            },
          },
        ],
      },
    },
    { $unwind: "$infos" },
    { $unwind: "$computations" },
    {
      $project: {
        _id: "$infos._id",
        result: {
          $mergeObjects: ["$infos", "$computations"],
        },
      },
    },
    {
      $replaceRoot: {
        newRoot: {
          $mergeObjects: ["$result", { _id: "$_id" }],
        },
      },
    },
    { $lookup: lightTrackLookupPipeline("_id") },
    { $unwind: "$track" },
    { $lookup: lightAlbumLookupPipeline() },
    { $unwind: "$album" },
    { $lookup: lightArtistLookupPipeline() },
    { $unwind: "$artist" },
  ]);

The $facet is a way to make two different data process from the same data source. So the first part takes only the most listened tracks, while the second takes the total count and duration from all tracks. Then both are merged

Yooooomi commented 4 months ago

Nice job! I don't remember why we compute total_duration_ms. This feels weird when we just take a subpart of the whole thing.

quentinguidee commented 4 months ago

I think this is for the percentages here:

But if we remove them the request should be extremely fast

Yooooomi commented 4 months ago

I feel like we could do this in a request on the side. So that this request does not have to compute this for every page.

Yooooomi commented 4 months ago

I just pushed a new file architecture to the branch so that it follows modern monorepo architecture. You might encounter issues. I would suggest you merge the 1.8.0 to your branch to see if there are any merge conflicts.

quentinguidee commented 4 months ago

Thanks! Everything looks fine, the 3 requests are almost instantaneous now. I have also refactor the 3 to use the same base request. I think the "differents" field and some lookup can be removed depending on the request (tracks/artists/albums) but that doesn't change the time so clean code is better imo for now while we recreate the requests

quentinguidee commented 4 months ago

I'll keep this PR restricted to these 3 methods for now so we can iterate faster, so this one is ready for review 🧪

Yooooomi commented 4 months ago

Hello, this seems fine to me. Does it change the signature fontend? It seems the objects are not looked up anymore so the frontend cannot display information am I right? Edit : didn't see the lookup pipeline my bad. Seems fine to me, going to merge this evening to test.

Yooooomi commented 4 months ago

OK OK OK this is BLAZING fast. Got from 20.48s down to 0.81s on M1 Pro, I'll update this with numbers from intel i7. Infos mongo collection is 10 times bigger though but still only 41MB (32MB of indexes). Huge work thanks a lot. Can't wait to see how it can be applied to other requests. Many users will thank those performance improvements.

quentinguidee commented 4 months ago

Yeah that changes entirely the feeling of the app! The tradeoff with document databases compared to relational db is that join requests are really bad and should be avoided as much as possible. Without that the performances are extremely good! I'll work on the other queries really soon, I have a lot of free time at the moment.

For the 30MB of indexes yeah that was predictable, but I think it is nothing compared to the space it took in RAM during the giant request (but I don't have benchmarks for that).

Yooooomi / your_spotify

Improve Songs/Artists/Albums query performances by ~50 #348

Benchmark

For reviewer

Related

232, #162