Kotlin / kotlinx.coroutines

Library support for Kotlin coroutines
Apache License 2.0

Proposal: Flow#distinctBy #2806

Open Masterzach32 opened 3 years ago

Masterzach32 commented 3 years ago

Both List and Sequence have the distinctBy extension function, but Flow does not.

Example implementation:

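import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow

// Emits each value at most once, using the value itself as the key.
fun <T> Flow<T>.distinct(): Flow<T> = distinctBy { it }

// Emits a value only if no earlier value produced the same key.
fun <T, K> Flow<T>.distinctBy(selector: (T) -> K): Flow<T> = flow {
    val keySet = mutableSetOf<K>() // every key seen so far; grows without bound
    collect { value ->
        if (keySet.add(selector(value))) // add() returns false for a key already present
            emit(value)
    }
}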

Is there a reason that the Flow API does not have this? Would there be interest in a function like this in the coroutines library? I have several use cases for this in my projects, and I'd assume others might find it useful as well. I could make a PR for this if there is interest.

qwwdfsad commented 3 years ago

Could you please elaborate on the exact use case of the operator?

Implementing it is pretty straightforward indeed, but it also can be a bit problematic due to unbounded memory consumption, especially for potentially infinite flows.

We are not aiming to be a complete sequence replacement here, so it would be really nice to be sure it actually has its uses before adding it.

Masterzach32 commented 3 years ago

Sure thing.

One use case in my Discord bot is to get the artists from one or multiple Spotify playlists. Each playlist can have multiple tracks, and each track can have multiple artists. Each playlist can also have multiple tracks from the same artist. For example:

api.browse.getFeaturedPlaylists(market = Market.US)
    .getAllAsFlowNotNull() // Flow<SimplePlaylist>
    .mapNotNull { api.playlists.getPlaylist(it.id, Market.US) } // Flow<Playlist>
    // get tracks from playlists, emitted in the flow in batches of 50.
    .flatMapConcat { it.tracks.getAllAsFlowNotNull() } // Flow<PlaylistTrack>
    .mapNotNull { it.track?.asTrack } // Flow<Track>
    .flatMapConcat { it.artists.asFlow() } // Flow<SimpleArtist>
    .distinctBy { it.id }
    .mapNotNull { api.artists.getArtist(it.id) } // Flow<Artist>

Here I use distinctBy to filter out duplicate artists before getting the full Artist object from the Spotify API. This helps reduce the total number of HTTP requests.
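
For readers who want to try the idea without a Spotify account, here is a minimal, self-contained sketch of the same pattern. Everything except distinctBy is a hypothetical stand-in: ArtistRef, Track, FakeArtistApi and fetchArtists are made up for illustration, and the stable transform operator is used in place of flatMapConcat only so the snippet needs no opt-in annotation. It counts how many lookups are made with and without deduplication.

```kotlin
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.asFlow
import kotlinx.coroutines.flow.flow
import kotlinx.coroutines.flow.map
import kotlinx.coroutines.flow.toList
import kotlinx.coroutines.flow.transform
import kotlinx.coroutines.runBlocking

// The operator proposed at the top of this thread (not part of kotlinx.coroutines).
fun <T, K> Flow<T>.distinctBy(selector: (T) -> K): Flow<T> = flow {
    val seenKeys = mutableSetOf<K>()
    collect { value -> if (seenKeys.add(selector(value))) emit(value) }
}

// Hypothetical stand-ins for the Spotify model types.
data class ArtistRef(val id: String)
data class Track(val title: String, val artists: List<ArtistRef>)

// Hypothetical API client that only counts how often a lookup is requested.
class FakeArtistApi {
    var requests = 0
        private set

    fun getArtist(id: String): String {
        requests++
        return "Artist($id)"
    }
}

// Flattens tracks into artist references, optionally deduplicates by ID,
// then performs one lookup per remaining reference.
fun fetchArtists(tracks: List<Track>, api: FakeArtistApi, dedupe: Boolean): List<String> =
    runBlocking {
        val refs = tracks.asFlow()
            .transform<Track, ArtistRef> { track -> track.artists.forEach { emit(it) } }
        val toFetch = if (dedupe) refs.distinctBy { it.id } else refs
        toFetch.map { api.getArtist(it.id) }.toList()
    }

fun main() {
    val tracks = listOf(
        Track("t1", listOf(ArtistRef("a"), ArtistRef("b"))),
        Track("t2", listOf(ArtistRef("a"))),
        Track("t3", listOf(ArtistRef("b"), ArtistRef("c"))),
    )
    val plainApi = FakeArtistApi()
    fetchArtists(tracks, plainApi, dedupe = false)
    val dedupedApi = FakeArtistApi()
    val artists = fetchArtists(tracks, dedupedApi, dedupe = true)
    println("${plainApi.requests} requests without distinctBy, ${dedupedApi.requests} with")
    println(artists)
}
```

With this toy data the three tracks reference five artists in total but only three distinct IDs, so the deduplicated run makes three lookups instead of five.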

chachako commented 3 years ago

I'm in favor of this operator; sometimes I don't want to collect duplicate data.

dkhalanskyjb commented 3 years ago

@RinOrz, sure, but the only way to avoid it completely is to store every distinct result ever emitted, which can consume a lot of memory.

The use case above could also be served, for example, by a version of distinctBy that doesn't store everything but only remembers, let's say, a hundred entries. Most HTTP requests would still be avoided, but we would have a hard limit on the amount of memory used by this operator, which is… probably good? Or maybe not and people do need distinctBy that uses unlimited memory much more often? Who can tell?

This is to say that you can help the design process if you describe specifically when and why you want to avoid collecting duplicates.
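
The bounded variant mentioned above (remember only a fixed number of entries) could look roughly like the sketch below. This is only one possible reading of that idea, not an API of kotlinx.coroutines: the name distinctByRecent, the capacity parameter, and the oldest-first eviction policy are all assumptions made for illustration.

```kotlin
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow
import kotlinx.coroutines.flow.flowOf
import kotlinx.coroutines.flow.toList
import kotlinx.coroutines.runBlocking

// Hypothetical operator: suppresses a value if its key is among the last
// `capacity` distinct keys that were let through. Memory use is bounded by
// `capacity`, at the cost of letting some duplicates through again.
fun <T, K> Flow<T>.distinctByRecent(capacity: Int, selector: (T) -> K): Flow<T> {
    require(capacity > 0) { "capacity must be positive, was $capacity" }
    return flow {
        // LinkedHashSet iterates in insertion order, so first() is the oldest key.
        val recentKeys = LinkedHashSet<K>()
        collect { value ->
            if (recentKeys.add(selector(value))) {
                if (recentKeys.size > capacity) {
                    // Assumed policy: evict the key that was remembered first.
                    recentKeys.remove(recentKeys.first())
                }
                emit(value)
            }
        }
    }
}

fun main() = runBlocking {
    val source = flowOf(1, 2, 1, 3, 1, 2)
    // Only two keys are remembered, so 1 and 2 come through a second time.
    println(source.distinctByRecent(capacity = 2) { it }.toList()) // [1, 2, 3, 1, 2]
    // Three keys are enough to cover this input, so it behaves like distinctBy.
    println(source.distinctByRecent(capacity = 3) { it }.toList()) // [1, 2, 3]
}
```

The trade-off is the one described in the comment above: with a small capacity some duplicates reach the collector again, but the set of remembered keys never grows past the limit, even for an infinite flow.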