Kotlin / KEEP

Kotlin Evolution and Enhancement Process
Apache License 2.0
3.43k stars 362 forks source link

Sliding window functions #11

Closed cy6erGn0m closed 6 years ago

cy6erGn0m commented 8 years ago

Discussions about the Sliding window functions will be held here.

ilya-g commented 8 years ago

Use cases:

fun view(page: Int) = source.slidingWindow(pageSize).drop(page).firstOrNull()

  • it might be better to use elementAtOrNull(n) instead of drop(n).firstOrNull()
  • I wouldn't call this an effective implementation for view(page), as it requires to populate and then throw away all pages before the required one. Might be better just drop(pageSize * page).take(pageSize)

permutations generation, advanced signal processing, hex to byte array

  • windows returned could have size < windowSize, it would break these usages.
ilya-g commented 8 years ago

Alternatives:

voddan commented 8 years ago

what slidingWindowIndices is supposed t do? What is the signature?

cy6erGn0m commented 8 years ago

It returns a sequence of ranges rather than sub-contents. Not sure we have to make it public, most likely we shouldn't

voddan commented 8 years ago

@cy6erGn0m so it is equivalent to this.indices.slidingWindow() ? I don't think this is worth a separate function.

cy6erGn0m commented 8 years ago

actually it is equivalent to indices.slidingWindow(..).map { IntRange(it.first(), it.last()) } that in fact is too inefficient. The only use-case for this function is the case when you don't want to copy sub contents but just want to keep index ranges for performance reasons while indices.slidingWindow(...) is definitely not a good idea if you need performance.

voddan commented 8 years ago

Yes, now I see the benefit of this function.

What troubles me is that slidingWindowIndices does not depend on the content of the collection.

The use case you described may be covered with slidingWindow overloaded for ClosedRange<T> (IntRange) with a usage like: list.indices.slidingWindow()

cy6erGn0m commented 8 years ago

Yes, but it leads to an ambiguity: IntRange is both ClosedRange and Iterable<Int> so having two functions with different return types could cause confusion

ilya-g commented 8 years ago

TODO: Elaborate use-cases, remove unnatural ones (paging, bytesFromHex).

cy6erGn0m commented 8 years ago

@ilya-g I don't agree that bytesFromHex is unnatural: I do like this in tests (with simplified implementation of slidingWindow) as it is one of the easiest ways to do that.

ilya-g commented 8 years ago

Regarding byteFromHex, I would write it as follows:

fun String.hexToBytes(): ByteArray {
    require(this.length % 2 == 0) { "Must contain even number of hex digits." }
    val size = this.length / 2 
    return ByteArray(size) { p -> (Character.digit(this[p*2],16)*16 + Character.digit(this[p*2+1], 16)).toByte() }
}

It might look not so clear as with slidingWindow, but it's definitely more performant.

Again why it's not the use-case of this proposal — if one need to convert base16 string to a ByteArray, he would use some byteFromHex function rather than resort to ad-hoc slidingWindow every time.

cy6erGn0m commented 8 years ago

@ilya-g in your implementation there is too much arithmetic

Of course there are a lot of implementations available, for example, apache commons codec has one for that purpose. The main idea of the example is to show text-processing use-case. Don't get any ideas about text formats with fixed-size fields, don't you?

ilya-g commented 8 years ago

Don't get any ideas about text formats with fixed-size fields, don't you?

No, I don't. It must be crystal clear from the proposal without guessing.

ilya-g commented 8 years ago
hastebrot commented 8 years ago

Similar API: For Ruby there is not only Enumerable.each_slice(n) but also Enumerable.each_cons(n).

hastebrot commented 8 years ago

Also similar API: For Clojure there is partition in clojure.core: for example (partition n coll) and (partition n step pad coll). step defaults to n. https://clojuredocs.org/clojure.core/partition

For Python ActiveState has some (maybe inefficient) recipes for each_cons() and each_slice() by Hamish Lawson and Brian Quinlan: http://code.activestate.com/recipes/347689-ruby-arrayeach_cons-each_slice-overlapping-and-non/

hastebrot commented 8 years ago

For educational purposes I implemented some variants for fixed and sliding windows [1]. I know there is already a prototype implemented (refered in the KEEP document) with test suite which even uses a RingBuffer.

[1] https://gist.github.com/hastebrot/9b0532d9b9317f13d93aeb79c24125b4

A sliding window for pairs with an offset of one, can be implemented ad hoc with this simple variant which uses zip(). This however does not work with sequenceOf(...).constrainOnce():

fun main(args: Array<String>) {
    val itemSeq = sequenceOf(1, 2, 3, 4, 5) //.constrainOnce()
    val itemPairs = itemSeq.zip(itemSeq.drop(1))
    val itemPairPairs = itemSeq.zip(itemSeq.drop(1)).zip(itemSeq.drop(2))

    println(itemSeq.toList()) // [1, 2, 3, 4, 5]
    println(itemPairs.toList()) // [(1, 2), (2, 3), (3, 4), (4, 5)]
    println(itemPairPairs.toList()) // [((1, 2), 3), ((2, 3), 4), ((3, 4), 5)]
}
ilya-g commented 8 years ago

Regarding to alternatives section: "copy-paste it everywhere" is not a meaningful alternative. I mean we can assume such an alternative for every standard library function proposal.

ilya-g commented 8 years ago

slidingWindow2 is quite useful, usually it comes in form of something like pairwise or so. I think it deserves its own proposal.

On the opposite slidingWindow3 seems to be a specialization of general slidingWindow for which I don't see compelling use cases.

hastebrot commented 8 years ago

@ilya-g slidingWindow3 can be used to implement the Visvalingam-Whyatt polyline simplification algorithm [1]. In the algorithm in a list of points we use a sliding window of three consecutive points to build a triangle.

[1] https://bost.ocks.org/mike/simplify/

But it seems this doesn't sound like a compelling use case compared to pairwise.

ilya-g commented 8 years ago

relates to dropTrailing: http://stackoverflow.com/questions/7958712/inconsistent-behaviour-for-xs-slidingn-if-n-is-less-than-size Also methods of GroupedIterator: withPadding and withPartial http://www.scala-lang.org/api/2.11.7/#scala.collection.Iterator$GroupedIterator

ilya-g commented 8 years ago

Use case for slidingWindow(size, step=1): https://github.com/jedrz/cdf/blob/318668f7e2304ebb1b08b9c6db15c7c158b38f5a/src/main/scala/cdf/matcher/ngrams/NGramsEvaluator.scala Computing trends: https://github.com/unicredit/trafalgar/blob/6ccbc19a267df09304b5946ce60bc180d7690101/src/main/scala/eu/unicredit/trafalgar/Rules.scala

ilya-g commented 7 years ago

Similar API: Groovy's collate http://mrhaki.blogspot.ru/2012/04/groovy-goodness-collate-list-into-sub.html http://docs.groovy-lang.org/latest/html/groovy-jdk/java/lang/Iterable.html#collate(int)

Groostav commented 7 years ago

I feel its worth mentioning, we work with 2-dimensonal matricies through JNA from native code, meaning we get a lot of linearized 2D things --that is to say, a 2D grid converted into a 1D array by "stacking" rows one-after-the-other in an array.

The standard strategy to convert them is

val columnCount: Int = ...
val input: Array<Double> = someNativeFunction()
val matrix: List<List<Double>> = input.groupByIndexed { elem, index -> index / columnCount }
//this assumes row-major, you can use column-major by using mod `%` instead of div `/`

note that groupByIndexed is not part of the STD lib, and is a function we wrote --we also use sequences a lot as these matrices are often quite large.

Can we alleviate a large portion of the use case for this function with a simple groupByIndexed operator in the standard lib?

ilya-g commented 7 years ago

@Groostav groupBy returns a map, it would be strange and contradictory to other *Indexed functions if groupByIndexed returned a list of lists.

Groostav commented 7 years ago

quite right @ilya-g , thus

fun <T, K> Iterable<T>.groupByIndexed(keySelector: (index: Int, T) -> K): Map<K, List<T>> 
fun <T, K> Sequence<T>.groupByIndexed(keySelector: (index: Int, T) -> K): Map<K, List<T>>

which would also satisfy my goals, if requiring a bit of reshaping for the output value.

ilya-g commented 7 years ago

I've revised the proposal, now there are three sets of functions considered: chunked to partition in non-overlapped blocks, window to partition in possibly overlapping windows or windows with gaps, and pairwise as a special case of window of size 2 shifted by 1 element to partition to pairs of adjacent elements.

ilya-g commented 7 years ago

fun <T, K> Iterable<T>.groupByIndexed(keySelector: (index: Int, T) -> K): Map<K, List<T>> fun <T, K> Sequence<T>.groupByIndexed(keySelector: (index: Int, T) -> K): Map<K, List<T>>

@Groostav it's an interesting idea, but it's out of scope of current proposal. You can start a new proposal instead.

mfulton26 commented 7 years ago

s/pairwise/windowedPairwise/?

When I read function name "pairwise" I don't know if it takes each pair one by one (chunked pairwise) or if it windows it (size of 2, step 1) or if it generates all-pairs (Cartesian product). If the name were changed to "windowedPairwise" then it would be immediately clear, I think, that this is a special case of "windowed" without even needing to read the documentation.

Thoughts?

voddan commented 7 years ago

When I first read fun chunked( and fun windowed(, I thought I saw chuckled and widowed 0_o

I am not a native english speaker, but it seems to me that certain rhythm is missing from those two words, which is why they sound a little unnatural.

mfulton26 commented 7 years ago

I also wonder: Should this include implementing forEachChunk, forEachWindow, and forEachPair extension functions to iterate over the desired combinations without actually creating intermediate lists?

ilya-g commented 7 years ago

@mfulton26 That's how the overloads taking a lambda work.

mfulton26 commented 7 years ago

That's right. :man_facepalming: Thank you @ilya-g.

marstran commented 7 years ago

I find these functions very useful. One question: I keep thinking that the step parameter of the windowed function should have a default value of 1. Is this something that has been considered?

voddan commented 7 years ago

Why pairwise is windowed(2, 1), not chunked(2)?

As @mfulton26 mentioned, I would expect a different behaviour from this function judging by its name.

Is there any evidence that one is needed more often than another?

voddan commented 7 years ago

I checked it on the internet, and it seems that "pairwise" is generally used to describe all possible combinations of elements, not subsequent pairs of elements.

The only place where "pairwise" means the same as in the proposal is the Python library, possibly because they have a combinations function as well.

I like the @marstran suggestion to add a default value of 1 to the step parameter of the windowedfunctions. If we implement it, then pairwise will be synonymous to windowed(2). That would make pairwise unnecessary and save us a problem of coming up with a clearer name for this function.

ilya-g commented 7 years ago

I checked it on the internet, and it seems that "pairwise" is generally used to describe all possible combinations of elements, not subsequent pairs of elements.

@voddan, could you share your findings? It would be especially useful if you could find an analogous function in some other framework, so we could compare the naming.

If we implement it, then pairwise will be synonymous to windowed(2). That would make pairwise unnecessary and save us a problem of coming up with a clearer name for this function.

pairwise has some performance advantages over windowed(2), so it cannot be replaced by the latter: it doesn't allocate a list to represent a window; also being a simplified case of windowed it has an implementation more suitable to be inlined. Therefore pairwise is inline, while windowed is not.

ilya-g commented 7 years ago

I find these functions very useful. One question: I keep thinking that the step parameter of the windowed function should have a default value of 1. Is this something that has been considered?

@marstran Initially we had both windowed and chunked as a single method with step defaulted to window size, but after splitting it in two methods it became possible to make step default to 1.

The pros of that should be weighed carefully, as it would double the method count occupied by windowed. In case if we add another default parameter to windowed, like dropPartialWindows = true, it would be possible also to make step optional.

Is step = 1 so common that it is a reasonable default?

voddan commented 7 years ago

@voddan, could you share your findings?

I searched StackOverflow for "pairwise" and looked through the questions and how they were using the word. Excluding questions about Python, the majority of questions use "pairwise" to describe each-to-each relationships.

I wasn't able to find any major framework that would use a "pairwise" function except for the Python library, possibly because of the controversy. Sometimes they use it in documentation to describe a pairwise comparison (which is each-to-each).

The wikipedia says that "pairwise" generally means "occurring in pairs" or "two at a time", but in math the term everybody uses is pairwise comparison, which is drastically different. I believe programmers prefer to use the mathematical meaning of this word.

marstran commented 7 years ago

I looked at some other languages to see what they do with functions like these:

Scala has the sliding function which is equivalent to the windowed function. This function has an overload with step = 1.

Ruby has the each_cons function which is equivalent to windowed with step = 1. I couldn't find a function in Ruby where you can provide the step parameter, but you have each_slice which is equivalent to chunked.

F# has the windowed function which is equivalent to Kotlin's windowed function with step = 1. It doesn't take a step parameter. By the way, F# also has the pairwise function which is equivalent to Kotlin's pairwise.

Looking at all these, I find step = 1 to be a reasonable default.

ilya-g commented 7 years ago

Alternative naming options for pairwise:

ilya-g commented 7 years ago

I've renamed pairwise to zipWithNext and changed the parameters of windowed. Now step defaults to 1 and there is a new parameter partialWindows, which controls whether to keep or drop incomplete windows in the end, defaults to false. https://github.com/JetBrains/kotlin/compare/7c2dd7c210b85bcebfc7ca79850e3abf5c18df94~2...7c2dd7c210b85bcebfc7ca79850e3abf5c18df94

voddan commented 7 years ago

Nice! One small question thou: why chunked keeps partial windows, while windowed defaults to not having them? Isn't it a bit inconsistent? I have no personal preference on this one.

ilya-g commented 7 years ago

@voddan The use case of chunked is to process a large collection in several smaller chunks. Usually you don't want the last elements of a collection to be dropped just because they form a chunk of the smaller size than requested. On the other hand window processing cases usually require to have a fixed size window, so it seems reasonable to default to dropping partial windows.

fab1an commented 7 years ago

Will there be support for "folding" function that group sequences into windows? Like having a function that determines when a window starts and another when the window ends resulting in a lazy sequence of these grouped items?

ilya-g commented 7 years ago

@fab1an We considered that function before, see https://github.com/Kotlin/KEEP/blob/master/proposals/stdlib/window-sliding.md#future-advancements, but decided that it would have to generic API and would be not much in demand.

But let's imagine: if you had such function, how would you call it? Could you show some code examples with this function?

cy6erGn0m commented 7 years ago

Btw, I tried to design and implement such function multiple times and every time it resulted in too complex declarations and difficult to understand parameters

dzharkov commented 6 years ago

The declarations have been released in Kotlin 1.2, thus I'm closing the KEEP. Do not hesitate to open a YouTrack issue for any additional suggestions.