JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.85k stars 5.49k forks source link

Range semantics #56479

Open LilithHafner opened 2 weeks ago

LilithHafner commented 2 weeks ago

Theoretical question: Is the range 0.0:0.1:1.0 A) a performance optimization that should behave identically to the vector [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] or B) does it represent the 11 real numbers evenly spaced between 0 and 1, including both endpoints?

Practical implications: Under A, diff(0:.1:1) is [0.1, 0.1, 0.09999999999999998, 0.10000000000000003, 0.09999999999999998, 0.09999999999999998, 0.09999999999999998, 0.10000000000000009, 0.09999999999999998, 0.09999999999999998] (current implementation) Under B, a performance and accuracy improvement would be to define diff(x::AbstractRange) = fill(step(x), length(x)-1) which would return [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1].

Under A, searchsorted(1f8-10:1f8, 1f8) should return 5:9 Under B, searchsorted(1f8-10:1f8, 1f8) should return 9:9 (current implementation, see #34408)

Under A, collecting a range before performing an operation should never change results Under B, a range can be more precise than the collected vector so it is possible for collect to introduce rounding errors which change results.

mbauman commented 2 weeks ago

Even better, diff(A::AbstractRange) could be implemented as r[begin:end-1] .- r[begin+1:end], which gives back the range:

julia> rangediff(r) = r[begin:end-1] .- r[begin+1:end]
rangediff (generic function with 4 methods)

julia> rangediff(0:.1:1)
StepRangeLen(-0.1, 0.0, 10)

julia> println(collect(ans))
[-0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1]

This is in keeping with how we do lots of range arithmetic — it's O(1) where possible, even if it means that the values are different than the collected versions would be. For example, (0:.1:1) ./ 3 is 0:1/30:1/3 instead of the division of the values the range generates:

julia> (0:.1:1) ./ 3
0.0:0.03333333333333333:0.3333333333333333

julia> (collect(0:.1:1) ./ 3) .- collect((0:.1:1) ./ 3)
11-element Vector{Float64}:
  0.0
  0.0
  0.0
 -1.3877787807814457e-17
  0.0
  0.0
 -2.7755575615628914e-17
 -2.7755575615628914e-17
  0.0
  0.0
  0.0

So what's the rule? I think it's gotta be a little gray:

Where does this leave searching? No clue.

adienes commented 2 weeks ago

I have, in the past, though I do not remember the exact discussions, been informed of the following two stylized facts:

is it possible for a clean design to satisfy both? it may have to sacrifice "equally spaced."

andrewjradcliffe commented 2 weeks ago

Even better, diff(A::AbstractRange) could be implemented as r[begin:end-1] .- r[begin+1:end], which gives back the range:

julia> rangediff(r) = r[begin:end-1] .- r[begin+1:end] rangediff (generic function with 4 methods)

julia> rangediff(0:.1:1) StepRangeLen(-0.1, 0.0, 10)

julia> println(collect(ans)) [-0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1]

I think you mean rangediff(r) = r[begin+1:end] .- r[begin:end-1] ^_^;;

oscardssmith commented 2 weeks ago

triage defers to a higher power.

StefanKarpinski commented 2 weeks ago

Lol, took me a while to realize I'm supposed to be the higher power.

mbauman commented 2 weeks ago

So there are more than two possible semantics:

  1. There's the theoretical perfect spacing of 11 elements between 0.0 and 1.0 (or eleven increments of 1 starting from 1f8)
  2. There's the internal implementation, pre-rounding.
  3. There are potentially alternative implementations that try to work on what they think the maths behind the range are, based on the outputs of first(r), last(r), step(r), length(r).
  4. There's the rounded element-wise getindex approximation

In the case of searchsorted, we currently use a mix of 3 and 4. This means that we can find 0.3 in 0:0:.1:1, but I can't find the theoretical 3//10 — instead, both 3//10 and 4//10 correctly land between 0.3 and 0.4. That works, and it's "just" an optimization. But it assumes that the rounded range outputs (4) are monotonically increasing, which is why it falls apart with 1f8:

julia> searchsorted(1f8:1f8+8, 100000000)
1:1

julia> searchsorted(1f8:1f8+8, 100000001)
2:1

julia> searchsorted(collect(1f8:1f8+8), 100000001)
6:5

I think the rule has got to be about what the particular function returns. search/find/etc return indices — indices that are specific to the given range. Thus those indices must be able to be used as indices in the range and get the values they found.

Conversely, simple range arithmetic returns other ranges. As long as the start/stop and length are correctly computed, I think it's ok for the collected variants to give slightly different numeric answers. It's not terribly unlike different summation algorithms, really.

mbauman commented 2 weeks ago

search/find/etc return indices — indices that are specific to the given range. Thus those indices must be able to be used as indices in the range and get the values they found.

Amusingly Horrifyingly, though, I don't see any feasible way to get around this disconnect:

julia> r = 1f8:1f8+12
1.0f8:1.0f0:1.0000002f8

julia> idxs = searchsorted(collect(r), 100000000) # Assuming searchsorted returns the range outputs that match
1:5

julia> r[idxs]
1.0f8:1.0f0:1.0f8

julia> all(==(100000000), r[idxs])
true

julia> r[idxs] .- 100000000
0.0f0:1.0f0:4.0f0

julia> collect(ans)
5-element Vector{Float32}:
 0.0
 1.0
 2.0
 3.0
 4.0