diku-dk / futhark

:boom::computer::boom: A data-parallel functional programming language
http://futhark-lang.org
ISC License

pyopencl compiler forgets to define LocalMemory sizes #1170

Closed by evilmav 3 years ago

evilmav commented 3 years ago

When I call the entry point fail from Python in the code below, I get "NameError: name 'bytes_6155' is not defined" in the generated Python module. The undefined name corresponds to a LocalMemory size passed to set_args for the kernel. I've boiled this down from a larger program as far as I could, so the following does not have to make sense:

-- Futhark
let scan_relaxed [n] 'a 'b (f: a -> b -> a) (init: a) (items: [n]b): [n]a =
    let states=
        loop s = [init] for i in 1...n do
                s ++ [f s[i-1] items[i-1]]
    in states[1:] :> [n]a 

let thread [m] (n: i64) (blas: [m]f32) (blub: f32)
        : [m]f32 =    
    let steady = replicate n blub
    let blasums = map (+blub) blas
    --let blasums = blas
    let step (s, iprev) (inext) = (s, inext)    
    let states = scan_relaxed step (steady, blub) blasums    
    in map (\(s, i) -> i + s[1]) states

entry fail [m] [k] (n: i64) (irf: [m]f32) (ibias: [k]f32) 
        : [k][m]f32 =
    let thread = thread n irf
    in map (thread) ibias
# Python
irf = np.array([1,2,3], dtype=np.float32)
fail(10, irf, irf)

Interestingly, if I replace the let blasums definition with the commented-out version, the compiler instead stops and reports that it encountered a known limitation.

PS: I need to get this working soon, so if you could suggest a workaround for this pattern, it would be much appreciated.

PPS: There has to be a less messy way to implement scan_relaxed...

athas commented 3 years ago

Which version of Futhark are you using? With the version in Git, the compiler itself will crash with a compiler limitation. I think that's because I actually made a fix recently that made the compiler more aware of its own limitations, while before it would sometimes just generate invalid code. I think that is what is happening here.

What is scan_relaxed supposed to do? How is it semantically different from a normal scan? Anyway, here is a nicer way to write it that does not run into limitations related to memory expansion, because all the sizes are known in advance:

let scan_relaxed [n] 'a 'b (f: a -> b -> a) (init: a) (items: [n]b): [n]a =
  let states=
    loop s = replicate n init for i in 1...n do
      s with [i] = f (copy s[i-1]) items[i-1]
  in states

It's still completely sequential, though.
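For comparison, here is a minimal sequential model in Python of what the first scan_relaxed definition in this thread appears to compute: a left fold that keeps every intermediate state except the initial one. The name scan_relaxed_ref is hypothetical and only serves as a reference to check results against; it is not part of Futhark or of the generated module.

```python
from itertools import accumulate


def scan_relaxed_ref(f, init, items):
    """Sequential reference model (hypothetical helper, not a Futhark API).

    Produces one state per item: the first is f(init, items[0]),
    and each later one is f(previous_state, items[i]).
    """
    # accumulate() yields init first and then each intermediate state;
    # dropping the first element mirrors `states[1:]` in the original code.
    return list(accumulate(items, f, initial=init))[1:]


# Usage: a running sum, and the `step` function from the issue.
running_sum = scan_relaxed_ref(lambda a, b: a + b, 0, [1, 2, 3])
stepped = scan_relaxed_ref(lambda st, x: (st[0], x), ("s", 0), [1, 2])
```

Unlike the built-in scan, the accumulator type a and the element type b may differ here, and f does not need to be associative, which is why the evaluation has to be sequential.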

evilmav commented 3 years ago

Thank you for the quick response! I was using v0.18.1. I think the loop range should be 1..<n, but you've just saved my weekend! =)

In short, what I'm trying to do is solve a Fokker-Planck equation over time, where a is a tuple of (Fourier expansion of the probability density, last input at a given time) and b is the actual input to the system. The last map in thread simply extracts the interesting part of the state. Each step f involves solving a problem using the Thomas algorithm etc., and as far as I understand it does not fit within the restrictions of the built-in parallelizable scan (or any way of parallelization I can think of).

But because I will commonly run this over a set of thousands of bias points (ibias in fail()) at a time, the map over thread should allow for reasonable parallelization, shouldn't it?

athas commented 3 years ago

Sure, if the top level parallelism is sufficient, then it's not a problem that an inner loop is sequential. Futhark generates fairly tight code for sequential loops.
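The structure discussed here (an outer map over independent bias points, each running a sequential inner loop) can be sketched as a plain Python model of the reproducer from the first comment. The names thread_ref and fail_ref are hypothetical, and the model only describes the intended result; it says nothing about how the Futhark compiler schedules the work.

```python
def thread_ref(n, blas, blub):
    """Model of `thread`: a sequential loop over blas for one bias value."""
    steady = [blub] * n                      # replicate n blub
    blasums = [b + blub for b in blas]       # map (+blub) blas
    states = []
    state = (steady, blub)                   # initial (s, iprev)
    for inext in blasums:                    # each state depends on the previous one
        state = (state[0], inext)            # step (s, iprev) inext = (s, inext)
        states.append(state)
    # Indexing s[1] requires n >= 2, as in the original program.
    return [i + s[1] for (s, i) in states]


def fail_ref(n, irf, ibias):
    """Model of `fail`: rows are independent, so the outer map can run in parallel."""
    return [thread_ref(n, irf, b) for b in ibias]


result = fail_ref(10, [1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
print(result)  # [[3.0, 4.0, 5.0], [5.0, 6.0, 7.0], [7.0, 8.0, 9.0]]
```

No row reads anything produced by another row, which is the property that makes the top-level map the source of parallelism.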

evilmav commented 3 years ago

I can confirm that the current Git version does detect the compiler limitation, though the message does not mention "states", so it will be rather hard to trace back to the source. ("Cannot handle un-sliceable allocation size: (_group (#groups=k_5079; groupsize=m_5078), bytes_5674, @local)"). This can be closed...

If I may ask a stupid question: in the example above, I save the states in the scan, but use only a fraction of each state in the lambda expression of the following map. Will this generally result in an actual temporary buffer of the states, or is the compiler miraculously smart enough to fuse the following map back into the loop and only keep the output of the lambda in the buffer?

athas commented 3 years ago

It will compute the full temporary buffer. Futhark generally doesn't have any optimisations that change the asymptotics of your program (they tend to be brittle in practice and result in unpredictable performance cliffs).

Those compiler limitation errors really suck, and many of them are not very helpful. Fortunately, the kind of program you wrote here is about the only case where they show up. If you really need irregular allocations (you don't here), you can always use the multicore backend, which does not have the same limitations as the GPU backends, and can handle anything.
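On the temporary-buffer question above: since the compiler will not do it automatically, one option is to apply the projection by hand inside the loop, so that only the current state and the projected outputs are kept. The following Python sketch contrasts the two shapes; scan_then_project and scan_projected are hypothetical names, and whether the equivalent rewrite pays off in Futhark is an assumption that would need measuring.

```python
def scan_then_project(f, project, init, items):
    """Keeps every full state, then projects: one buffer of whole states."""
    states = []
    s = init
    for x in items:
        s = f(s, x)
        states.append(s)
    return [project(s) for s in states]


def scan_projected(f, project, init, items):
    """Projects inside the loop: only the current state is alive at any time."""
    out = []
    s = init
    for x in items:
        s = f(s, x)
        out.append(project(s))
    return out


# Usage with the step and projection from the issue's `thread` function.
step = lambda st, x: (st[0], x)
proj = lambda st: st[1] + st[0][1]
init = ([1.0] * 10, 1.0)
buffered = scan_then_project(step, proj, init, [2.0, 3.0, 4.0])
fused = scan_projected(step, proj, init, [2.0, 3.0, 4.0])
```

Both functions return the same list; the difference is only in how much intermediate state is stored along the way.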