SciML / OperatorLearning.jl

No need to train, he's a smooth operator
https://operatorlearning.sciml.ai/dev
MIT License

Implement FourierLayer for higher dimensions #31

Closed pzimbrod closed 2 years ago

pzimbrod commented 2 years ago

So far, FourierLayer only works in 1-D. This should be extended to more dimensions, ideally with minimal code repetition.

Replacing the standard implementation with a generated function can do this.
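
For reference, a rough sketch of that pattern (the function name spectral_pass and the layout choices are placeholders, not the code in this PR): the array rank N is a static parameter inside @generated, so the permutation and FFT dimensions can be spliced into the body at compile time and a single method covers any number of spatial dimensions.

# Hedged sketch of the @generated pattern; not the package's actual implementation.
using FFTW

@generated function spectral_pass(x::AbstractArray{T, N}) where {T, N}
    perm     = (N, ntuple(identity, N - 1)...)  # move the batch dim to the front
    fft_dims = Tuple(3:N)                       # transform over the spatial dims
    quote
        xp = permutedims(x, $perm)
        X  = fft(xp, $fft_dims)   # spectral representation of the input
        # ... contract with the weight tensor, ifft, permute back ...
        X
    end
end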

codecov[bot] commented 2 years ago

Codecov Report

Merging #31 (8e69b17) into master (9b16e02) will decrease coverage by 2.20%. The diff coverage is 35.71%.

:exclamation: Current head 8e69b17 differs from pull request most recent head 2e9d4f0. Consider uploading reports for the commit 2e9d4f0 to get more accurate results

@@            Coverage Diff             @@
##           master      #31      +/-   ##
==========================================
- Coverage   41.09%   38.88%   -2.21%     
==========================================
  Files           6        8       +2     
  Lines          73      108      +35     
==========================================
+ Hits           30       42      +12     
- Misses         43       66      +23     
Impacted Files            Coverage Δ
src/OperatorLearning.jl   100.00% <ø> (ø)
src/generated.jl            0.00% <0.00%> (ø)
src/FourierLayer.jl        43.75% <39.47%> (-1.42%) :arrow_down:
src/trainable.jl          100.00% <100.00%> (ø)


pzimbrod commented 2 years ago

Also, I ditched rfft for regular fft for the sake of simplicity. Since this roughly doubles computation time and memory requirements, it should be changed later on.
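
The factor of two comes from the conjugate symmetry of real-input transforms; a quick size check with an illustrative grid (not the shapes used in the examples):

# Rough comparison of how many complex values each transform keeps.
using FFTW

x = randn(Float32, 32, 32, 16)

length(fft(x))    # 32*32*16       = 16384 complex values
length(rfft(x))   # (32÷2+1)*32*16 =  8704 complex values, roughly half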

ChrisRackauckas commented 2 years ago

Also, I ditched rfft for regular fft for the sake of simplicity. Since this roughly doubles computation time and memory requirements, it should be changed later on.

Can you capture this in an issue for a future performance improvement? Usually it's easier to remember these kinds of things when they are in the list.

pzimbrod commented 2 years ago

@ChrisRackauckas It seems that codecov complains about the new generated function not being tested. I imagine this can be fixed by taking care of #26 / including tests that call the layer on some data, right?

Can you capture this in an issue for a future performance improvement? Usually it's easier to remember these kinds of things when they are in the list.

Will do.
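
A minimal smoke test of the kind mentioned above (calling the layer on some data) might look like this; the sizes follow the 3-D example later in this thread, and the output-shape assertion is an assumption based on in- and out-channels both being 8:

# Sketch of a forward-pass smoke test; shapes and assertions are assumptions.
using Test, Flux, OperatorLearning

@testset "FourierLayer forward pass, 3-D grid" begin
    layer = FourierLayer(8, 8, (32, 32, 16), (2, 3, 4), σ)
    x = randn(Float32, 8, 32, 32, 16, 2)
    y = layer(x)
    @test size(y) == (8, 32, 32, 16, 2)  # assumption: the layer preserves the grid/batch layout
    @test !any(isnan, y)
end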

ChrisRackauckas commented 2 years ago

I imagine this can be fixed by taking care of https://github.com/SciML/OperatorLearning.jl/issues/26 / including tests that call the layer on some data, right?

Maybe. @generated + coverage is a bit unreliable IIRC.

pzimbrod commented 2 years ago

Training works fine on CPU and GPU with input randn(8,32,32,16,12) (a 3-D grid of 32×32×16), but somehow memory requirements have shot up to the point where the burgers_FNO example runs out of memory on a 24 GB RTX A5000 - even though that is a rather simple 1-D case.

# examples/burgers_FNO.jl l.82
# Run on an RTX A5000 24GB
julia> train!(loss, parameters, train_loader, opt, cb = throttled_cb)
ERROR: a exception was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.

[...] A whole lot more of those

ERROR: a exception was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.
ERROR: KernelException: exception thrown during kernel execution on device NVIDIA RTX A5000
Stacktrace:
  [1] check_exceptions()
    @ CUDA ~/.julia/packages/CUDA/t62lT/src/compiler/exceptions.jl:34
  [2] nonblocking_synchronize
    @ ~/.julia/packages/CUDA/t62lT/lib/cudadrv/context.jl:329 [inlined]
  [3] device_synchronize()
    @ CUDA ~/.julia/packages/CUDA/t62lT/lib/cudadrv/context.jl:317
  [4] CUDA.CuModule(data::Vector{UInt8}, options::Dict{CUDA.CUjit_option_enum, Any})
    @ CUDA ~/.julia/packages/CUDA/t62lT/lib/cudadrv/module.jl:41
  [5] CuModule
    @ ~/.julia/packages/CUDA/t62lT/lib/cudadrv/module.jl:23 [inlined]
  [6] cufunction_link(job::GPUCompiler.CompilerJob, compiled::NamedTuple{(:image, :entry, :external_gvars), Tuple{Vector{UInt8}, String, Vector{String}}})
    @ CUDA ~/.julia/packages/CUDA/t62lT/src/compiler/execution.jl:451
  [7] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/1Ajz2/src/cache.jl:95
  [8] cufunction(f::GPUArrays.var"#broadcast_kernel#17", tt::Type{Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(+), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(*), Tuple{Float64, Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(*), Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(-), Tuple{Int64, Float64}}, Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{ComplexF32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(conj), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{ComplexF32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}}}}}, Int64}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/t62lT/src/compiler/execution.jl:297
  [9] cufunction(f::GPUArrays.var"#broadcast_kernel#17", tt::Type{Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(+), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(*), Tuple{Float64, Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(*), Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(-), Tuple{Int64, Float64}}, Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{ComplexF32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(conj), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{ComplexF32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}}}}}, Int64}})
    @ CUDA ~/.julia/packages/CUDA/t62lT/src/compiler/execution.jl:291
 [10] macro expansion
    @ ~/.julia/packages/CUDA/t62lT/src/compiler/execution.jl:102 [inlined]
 [11] #launch_heuristic#274
    @ ~/.julia/packages/CUDA/t62lT/src/gpuarrays.jl:17 [inlined]
 [12] copyto!
    @ ~/.julia/packages/GPUArrays/umZob/src/host/broadcast.jl:65 [inlined]
 [13] copyto!
    @ ./broadcast.jl:913 [inlined]
 [14] materialize!
    @ ./broadcast.jl:871 [inlined]
 [15] materialize!
    @ ./broadcast.jl:868 [inlined]
 [16] apply!(o::ADAM, x::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, Δ::CUDA.CuArray{ComplexF32, 2, CUDA.Mem.DeviceBuffer})
    @ Flux.Optimise ~/.julia/packages/Flux/qAdFM/src/optimise/optimisers.jl:184
 [17] update!(opt::ADAM, x::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, x̄::CUDA.CuArray{ComplexF32, 2, CUDA.Mem.DeviceBuffer})
    @ Flux.Optimise ~/.julia/packages/Flux/qAdFM/src/optimise/train.jl:26
 [18] update!(opt::ADAM, xs::Zygote.Params, gs::Zygote.Grads)
    @ Flux.Optimise ~/.julia/packages/Flux/qAdFM/src/optimise/train.jl:32
 [19] macro expansion
    @ ~/.julia/packages/Flux/qAdFM/src/optimise/train.jl:112 [inlined]
 [20] macro expansion
    @ ~/.julia/packages/Juno/n6wyj/src/progress.jl:134 [inlined]
 [21] train!(loss::Function, ps::Zygote.Params, data::Flux.Data.DataLoader{Tuple{CUDA.CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}}, Random._GLOBAL_RNG}, opt::ADAM; cb::Flux.var"#throttled#74"{Flux.var"#throttled#70#75"{Bool, Bool, typeof(evalcb), Int64}})
    @ Flux.Optimise ~/.julia/packages/Flux/qAdFM/src/optimise/train.jl:107
 [22] top-level scope
    @ REPL[21]:1
 [23] top-level scope
    @ ~/.julia/packages/CUDA/t62lT/src/initialization.jl:52

So this will need some additional work.
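
One way to narrow down where the memory goes is to time the forward pass and the pullback separately. The sketch below uses standard CUDA.jl and Flux tooling with the sizes from this comment; it is a diagnostic idea, not part of the package:

# Isolate allocations of the forward pass vs. the gradient pass.
using CUDA, Flux, OperatorLearning

model = FourierLayer(8, 8, (32, 32, 16), (2, 3, 4), σ) |> gpu
x  = CUDA.randn(Float32, 8, 32, 32, 16, 12)
ps = Flux.params(model)

CUDA.memory_status()                                  # free/used before the passes
CUDA.@time model(x);                                  # allocations of the forward pass
CUDA.@time gradient(() -> sum(abs2, model(x)), ps);   # allocations of the pullback
CUDA.memory_status()                                  # compare afterwards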

pzimbrod commented 2 years ago

It seems that using @ein within @generated messes up type inference: i𝔉, 𝔉 and 𝔏 all show type Any. Perhaps related to under-Peter/OMEinsum.jl#92.

julia> model = FourierLayer(8,8,(32,32,16),(2,3,4),σ)
FourierLayer with
Convolution path: (8, 8, 32, 32, 16)
Linear path: (8, 8)
Fourier modes: (2, 3, 4)
Activation: σ

julia> x = randn(Float32, 8, 32, 32, 16, 2);

julia> model(x);

julia> @code_warntype model(x)
MethodInstance for (::FourierLayer{typeof(σ), Array{ComplexF32, 5}, Matrix{Float32}, Array{ComplexF32, 5}, Array{Float32, 5}})(::Array{Float32, 5})
  from (a::FourierLayer)(x::AbstractArray{T, N}) where {T, N} in OperatorLearning at /home/zimbropa/OperatorLearning.jl/src/FourierLayer.jl:124
Static Parameters
  T = Float32
  N = 5
Arguments
  a::FourierLayer{typeof(σ), Array{ComplexF32, 5}, Matrix{Float32}, Array{ComplexF32, 5}, Array{Float32, 5}}
  x::Array{Float32, 5}
Locals
  i𝔉 ::Any
  𝔉 ::Any
  𝔏 ::Any
  xₚ::Array{Float32, 5}
  σ::typeof(sigmoid_fast)
  bₗ::Array{Float32, 5}
  bᵩ::Array{ComplexF32, 5}
  Wₗ::Matrix{Float32}
  Wᵩ::Array{ComplexF32, 5}
Body::Any
1 ─       (Wᵩ = Base.getproperty(a, :Wf))
│         (Wₗ = Base.getproperty(a, :Wl))
│         (bᵩ = Base.getproperty(a, :bf))
│         (bₗ = Base.getproperty(a, :bl))
│   %5  = Base.getproperty(a, :σ)::Core.Const(NNlib.σ)
│         (σ = OperatorLearning.fast_act(%5, x))
│   %7  = Core.tuple($(Expr(:static_parameter, 2)))::Core.Const((5,))
│   %8  = ($(Expr(:static_parameter, 2)) - 1)::Core.Const(4)
│   %9  = (1:%8)::Core.Const(1:4)
│   %10 = Base.Generator(Base.identity, %9)::Core.Const(Base.Generator{UnitRange{Int64}, typeof(identity)}(identity, 1:4))
│   %11 = Base.collect(%10)::Vector{Int64}
│   %12 = Core._apply_iterate(Base.iterate, OperatorLearning.tuple, %7, %11)::Core.PartialStruct(Tuple{Int64, Vararg{Int64}}, Any[Core.Const(5), Vararg{Int64}])
│         (xₚ = OperatorLearning.permutedims(x, %12))
│   %14 = Core.tuple((1, 2), (3, 2, 4, 5, 6))::Core.Const(((1, 2), (3, 2, 4, 5, 6)))
│   %15 = OMEinsum.EinCode(%14, (3, 1, 4, 5, 6))::OMEinsum.DynamicEinCode{Int64}
│   %16 = Core.tuple(Wₗ, xₚ)::Tuple{Matrix{Float32}, Array{Float32, 5}}
│         (𝔏 = OMEinsum.einsum(%15, %16))
│   %18 = Base.broadcasted(OperatorLearning.:+, 𝔏, bₗ) ::Any
│         Base.materialize(%18)
│   %20 = Core.tuple((1, 2, 3, 4, 5), (6, 1, 3, 4, 5))::Core.Const(((1, 2, 3, 4, 5), (6, 1, 3, 4, 5)))
│   %21 = OMEinsum.EinCode(%20, (6, 2, 3, 4, 5))::OMEinsum.DynamicEinCode{Int64}
│   %22 = Wᵩ::Array{ComplexF32, 5}
│   %23 = xₚ::Array{Float32, 5}
│   %24 = (3:$(Expr(:static_parameter, 2)))::Core.Const(3:5)
│   %25 = Base.Generator(Base.identity, %24)::Core.Const(Base.Generator{UnitRange{Int64}, typeof(identity)}(identity, 3:5))
│   %26 = Base.collect(%25)::Vector{Int64}
│   %27 = OperatorLearning.fft(%23, %26)::Array{ComplexF32, 5}
│   %28 = Core.tuple(%22, %27)::Tuple{Array{ComplexF32, 5}, Array{ComplexF32, 5}}
│         (𝔉 = OMEinsum.einsum(%21, %28))
│   %30 = Base.broadcasted(OperatorLearning.:+, 𝔉,  bᵩ)::Any
│         Base.materialize(%30)
│   %32 = 𝔉 ::Any
│   %33 = (3:$(Expr(:static_parameter, 2)))::Core.Const(3:5)
│   %34 = Base.Generator(Base.identity, %33)::Core.Const(Base.Generator{UnitRange{Int64}, typeof(identity)}(identity, 3:5))
│   %35 = Base.collect(%34)::Vector{Int64}
│         (i𝔉 = OperatorLearning.ifft(%32, %35))
│   %37 = σ::Core.Const(NNlib.sigmoid_fast)
│   %38 = 𝔏 ::Any
│   %39 = OperatorLearning.real(i𝔉) ::Any
│   %40 = (%38 + %39)::Any
│   %41 = Base.broadcasted(%37, %40)::Any
│   %42 = Base.materialize(%41)::Any
│   %43 = (2:$(Expr(:static_parameter, 2)))::Core.Const(2:5)
│   %44 = Base.Generator(Base.identity, %43)::Core.Const(Base.Generator{UnitRange{Int64}, typeof(identity)}(identity, 2:5))
│   %45 = Base.collect(%44)::Vector{Int64}
│   %46 = Core.tuple(1)::Core.Const((1,))
│   %47 = Core._apply_iterate(Base.iterate, OperatorLearning.tuple, %45, %46)::Tuple{Vararg{Int64}}
│   %48 = OperatorLearning.permutedims(%42, %47)::Any
└──       return %48

Tullio.jl can't be used as an alternative since it doesn't work inside generated functions at all; see mcabbott/Tullio.jl#129.
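
As a side note, one way to check whether the Any comes from the dynamic EinCode path rather than from @generated itself is to compare OMEinsum's two entry points in a plain function. The index labels below mirror the linear-path contraction from the @code_warntype output above; the shapes are small placeholders:

# Probe sketch: dynamic EinCode (as emitted by the generated code) vs. the
# static string-macro form of the same contraction.
using OMEinsum

W  = randn(Float32, 8, 8)
xp = randn(Float32, 2, 8, 4, 4, 4)   # (batch, channel, x, y, z)

f_dyn(W, xp)  = einsum(EinCode(((1, 2), (3, 2, 4, 5, 6)), (3, 1, 4, 5, 6)), (W, xp))
f_stat(W, xp) = ein"ij,kjlmn->kilmn"(W, xp)

# @code_warntype f_dyn(W, xp)    # DynamicEinCode: may infer as Any
# @code_warntype f_stat(W, xp)   # StaticEinCode: usually inferable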