JuliaSmoothOptimizers / LinearOperators.jl

Linear Operators for Julia

Compressed LBFGS (forward) operator #258

Open paraynaud opened 1 year ago

paraynaud commented 1 year ago

@dpo, this is the first version of an implementation of a compressed LBFGS (forward) operator. I made a first structure, as well as a Matrix interface and a mul! method. There is a lot of room for improvement, but it is functional right now. I haven't added tests yet.
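
For context, a "compressed" (compact) forward L-BFGS operator applies B*v through the compact representation of Byrd, Nocedal and Schnabel rather than by looping over the stored pairs. The sketch below only illustrates that representation with B₀ = δI; the names compact_bfgs_mul, Sk, Yk and δ are made up for illustration and are not the PR's actual fields or API.

using LinearAlgebra

function compact_bfgs_mul(v::AbstractVector, Sk::AbstractMatrix, Yk::AbstractMatrix, δ::Real)
  SᵀY = Sk' * Yk
  Dk = Diagonal(diag(SᵀY))             # Dk = diag(sᵢᵀyᵢ)
  Lk = tril(SᵀY, -1)                   # strictly lower triangular part of SᵀY
  K = [δ*(Sk'*Sk) Lk; Lk' -Dk]         # small (2m)×(2m) middle block
  U = hcat(δ .* Sk, Yk)                # n×(2m) block [B₀Sk  Yk]
  return δ .* v .- U * (K \ (U' * v))  # B*v = δv - U K⁻¹ Uᵀv, no n×n matrix formed
end

Since K is only (2m)×(2m), the dominant cost of a product is the two tall-skinny multiplications with U, which is what makes this formulation attractive on GPUs.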

github-actions[bot] commented 1 year ago
Package name (latest / stable): CaNNOLeS.jl, DCISolver.jl, FletcherPenaltySolver.jl, JSOSolvers.jl, Krylov.jl, NLPModels.jl, NLPModelsModifiers.jl, PROPACK.jl, Percival.jl, QuadraticModels.jl, SolverTools.jl

paraynaud commented 1 year ago

@dpo, @tmigot, (@amontoison, @geoffroyleconte) I think this PR is pretty mature. Right now, a CompressedLBFGSOperator is faster to update and to apply with mul! than the usual LBFGSOperator. Here are some @benchmark results from my personal laptop, comparing LBFGSOperator and CompressedLBFGSOperator:

using LinearOperators, BenchmarkTools  # CompressedLBFGSOperator comes from this PR

n = 100000
m = 15
mem = m
lbfgs = CompressedLBFGSOperator(n; m)
classic_lbfgs = LBFGSOperator(n; mem)

# fill both operators with the same m pairs (s, y)
for i in 1:m
  s = rand(n)
  y = rand(n)
  push!(lbfgs, s, y)
  push!(classic_lbfgs, s, y)
end

s = rand(n)
y = rand(n)

@benchmark push!(lbfgs, s, y) 
# BenchmarkTools.Trial: 10000 samples with 1 evaluation. # n=10000, m = 5
#  Range (min … max):  145.300 μs …   6.865 ms  ┊ GC (min … max): 0.00% … 96.10%
#  Time  (median):     159.900 μs               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   186.701 μs ± 206.696 μs  ┊ GC (mean ± σ):  3.75% ±  3.42%
#    █        
#   ▇█▇▃▂▂▂▂▆▆▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
#   145 μs           Histogram: frequency by time          440 μs <
#  Memory estimate: 86.28 KiB, allocs estimate: 16.

# BenchmarkTools.Trial: 6184 samples with 1 evaluation. # n=10000,  m=15
#  Range (min … max):  560.800 μs …   9.411 ms  ┊ GC (min … max): 0.00% … 87.77%
#  Time  (median):     733.700 μs               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   796.455 μs ± 299.321 μs  ┊ GC (mean ± σ):  0.95% ±  3.07%
#        ▅▆█▇▇▇▄▃▁ 
#   ▂▂▄▆██████████▇▇▅▅▄▄▃▃▃▂▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
#   561 μs           Histogram: frequency by time         1.55 ms <
#  Memory estimate: 112.86 KiB, allocs estimate: 16.

# BenchmarkTools.Trial: 741 samples with 1 evaluation. # n=100000, m=15
#  Range (min … max):  5.402 ms … 18.267 ms  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     6.557 ms              ┊ GC (median):    0.00%
#  Time  (mean ± σ):   6.731 ms ±  1.057 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
#          ▂▆█▂▃▄▂ ▁▂▁ ▁▃
#   ▃▂▃▄▅▃████████████████▇▇▆▆▆▅▄▆▅▅▄▃▄▃▃▃▃▃▂▂▂▂▃▂▁▁▃▁▁▁▂▁▁▁▁▂ ▄
#   5.4 ms         Histogram: frequency by time        9.66 ms <
#  Memory estimate: 112.86 KiB, allocs estimate: 16.

@benchmark push!(classic_lbfgs, s, y)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation. # n=10000, m = 5
#  Range (min … max):  131.300 μs … 783.500 μs  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     135.500 μs               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   153.789 μs ±  51.113 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
#   █▆▄▄▂▁▁▁▁▁▂▂▁     ▁▁▁                                         ▁
#   ███████████████▇▇▇██████▇▇▆▇▇▆▇▆▆▆▆▆▆▆▇▇▆▆▆▆▅▅▅▅▅▅▅▅▆▄▅▅▅▃▅▅▄ █
#   131 μs        Histogram: log(frequency) by time        380 μs <
#  Memory estimate: 0 bytes, allocs estimate: 0.

# BenchmarkTools.Trial: 5073 samples with 1 evaluation. # n=10000, m = 15
#  Range (min … max):  858.900 μs …   2.479 ms  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     895.200 μs               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   981.256 μs ± 176.964 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
#    ▆█▄▂▁▁▃▄▂▁▁▁▁▂▁▁  ▁▁▁                                        ▁
#   ███████████████████████████▆▇▇▇▇▇▇▇▆▇▆▆▆▇▆▇▆▆▆▅▆▆▆▆▆▅▄▃▅▆▅▅▅▅ █
#   859 μs        Histogram: log(frequency) by time       1.66 ms <
#  Memory estimate: 0 bytes, allocs estimate: 0.

# BenchmarkTools.Trial: 246 samples with 1 evaluation. # n=100000, m=15
#  Range (min … max):  19.177 ms … 27.130 ms  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     19.926 ms              ┊ GC (median):    0.00%
#  Time  (mean ± σ):   20.311 ms ±  1.151 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
#    ▁ ▁▁█▃▂▄    
#   ▅█▆████████▆▆▆▄▄▄▄▃▃▄▂▂▃▁▄▁▃▁▄▁▃▂▂▁▃▁▁▂▁▃▂▄▂▁▁▁▂▂▁▂▃▁▁▁▁▁▁▂ ▃
#   19.2 ms         Histogram: frequency by time        24.6 ms <
#  Memory estimate: 0 bytes, allocs estimate: 0.

Bv = similar(y)
v = ones(n)

@benchmark mul!(Bv, lbfgs, v)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation. # n=10000,  m=5
#  Range (min … max):  63.000 μs …  1.067 ms  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     87.100 μs              ┊ GC (median):    0.00%
#  Time  (mean ± σ):   96.820 μs ± 41.151 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
#     ▁▄▇██▇▆▅▅▄▃▃▂▁▁▁                                          ▂
#   ▂▅█████████████████████▇▇▇▇▆▇▇▅▅▆▅▆▄▆▅▆▆▅▆▅▆▆▅▆▆▅▆▅▆▅▆▅▅▆▅▆ █
#   63 μs        Histogram: log(frequency) by time       273 μs <
#  Memory estimate: 0 bytes, allocs estimate: 0.

# BenchmarkTools.Trial: 10000 samples with 1 evaluation. # n=10000, m=15
#  Range (min … max):  116.300 μs … 801.700 μs  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     128.700 μs               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   134.797 μs ±  29.547 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
#    ▄▆██▆▄▃▃▂▂▂▁▁                                                ▂
#   ████████████████▇▇▇▆▇▆▆▆▄▄▆▅▄▆▅▃▅▄▅▅▅▂▄▂▅▄▄▅▅▅▄▄▂▄▄▄▅▃▄▄▄▄▆▃▅ █
#   116 μs        Histogram: log(frequency) by time        295 μs <
#  Memory estimate: 0 bytes, allocs estimate: 0.

#  BenchmarkTools.Trial: 3477 samples with 1 evaluation. # n=100000, m=15
#  Range (min … max):  1.101 ms …   3.437 ms  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     1.235 ms               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   1.412 ms ± 353.924 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
#    ▃▇█▂        
#   ▄████▆▄▃▃▃▃▂▃▃▂▃▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
#   1.1 ms          Histogram: frequency by time         2.6 ms <
#  Memory estimate: 0 bytes, allocs estimate: 0.

@benchmark mul!(Bv, classic_lbfgs, v)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation. # n=10000,  m = 5
#  Range (min … max):  109.700 μs … 742.600 μs  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     111.600 μs               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   131.146 μs ±  49.643 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
#   █▄▄▃▂ ▁  ▁▁▂▁▁       ▁                                        ▁
#   ████████████████▇▇▇████▇█▇▆▇▆▇▇▇▆▇▇▅▆▆▆▆▆▆▆▆▇▆▅▅▆▅▆▅▅▅▅▅▄▅▅▄▄ █
#   110 μs        Histogram: log(frequency) by time        339 μs <
#  Memory estimate: 0 bytes, allocs estimate: 0.

# BenchmarkTools.Trial: 10000 samples with 1 evaluation. # n=10000,  m = 15
#  Range (min … max):  311.800 μs …  25.272 ms  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     327.500 μs               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   394.421 μs ± 279.993 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
#   █▄▃▃▃▃▂▃▄▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁                                      ▁
#   ████████████████████████████▇██▇██▇▇▆▆▇▆▇▆▇▆▆▆▆▆▆▆▆▆▅▆▅▅▆▄▄▄▅ █
#   312 μs        Histogram: log(frequency) by time        905 μs <
#  Memory estimate: 0 bytes, allocs estimate: 0.

# BenchmarkTools.Trial: 951 samples with 1 evaluation. # n=100000, m=15
#  Range (min … max):  4.579 ms …   9.967 ms  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     5.061 ms               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   5.241 ms ± 560.678 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
#      ▃█▅▄▂▃▁        
#   ▂▃▇███████▇▆▆▄▆▅▆▆▅▄▄▅▅▅▄▄▄▅▄▄▄▄▃▂▃▃▂▃▂▂▃▂▂▁▁▂▂▂▁▂▁▂▁▁▂▁▁▂▁ ▃
#   4.58 ms         Histogram: frequency by time        6.88 ms <
#  Memory estimate: 0 bytes, allocs estimate: 0.

Some allocations remain in push!(lbfgs, s, y), due to several mandatory matrix inversions (once the number of stored pairs equals mem).
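
Not the PR's code, but for illustration: a standard way to shave such allocations is to preallocate the small mem×mem blocks and use an in-place factorization with ldiv! instead of forming an explicit inverse. A minimal sketch with made-up names:

using LinearAlgebra

mem = 15
A = rand(mem, mem); A = A' * A + I  # stand-in for a small mem×mem SPD block
F = cholesky!(Symmetric(A))         # factorize in place, reusing A's storage
b = rand(mem)
x = similar(b)
ldiv!(x, F, b)                      # solves for x without allocating inv(A)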

Next, I plan to add tests for the CUDA architecture; I tried to design my code to work on both CPU and GPU.

geoffroyleconte commented 1 year ago

Hi, thanks for this PR. Usually in this package the functions to perform matrix-vector products are implemented in prod!, for example: https://github.com/JuliaSmoothOptimizers/LinearOperators.jl/blob/a2043722a09efdf64ed89a6d4912606a9c419ab9/src/lbfgs.jl#L201

Is it possible to make a similar change (your mul! function should go to prod!)?

And is it possible to remove CUDA from the deps? I can only see 3 lines where you use CUDA functions; maybe you could add a way for the user to specify directly the type they need if they want to use CUDA? See for example: https://github.com/JuliaSmoothOptimizers/LinearOperators.jl/blob/a2043722a09efdf64ed89a6d4912606a9c419ab9/src/constructors.jl#L28

amontoison commented 1 year ago

It's more generic if you add an argument with the storage type S. It will also work with other GPU backends, whereas your current implementation is restricted to Nvidia GPUs.
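
To make the suggestion concrete, here is a hedged sketch of what a storage-type argument could look like; the keyword S and everything in the body are assumptions for illustration, not the PR's actual signature:

function CompressedLBFGSOperator_sketch(n::Int; mem::Int = 5, S::Type = Vector{Float64})
  T = eltype(S)
  sk = S(undef, n)  # works for Vector, CuVector, oneArray, ROCVector, …
  yk = S(undef, n)
  return (n = n, mem = mem, T = T, sk = sk, yk = yk)  # placeholder container
end

CompressedLBFGSOperator_sketch(10_000)                         # CPU (default)
CompressedLBFGSOperator_sketch(10_000; S = CuVector{Float32})  # Nvidia GPU, with CUDA loaded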

paraynaud commented 1 year ago

> Hi, thanks for this PR. Usually in this package the functions to perform matrix-vector products are implemented in prod!, for example:
>
> https://github.com/JuliaSmoothOptimizers/LinearOperators.jl/blob/a2043722a09efdf64ed89a6d4912606a9c419ab9/src/lbfgs.jl#L201
>
> Is it possible to make a similar change (your mul! function should go to prod!)?

Adding prod! is not an issue. Since prod! is a 5-argument method, I will add it to the structure as a new field, similarly to LBFGSOperator.
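
For reference, in LinearOperators.jl the product is stored as a closure taking (res, v, α, β), i.e. computing res .= α*(B*v) .+ β*res. A hedged sketch of the wrapping, where compressed_mul! is a hypothetical stand-in for the operator's own kernel:

using LinearOperators

compressed_mul!(res, v) = (res .= 2 .* v)  # placeholder for the real product

n = 100
prod! = (res, v, α, β) -> begin
  if β == zero(β)
    compressed_mul!(res, v)
    α == one(α) || (res .*= α)
  else
    tmp = similar(res)
    compressed_mul!(tmp, v)
    res .= α .* tmp .+ β .* res
  end
end
op = LinearOperator(Float64, n, n, true, true, prod!)  # symmetric, hermitian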

> And is it possible to remove CUDA from the deps? I can only see 3 lines where you use CUDA functions; maybe you could add a way for the user to specify directly the type they need if they want to use CUDA? See for example:

> It's more generic if you add an argument with the storage type S. It will also work with other GPU backends, whereas your current implementation is restricted to Nvidia GPUs.

The user can directly specify the types they need, for both the <:AbstractVector and the <:AbstractMatrix parameters, but I ease their job by determining which one to use depending on the architecture. Removing CUDA would make the operator more complicated for any new user. Another advantage is that you don't need different test files for buildkite.

amontoison commented 1 year ago

> The user can directly specify the types they need, for both the <:AbstractVector and the <:AbstractMatrix parameters, but I ease their job by determining which one to use depending on the architecture. Removing CUDA would make the operator more complicated for any new user. Another advantage is that you don't need different test files for buildkite.

You suppose that GPU == Nvidia, but if tomorrow we want to use the same operator on Intel GPUs, will you add oneAPI.jl as a dependency? I don't understand your argument about test files for buildkite; you can use buildkite to run all the tests of LinearOperators if you want.

paraynaud commented 1 year ago

> You suppose that GPU == Nvidia

For now, yes. But later, if necessary, I would improve the default_gpu method, as well as the structures defined accordingly.

> but if tomorrow we want to use the same operator on Intel GPUs, will you add oneAPI.jl as a dependency?

I guess it will need oneAPI as a dependency.

> I don't understand your argument about test files for buildkite; you can use buildkite to run all the tests of LinearOperators if you want.

In the end, it multiplies GPU packages as dependencies, but the exact same code will work on every architecture.

Personally, I think it is better if the user doesn't need to spell out the types of the data structures, but the drawback is having GPU packages as dependencies :/

geoffroyleconte commented 1 year ago

I agree with @amontoison here. Maybe you could use Requires.jl (see https://github.com/JuliaSmoothOptimizers/RipQP.jl/blob/d69c91618b95731f32cd2701f945bad1c254e3f6/src/RipQP.jl#L19) if there is no other choice, but I feel like it could be avoided if you make some minor changes and document your operator well.

If it is not possible, I suggest you make this PR work on the CPU only, and once it is merged we can discuss the GPU compatibility in another PR.

paraynaud commented 1 year ago

> if you make some minor changes and document your operator well

Sure, but I think it is the user's loss. I am thinking about Requires, but I am not sure how to make it work.

> If it is not possible, I suggest you make this PR work on the CPU only, and once it is merged we can discuss the GPU compatibility in another PR.

I don't mind, even if I don't like it. But I would prefer a clear answer on this one, rather than applying this scheme to my other modules for nothing ;)

geoffroyleconte commented 1 year ago

Then I would advise against adding CUDA to the dependencies.

We used the following multiple times in this package:

https://github.com/JuliaSmoothOptimizers/LinearOperators.jl/blob/a2043722a09efdf64ed89a6d4912606a9c419ab9/src/constructors.jl#L28

Using what you propose would mean changing all these functions too if we want to have a coherent API for the whole package.

paraynaud commented 1 year ago

> Then I would advise against adding CUDA to the dependencies.
>
> We used the following multiple times in this package:
>
> https://github.com/JuliaSmoothOptimizers/LinearOperators.jl/blob/a2043722a09efdf64ed89a6d4912606a9c419ab9/src/constructors.jl#L28

It would be better; the user would not have to change S manually when the architecture launching the code is a GPU.

My motivation for implementing CompressedLBFGSOperator is that the current implementation of the (forward) LBFGSOperator does not translate to GPU, so not everything in the module would need to change.

amontoison commented 1 year ago

> It would be better; the user would not have to change S manually when the architecture launching the code is a GPU.
>
> My motivation for implementing CompressedLBFGSOperator is that the current implementation of the (forward) LBFGSOperator does not translate to GPU, so not everything in the module would need to change.

Honestly, I don't see a big difference between T=Float64, gpu=true and S=CuVector{Float64}. I'm not sure that default_gpu() should return true whenever a GPU is available: if you use it in optimization methods that only work on CPU, you will get errors or slow conversions.

paraynaud commented 1 year ago

> Honestly, I don't see a big difference between T=Float64, gpu=true and S=CuVector{Float64}.

My code needs neither gpu=true nor S=CuVector{Float64}; T=Float64 is the default. If you switch to T=Float32, M and S change according to T.

> I'm not sure that default_gpu() should return true whenever a GPU is available.

If this corner case happens, the user will give their desired M and S types.

> If you use it in optimization methods that only work on CPU, you will get errors or slow conversions.

I am not sure why I would get errors or slow conversions that I would not also get if I implemented it the other way.

Wanting structures unrelated to the architecture is, for me, a small minority of use cases.

paraynaud commented 1 year ago

@geoffroyleconte I added Requires and removed CUDA from the dependencies, and the tests pass (on the CPU). To avoid messing up the __init__ of LinearOperators, I made a submodule ModCompressedLBFGSOperator and use Requires in it.

@amontoison unfortunately, I got (at least) one issue with the GPU :/

amontoison commented 1 year ago

Paul, could you test with the master branch of CUDA.jl? You just need a modification similar to https://github.com/JuliaSmoothOptimizers/Krylov.jl/blob/main/.buildkite/pipeline.yml#L33
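
For readers following along: the modification amounts to installing CUDA.jl from its master branch in the test environment before running the tests, along these lines (a sketch, assuming the CI can fetch packages):

using Pkg
Pkg.add(name = "CUDA", rev = "master")  # master instead of the latest release
Pkg.test("LinearOperators")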

paraynaud commented 1 year ago

It seems to work on GPU now, but it struggles to invert 3 intermediate matrices :/ These inversions should cap the performance of the method.

dpo commented 1 year ago

Let's not add submodules, please.

geoffroyleconte commented 1 year ago

I still think that removing CUDA entirely is a better solution. As mentioned by @amontoison, you would also be able to use other GPU backends. You could add a constructor for your CompressedLBFGSOperator in your other modules if they use CUDA.

paraynaud commented 1 year ago

> I still think that removing CUDA entirely is a better solution. As mentioned by @amontoison, you would also be able to use other GPU backends. You could add a constructor for your CompressedLBFGSOperator in your other modules if they use CUDA.

It may use other GPU backends, and it can be done automatically by adding 2 lines in the Requires part (for each GPU backend). In the worst case, the user defines the structures he wants with the optional parameters. Having the same code working on any architecture is a non-negligible asset for me, especially for tutorials and demonstrations!

codecov[bot] commented 1 year ago

Codecov Report

Base: 97.32% // Head: 96.18% // Decreases project coverage by -1.14% :warning:

Coverage data is based on head (11e3b34) compared to base (9591943). Patch coverage: 82.79% of modified lines in pull request are covered.

:exclamation: Current head 11e3b34 differs from pull request most recent head 341332f. Consider uploading reports for the commit 341332f to get more accurate results

Additional details and impacted files:

@@            Coverage Diff             @@
##             main     #258      +/-   ##
==========================================
- Coverage   97.32%   96.18%   -1.14%
==========================================
  Files          14       15       +1
  Lines        1009     1102      +93
==========================================
+ Hits          982     1060      +78
- Misses         27       42      +15

Impacted files:
src/compressed_lbfgs.jl: 82.79% <82.79%> (ø)
src/utilities.jl: 96.47% <0.00%> (+1.17%) :arrow_up:


amontoison commented 1 year ago

> It may use other GPU backends, and it can be done automatically by adding 2 lines in the Requires part (for each GPU backend). In the worst case, the user defines the structures he wants with the optional parameters. Having the same code working on any architecture is a non-negligible asset for me, especially for tutorials and demonstrations!

Paul, the code will also work on any architecture if you use what Geoffroy and I suggest.

paraynaud commented 1 year ago

> Paul, the code will also work on any architecture if you use what Geoffroy and I suggest.

I know. I don't understand the issue; is it because I added Requires as a dependency? If the purpose of the tools we provide is to be democratized, what I propose is really useful, with no drawback now that CUDA is not a dependency anymore. Between:

CompressedLinearOperator(n)

runnable on any architecture and

CompressedLinearOperator(n; T=..., V=CUDA.CuVector{T, CUDA.Mem.DeviceBuffer}, M=CUDA.CuMatrix{T, CUDA.Mem.DeviceBuffer})

which must be changed for each architecture, the first option is clearly better (to me).

amontoison commented 1 year ago

The issue is that I didn't understand what you wanted to achieve. Your previous message helped me understand.

Your example has a big drawback: the behavior differs depending on whether you have a GPU or not, and the user has no control over it. In practice we don't always want to use an operator or routine specialized for GPUs. Julia packages with GPU support are not democratized, and it will never be the case (from my point of view) given the constraints of GPU programming.

About your previous message: the type of the GPU buffer is not required in the constructor. T could have a default value derived from the S or M argument, because you can recover it with eltype(S) or eltype(M). You can also provide only S or M with this function: https://github.com/JuliaSmoothOptimizers/Krylov.jl/blob/main/src/krylov_utils.jl#L241
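
A hedged sketch of that idea, recovering the element type and a matching matrix type from a single storage-type argument S (the names below are assumptions, not the Krylov.jl helper's actual API):

storage_eltype(::Type{S}) where {S <: AbstractVector} = eltype(S)

matrix_type(::Type{Vector{T}}) where {T} = Matrix{T}
# extended per backend once it is loaded, e.g.
# matrix_type(::Type{CuVector{T}}) where {T} = CuMatrix{T}

S = Vector{Float32}
T = storage_eltype(S)  # Float32
M = matrix_type(S)     # Matrix{Float32}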

So, the question is: do we prefer

CompressedLinearOperator(n)
# Which GPU backend will be used in the future if you have an Intel CPU and an Nvidia GPU, because oneAPI.jl and CUDA.jl are both installed?
# Do we need to provide all the options if we have an Nvidia GPU but don't want to use it?

CompressedLinearOperator(n, S=CuVector{T})

paraynaud commented 1 year ago

> About your previous message: the type of the GPU buffer is not required in the constructor.

I tried without it, but then this line failed to be typed correctly:

Dₖ = Diagonal(V(undef, mem))

Adding CUDA.Mem.DeviceBuffer made it work.

> Your example has a big drawback: the behavior differs depending on whether you have a GPU or not, and the user has no control over it. In practice we don't always want to use an operator or routine specialized for GPUs.

There is no drawback, because the user still has control:

CompressedLBFGSOperator(n::Int; mem::Int=5, T=Float64, M=default_matrix_type(; T), V=default_vector_type(; T))

The trick is that default_matrix_type and default_vector_type return Matrix and Vector by default, but the Requires part overwrites those two methods depending on the loaded modules. If you don't want to rely on the default M and V, you can pass M and V explicitly, replacing default_matrix_type and default_vector_type, as Geoffroy suggested, but I believe that most of the time you should not have to.
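
A minimal sketch of this Requires mechanism, assuming the CUDA.jl registry UUID below is correct; the method bodies are illustrative, not the PR's code:

# inside the package module
using Requires

default_matrix_type(; T::DataType = Float64) = Matrix{T}
default_vector_type(; T::DataType = Float64) = Vector{T}

function __init__()
  @require CUDA="052768ef-5323-5732-b1bb-66c8b64840ba" begin
    # once CUDA.jl is loaded, prefer GPU storage when a device is usable
    default_matrix_type(; T::DataType = Float64) = CUDA.functional() ? CUDA.CuMatrix{T} : Matrix{T}
    default_vector_type(; T::DataType = Float64) = CUDA.functional() ? CUDA.CuVector{T} : Vector{T}
  end
end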

> T could have a default value derived from the S or M argument, because you can recover it with eltype(S) or eltype(M).

I followed the LBFGSOperator API, which also has T as a parameter.

If several GPU backend modules are loaded, fair enough: the user will have to specify M and V manually.

geoffroyleconte commented 1 year ago

The LBFGSOperator in this package uses scalar indexing, so it is not made for GPUs anyway. As I mentioned in a previous message, the API in this package currently uses this: https://github.com/JuliaSmoothOptimizers/LinearOperators.jl/blob/9591943da03d9f113f47449447df2603305d2d74/src/constructors.jl#L39 (you can use S = CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, for example). Having multiple ways to do the same thing for different operators is confusing. I am not against changing the API for the whole package, but I think that this should not be done in this PR.
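
For illustration, this is the scalar-indexing limitation being referred to (assuming a functional CUDA.jl setup):

using CUDA
CUDA.allowscalar(false)  # the default behavior in scripts
v = CUDA.rand(10)
# v[1]                   # would error: scalar getindex is disallowed on GPU arrays
sum(v)                   # fine: runs as a GPU-side reduction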