JuliaSmoothOptimizers / PartitionedVectors.jl


Dot product #23

Open paraynaud opened 1 year ago

paraynaud commented 1 year ago

Specialize the dot product when `pv1.simulate_vector != pv2.simulate_vector`.

Suppose $w$ is a Vector represented by a PartitionedVector (storing $U_i w$) and

$$v = \sum_{i=1}^N U_i^\top \hat{v}_i;$$

then

$$ v^\top w = \left(\sum_{i=1}^N U_i^\top \hat{v}_i\right)^\top w = \sum_{i=1}^N \hat{v}_i^\top U_i w, $$

which accumulates the element scalar products $\hat{v}_i^\top (U_i w)$.
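Under that formula, the mixed case reduces to dotting element components pairwise. A minimal sketch, assuming `hat_v[i]` holds $\hat{v}_i$ and `Uw[i]` holds $U_i w$ (illustrative names, not the package's internal fields):

```julia
using LinearAlgebra

# Mixed dot product: v is partitioned (components v̂_i), while w is a
# simulated Vector whose element components already store U_i w.
function mixed_dot(hat_v::Vector{Vector{Float64}}, Uw::Vector{Vector{Float64}})
    acc = 0.0
    for (vi, wi) in zip(hat_v, Uw)
        acc += dot(vi, wi)  # accumulate v̂_iᵀ (U_i w)
    end
    return acc
end
```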

If both have `simulate_vector = true`, we should avoid building the Vector. We should add a flag field indicating whether the Vector has been built or not.
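One way to carry that flag, as a sketch only (the type and field names are hypothetical, not the package's API):

```julia
# Element components plus a lazily built dense Vector guarded by a flag.
mutable struct LazyPartitionedVector{T}
    hat_v::Vector{Vector{T}}  # element components v̂_i
    vec::Vector{T}            # dense accumulation; meaningful only when built
    vec_built::Bool           # true once `vec` has been accumulated
end
```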

If both have `simulate_vector = false`, we can try something more complex:

$$ \left(\sum_{i=1}^N U_i^\top \hat{v}_i \right)^\top \left(\sum_{j=1}^N U_j^\top \hat{w}_j \right) = \left(\sum_{i=1}^N \hat{v}_i^\top U_i \right) \left(\sum_{j=1}^N U_j^\top \hat{w}_j \right) = \sum_{i=1}^N \sum_{j=1}^N \hat{v}_i^\top U_i U_j^\top \hat{w}_j. $$

Even if there are $N^2$ terms, most of them should vanish because $U_i U_j^\top = 0$ when elements $i$ and $j$ have no variables in common. After identifying the pairs that do overlap, we will need efficient access to the values of their element components.
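As a sketch of that pair identification, under the same illustrative representation as above (names are assumptions, not the package's API), a variable-to-elements map finds the overlapping pairs without scanning all $N^2$ combinations:

```julia
# For each pair (i, j) of elements sharing variables, record the local
# positions of the shared variables: these are the nonzeros of U_i U_jᵀ.
function overlapping_pairs(element_variables::Vector{Vector{Int}})
    # var_to_pos[v] lists (element index, local position) for each element using v
    var_to_pos = Dict{Int,Vector{Tuple{Int,Int}}}()
    for (i, indices) in enumerate(element_variables)
        for (k, v) in enumerate(indices)
            push!(get!(var_to_pos, v, Tuple{Int,Int}[]), (i, k))
        end
    end
    pairs = Dict{Tuple{Int,Int},Vector{Tuple{Int,Int}}}()
    for occs in values(var_to_pos), (i, ki) in occs, (j, kj) in occs
        push!(get!(pairs, (i, j), Tuple{Int,Int}[]), (ki, kj))
    end
    return pairs
end

# Accumulate only the surviving terms v̂_iᵀ U_i U_jᵀ ŵ_j.
function pairwise_dot(pairs, hat_v, hat_w)
    acc = 0.0
    for ((i, j), positions) in pairs, (ki, kj) in positions
        acc += hat_v[i][ki] * hat_w[j][kj]
    end
    return acc
end
```

The pair structure depends only on the partition, so it could be computed once and reused across dot products.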

paraynaud commented 1 year ago

`norm` may benefit from it too.

paraynaud commented 1 year ago

@dpo, should I implement this before the submission of the article?

paraynaud commented 1 year ago

I started the implementation. For the first case, here is the gain:

using LinearAlgebra, BenchmarkTools, StatsBase, PartitionedVectors

N = 1500
n = 20000
nie = 15
element_variables = map(i -> sample(1:n, nie, replace = false), 1:N)

pv = PartitionedVector(element_variables; n)
pv_vec = PartitionedVector(element_variables; n, simulate_vector = true)
pv_vec .= 1

dot(pv, pv_vec; warn=false)
@benchmark dot(pv, pv_vec; warn=false)
# main branch
# BenchmarkTools.Trial: 3389 samples with 1 evaluation.
#  Range (min … max):  873.900 μs …   3.272 ms  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):       1.371 ms               ┊ GC (median):    0.00%        
#  Time  (mean ± σ):     1.458 ms ± 357.329 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
#         ▁▃███▆▅▅▄▂▃▂▁▂▂▂ ▁▁
#   ▁▁▂▂▃▆█████████████████████▆▆▅▆▅▅▄▄▄▄▄▃▄▃▃▃▂▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▁ ▄
#   874 μs           Histogram: frequency by time         2.55 ms <
#  Memory estimate: 16 bytes, allocs estimate: 1.

# pr-partitioned-dot PR
# BenchmarkTools.Trial: 10000 samples with 1 evaluation.
#  Range (min … max):  34.300 μs … 330.100 μs  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     37.500 μs               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   45.281 μs ±  18.760 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
#   ██▇▆▂▁▁ ▁▅▂▄ ▂▂  ▁ ▁  ▄ ▂▂     ▂      ▁     ▁ ▃▂             ▂
#   ███████████████▇▆█▆███████▇██▆▆██▆▆▃▇▆█▆▆▄▆█████▇▅▆▅▆▄▄▃▄▄▅▄ █
#   34.3 μs       Histogram: log(frequency) by time       109 μs <
#  Memory estimate: 48 bytes, allocs estimate: 2.

It will be used in `trunk` to evaluate the quadratic model $m_k$ at $s$ (`dot(g, s)`).
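For context, with the standard trust-region model (the usual assumption; not spelled out in this thread),

$$ m_k(s) = f(x_k) + g_k^\top s + \tfrac{1}{2}\, s^\top B_k s, $$

so evaluating $m_k$ at $s$ requires exactly this dot product for the $g_k^\top s$ term.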

At some point, we will have to design a heuristic based on the partitioned structure to decide which method to use. In the case where the size of every element is close to $n$, the new dot performs $N$ dot products of size $n$.
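A possible shape for such a heuristic, as a sketch only (the name and threshold are assumptions): compare the accumulated element sizes, which drive the cost of the partitioned dot, against the $O(n)$ cost of a dense dot.

```julia
# The partitioned dot touches every element component once, so it wins when
# the total element size is small relative to n (threshold is illustrative).
function prefer_partitioned_dot(element_variables::Vector{Vector{Int}}, n::Int)
    sum(length, element_variables) < n
end
```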

dpo commented 1 year ago

> In the case where the size of every element is close to $n$, the new dot performs $N$ dot products of size $n$.

If any element has size close to $n$, we should not use the PSS at all.

paraynaud commented 1 year ago

Do you mean that the user should not apply a PartiallySeparableNLPModel, or that PartiallySeparableNLPModels should automatically consider a single function (e.g., the objective function) gathering all the element functions? In the second case, we fall back on a standard QuasiNewtonNLPModel.

I think the choice is not obvious, because limited-memory partitioned quasi-Newton operators remain of interest when the size of the element functions is close to $n$.

We've been talking for a long time about a heuristic to determine which elements should be merged; I think that's one of the next features I should implement. I could adapt the merging depending on the partitioned model considered.

dpo commented 1 year ago

But if there are at least 2 elements and one involves almost all $n$ variables, it will almost surely be more expensive to manage the PSS than to ignore it. It's worth benchmarking a few examples to be sure.

Merging is a somewhat separate topic.