JuliaLinearAlgebra / AlgebraicMultigrid.jl

Algebraic Multigrid in Julia

AMG fails on a Laplacian-like Problem #32

Closed: cortner closed this issue 6 years ago

cortner commented 6 years ago

I've tried to incorporate AMG into one of my codes, where -- unfortunately -- it fails quite spectacularly:

import JLD, AMG
A, f = JLD.load("AMG_example.jld", "A", "f")                 # load the offending system
x = AMG.solve(AMG.ruge_stuben(A), f, 1000, AMG.V(), 1e-10)   # V-cycles, up to 1000 iterations
@show vecnorm(A \ f - x, Inf)                  # error of the AMG solution vs the direct solve
@show vecnorm(A * x - f, Inf)                  # residual of the AMG solution
@show any(isnan.(A \ f))                       # the direct solve itself is fine
@show vecnorm(A - A', Inf)                     # A is exactly symmetric
@show minimum(eigvals(Symmetric(full(A))))     # and positive definite (smallest eigenvalue ~ 0.01)

generates the output

vecnorm(A \ f - x, Inf) = Inf
vecnorm(A * x - f, Inf) = NaN
any(isnan.(A \ f)) = false
vecnorm(A - A', Inf) = 0.0
minimum(eigvals(Symmetric(full(A)))) = 0.009999999999991717

The file needed to reproduce this can be obtained from HERE

The matrix A here is essentially H + \epsilon I, where \epsilon ≈ 0.01, and H can be thought of as a weighted graph Laplacian of a two-dimensional network.
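
For reference, a matrix with this structure can be sketched without the JLD file. The following is a hypothetical stand-in, written in the Julia 0.6 syntax used throughout this thread; the grid size and edge weights are made up and only mimic the description above, so it will not reproduce the failure bit-for-bit:

# Stand-in for A = H + eps*I: H is the graph Laplacian of a weighted
# 2D grid network, shifted so that A is SPD with smallest eigenvalue ~ eps.
function shifted_grid_laplacian(n, shift)
    N = n^2
    Is, Js, Vs = Int[], Int[], Float64[]
    idx(i, j) = (j - 1) * n + i
    for j = 1:n, i = 1:n
        for (di, dj) in ((1, 0), (0, 1))     # edges to right/upper neighbours
            ii, jj = i + di, j + dj
            (ii > n || jj > n) && continue
            w = 1.0 + 0.1 * rand()           # arbitrary positive edge weight
            a, b = idx(i, j), idx(ii, jj)
            append!(Is, [a, b, a, b])        # 2x2 Laplacian stencil of this edge
            append!(Js, [a, b, b, a])
            append!(Vs, [w, w, -w, -w])
        end
    end
    return sparse(Is, Js, Vs, N, N) + shift * speye(N)
end

A = shifted_grid_laplacian(25, 0.01)         # 625x625, symmetric positive definite
f = rand(size(A, 1))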

ranjanan commented 6 years ago

@cortner I wasn't able to reproduce this. Which version of AMG were you on? v0.1.0?

ranjanan commented 6 years ago

With v0.1.0 (and on master), I get:

julia> import JLD, AMG

julia> A, f = JLD.load("AMG_example.jld", "A", "f");

julia> x = AMG.solve(AMG.ruge_stuben(A), f, 1000, AMG.V(), 1e-10)
600-element Array{Float64,1}:
 -0.0251331  
 -0.00494112 
  0.0162619  
 -0.0011254  
  0.00795969 
 -0.0150707  
 -0.0141143  
 -0.00532021 
 -0.0177023  
  0.0181578  
 -0.0127625  
 -0.000450548
 -0.0145081  
  0.0260964  
 -0.0107832  
  0.00765703 
 -0.0163651  
 -0.0220383  
  0.0284923  
  ⋮          
 -0.010587   
  0.0188101  
  0.0012849  
 -0.00156493 
 -0.0276459  
 -0.00969683 
 -0.0063156  
  0.00978282 
 -0.015404   
  0.0230772  
 -0.0296537  
 -0.0207526  
  0.0152367  
 -0.0114013  
 -0.0131285  
 -0.0156694  
  0.0195659  
  0.0140342  
  0.00516803 

julia> @show vecnorm(A \ f - x, Inf)
vecnorm(A \ f - x, Inf) = 2.2245239938295525e-11
2.2245239938295525e-11

julia> @show vecnorm(A * x - f, Inf)
vecnorm(A * x - f, Inf) = 8.179806831876135e-11
8.179806831876135e-11

julia> @show any(isnan.(A\f))
any(isnan.(A \ f)) = false
false

julia> @show vecnorm(A - A', Inf)
vecnorm(A - A', Inf) = 0.0
0.0

julia> @show minimum(eigvals(Symmetric(full(A))))
minimum(eigvals(Symmetric(full(A)))) = 0.009999999999994012
0.009999999999994012

cortner commented 6 years ago

julia> versioninfo()
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin17.3.0)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT NO_AFFINITY NEHALEM)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

julia> Pkg.status("AMG")
 - AMG                           0.1.0              master

cortner commented 6 years ago

I tested that it happens on a fresh restart of Julia

cortner commented 6 years ago

Same happens with

x = AMG.solve(AMG.smoothed_aggregation(A), f, 1000, AMG.V(), 1e-10)

ranjanan commented 6 years ago

I'm on

julia> Pkg.status("AMG")
 - AMG                           0.1.0

julia> versioninfo()
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, broadwell)

And with smoothed_aggregation it's giving:

julia> @show vecnorm(A \ f - x, Inf)
vecnorm(A \ f - x, Inf) = 2.0854013828286444e-11
2.0854013828286444e-11

julia> @show vecnorm(A * x - f, Inf)
vecnorm(A * x - f, Inf) = 7.806648383290593e-11
7.806648383290593e-11

julia> @show any(isnan.(A\f))
any(isnan.(A \ f)) = false
false

julia> @show vecnorm(A - A', Inf)
vecnorm(A - A', Inf) = 0.0
0.0

julia> @show minimum(eigvals(Symmetric(full(A))))
minimum(eigvals(Symmetric(full(A)))) = 0.009999999999994012
0.009999999999994012

ranjanan commented 6 years ago

I wonder if it has to do with the Haswell vs. Nehalem kernels in our stock OpenBLAS builds.

cortner commented 6 years ago

Would it be helpful if I downloaded Julia rather than building locally, and tested there?

ranjanan commented 6 years ago

Yes, could you try it on a downloaded version of Julia too? I just tried on an older machine with this:

julia> versioninfo()
Julia Version 0.6.1
Commit 0d7248e2ff (2017-10-24 22:15 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E7- 8850  @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Nehalem)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, westmere)

and wasn't able to reproduce it either.

cortner commented 6 years ago

julia> versioninfo()
Julia Version 0.6.1
Commit 0d7248e (2017-10-24 22:15 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

julia> norm(AMG.solve(AMG.ruge_stuben(A), f, 1000, AMG.V(), 1e-10), Inf)
Inf

cortner commented 6 years ago

julia> versioninfo()
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

julia> norm(AMG.solve(AMG.ruge_stuben(A), f, 1000, AMG.V(), 1e-10), Inf)
Inf

ranjanan commented 6 years ago

Hmm, very strange. Let me try to find hardware similar to yours so I can reproduce and fix this. Would it be possible for you to test on any other hardware you have available and see if it works?

cortner commented 6 years ago

Not right away, but yes, I'll do that.

ranjanan commented 6 years ago

There's always Juliabox :-)

cortner commented 6 years ago

Yeah - I have to say I’ve been less than thrilled with it :(. Not to worry, I’ll send you more tests.

cortner commented 6 years ago

Next test: this is my late-2017 MacBook Pro.

julia> norm(solve(AMG.ruge_stuben(A), f; maxiter=1_000), Inf)
Inf

julia> versioninfo()
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin17.3.0)
  CPU: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT NO_AFFINITY NEHALEM)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, broadwell)

julia> Pkg.status("AMG")
 - AMG                           0.1.0+             master

cortner commented 6 years ago

My office workstation -- a ca. 4-year-old Linux machine running Ubuntu:

julia> norm(AMG.solve(ruge_stuben(A), f; maxiter = 1_000), Inf)
0.03725863138068789

julia> versioninfo()
Julia Version 0.6.3-pre.0
Commit 93168a6 (2017-12-18 07:11 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, ivybridge)

The only other difference here is that this machine runs a development build of Julia 0.6 (v0.6.3-pre) rather than a release. If you think that could make a difference, I can experiment with it.

ranjanan commented 6 years ago

So it works on something. Good, that's a start.

Can you run x, res = solve(AMG.ruge_stuben(A), f; maxiter=1_000, log = true) and paste res here, please? On both the config where it works and the one where it doesn't.

cortner commented 6 years ago

On Apple Laptop:

julia> x, res = AMG.solve(AMG.ruge_stuben(A), f; maxiter=1_000, log = true)
([8.07392e218, 1.16689e238, Inf, 8.22031e218, 1.18019e238, Inf, 8.10053e218, 1.16562e238, Inf, 8.17945e218  …  Inf, 8.00242e218, 1.16566e238, Inf, 8.1132e218, 1.14831e238, Inf, 7.85136e218, 1.16336e238, Inf], [6.32894, 2.71108, 7.21397, 8.29327, 53.8207, 132.188, 430.102, 414.07, 2863.47, 59123.7  …  1.66233e302, 1.47662e302, 3.11071e303, 5.58143e303, 1.09626e305, 2.54711e305, 4.30591e306, 8.33508e307, 1.08269e308, NaN])

cortner commented 6 years ago

Linux workstation:

julia> x, res = AMG.solve(AMG.ruge_stuben(A), f; maxiter=1_000, log = true)
([-0.0251337, -0.0049412, 0.0162617, -0.00112588, 0.0079593, -0.0150709, -0.0141149, -0.00532007, -0.0177025, 0.0181574  …  0.0230771, -0.0296543, -0.0207528, 0.0152366, -0.011402, -0.0131283, -0.0156696, 0.0195651, 0.0140341, 0.00516779], [6.32894, 0.0123846, 0.00049327, 2.80065e-5])

cortner commented 6 years ago

I just ran the same test on an equivalent laptop belonging to one of my postdocs, and there is no problem there. So there must be something specific about my system. Any ideas?

EDIT: she is running Julia v0.6.1

ranjanan commented 6 years ago

Well, the desired residuals show up on your Linux machine: [6.32894, 0.0123846, 0.00049327, 2.80065e-5].

But on your Mac it blows up from the second iteration onwards: [6.32894, 2.71108, 7.21397, 8.29327, 53.8207, ...].

I think the best way forward is to make a branch with useful debug statements, track the execution, and see where exactly things go wrong.

cortner commented 6 years ago

How come Travis testing is off for OS X? I know the standard script doesn't work; I can probably fix it though -- I recently did the same for all my packages.

cortner commented 6 years ago

This is getting weirder: I can reproduce the "bug" (for lack of a better word) on a clean Julia v0.6.1 on a machine identical to the one my postdoc has, where my test passes fine. I'd love to know what is going on here.

ranjanan commented 6 years ago

Mostly for convenience, actually. Travis sometimes took a long time to boot an OS X VM. I don't think OS X itself is the issue -- I develop on a Mac too, and it works fine for me.

Can you run:

Pkg.checkout("AMG", "debug")

and paste the output of both the following commands:

ruge_stuben(A)

and

norm(AMG.solve(ruge_stuben(A), f; maxiter = 1_000), Inf)

please?

The output may be a little long for the second one, but could you show it to me anyway? I'm just trying to trace where the deviation first begins. It shouldn't really happen, because all the sparse matrix-matrix and sparse matrix-vector multiplies are in pure Julia.

cortner commented 6 years ago

I'll have a go at that, and then we can consider your suggestion from your email.

To add insult to injury: I created a second user account on my laptop, used the Julia 0.6.2 downloaded from the julialang website, installed only JLD and AMG, and reran my test: it passed. So even on my own system it fails or passes depending on the user.

ChrisRackauckas commented 6 years ago

Is similar used somewhere? If so, my guess is this might be non-deterministic due to left-over "junk". You may want to change those to zeros and see if it still happens.
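
(For context, a minimal illustration of Chris's point -- generic code, not taken from AMG.jl itself: similar allocates a buffer of matching size and type but leaves its contents uninitialized.)

f = rand(5)
buf = similar(f)    # same shape/type as f, but the entries are undefined "junk"
# An algorithm that accumulates into buf without zeroing it first inherits
# whatever happened to be in memory, so its output becomes non-deterministic.
fill!(buf, 0.0)     # zeroing first (or allocating with zeros(5)) removes that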

cortner commented 6 years ago

Chris -- thanks for the suggestion; unfortunately it didn't fix it.

cortner commented 6 years ago

julia> Pkg.status("AMG")
 - AMG                           0.1.0+             debug

julia> import JLD, AMG; A, f = JLD.load("AMG_example.jld", "A", "f");

julia> AMG.ruge_stuben(A)
Multilevel Solver
-----------------
Operator Complexity: 1.096
Grid Complexity: 1.173
No. of Levels: 4
Coarse Solver: AMG.Pinv()
Level     Unknowns     NonZeros
-----     --------     --------
    1          600        27600 [91.23%]
    2           90         2610 [ 8.63%]
    3           11           41 [ 0.14%]
    4            3            3 [ 0.01%]
julia> norm(AMG.solve(AMG.ruge_stuben(A), f; maxiter = 1_000), Inf)
lvl = 1
norm(res) = 0.16982460303747857
lvl = 2
norm(res) = 0.015541058197269964
lvl = 3
norm(res) = 0.0007631747757248453
lvl = 1
norm(res) = 0.004487746791850987
lvl = 2
norm(res) = 0.000734711679690242
lvl = 3
norm(res) = 2.081855908096134e-5
lvl = 1
norm(res) = 0.00024118065143577877
lvl = 2
norm(res) = 3.861911660869666e-5
lvl = 3
norm(res) = 1.2664485637809385e-6
0.03725863138068789

I want to swear right now...

cortner commented 6 years ago

just to make sure:

julia> Pkg.status("AMG")
 - AMG                           0.1.0+             master

julia> import JLD, AMG; A, f = JLD.load("AMG_example.jld", "A", "f");
INFO: Recompiling stale cache file /Users/ortner/.julia/lib/v0.6/AMG.ji for module AMG.
INFO: Initializing AMG to use 4 threads

julia> AMG.ruge_stuben(A)
Multilevel Solver
-----------------
Operator Complexity: 1.096
Grid Complexity: 1.173
No. of Levels: 4
Coarse Solver: AMG.Pinv()
Level     Unknowns     NonZeros
-----     --------     --------
    1          600        27600 [91.23%]
    2           90         2610 [ 8.63%]
    3           11           41 [ 0.14%]
    4            3            3 [ 0.01%]

julia> norm(AMG.solve(AMG.ruge_stuben(A), f; maxiter = 1_000), Inf)
Inf

cortner commented 6 years ago

I've had similar problems in the past, where adding some completely unrelated code or print statements somehow fixed the problem. Does that suggest Chris is right about the cause?

ranjanan commented 6 years ago

Okay, so can you unset JULIA_NUM_THREADS and then run it again on master, please? That threading code was definitely buggy, and I commented it out on the "debug" branch. Were you always using threads in your code base?
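
(Note: the thread count is fixed when Julia starts, so the variable has to be unset in the shell -- e.g. unset JULIA_NUM_THREADS in bash -- before launching a fresh session. Inside Julia you should then see something like:)

julia> haskey(ENV, "JULIA_NUM_THREADS")   # the variable is gone
false

julia> Threads.nthreads()                 # and the session is single-threaded
1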

cortner commented 6 years ago

I'll do that in a second. In the meantime:

cortner commented 6 years ago

Looks like that was it:

julia> import JLD, AMG; A, f = JLD.load("AMG_example.jld", "A", "f");
INFO: Recompiling stale cache file /Users/ortner/.julia/lib/v0.6/AMG.ji for module AMG.

julia> reload("AMG"); norm(AMG.solve(AMG.ruge_stuben(A), f; maxiter = 1_000), Inf)
WARNING: Method definition length(Any) in module AMG at /Users/ortner/.julia/v0.6/AMG/src/multilevel.jl:21 overwritten in module AMG at /Users/ortner/.julia/v0.6/AMG/src/multilevel.jl:21.
WARNING: replacing module AMG.
0.03725863138068789

cortner commented 6 years ago

So it's the sparse matrix-vector multiplication that you multi-threaded? I'm glad you found this. If you still need access to a machine where the bug can be reproduced, let me know; I'll need to think about how to go about that, though.

Were you always using threads in your code base?

Do you mean in my own codes? No, I haven't started multi-threading them yet.

ranjanan commented 6 years ago

Ah, I see. I had no idea you were running with threads.

so it's the sparse matrix-vector multiplication which you multi-threaded?

Yes, I did. It was some experimental code I put in to try something out, and I didn't document it in the README. Looks like I should just remove it until I get it right, because it's too much trouble.
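
(For the record, here is a generic sketch of why naively threading a CSC matrix-vector product goes wrong -- an illustration of the failure mode, not the actual experimental code that was in the package:)

# A CSC mat-vec walks column by column and *scatters* into y. Different
# columns update the same rows of y, so threading over columns makes the
# y[...] += ... updates race, and the V-cycle error can blow up.
function threaded_csc_matvec!(y, A::SparseMatrixCSC, x)
    fill!(y, 0)
    Threads.@threads for j = 1:size(A, 2)        # UNSAFE: columns share entries of y
        for k = A.colptr[j]:(A.colptr[j + 1] - 1)
            y[A.rowval[k]] += A.nzval[k] * x[j]  # data race on more than one thread
        end
    end
    return y
end
# Safe alternatives: keep this loop serial, give each thread a private
# accumulator and reduce afterwards, or thread over rows in a CSR-style layout.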

cortner commented 6 years ago

So this was actually down to my own setting? That would also explain why it only fails on my configuration. I didn't even realise I had it set. Sorry about that.

ranjanan commented 6 years ago

No problem, happy to help. 👍 I hope you can use the package single-threaded for now. Do let me know if you face any other problems.

cortner commented 6 years ago

Thank you!