JuliaLinearAlgebra / libblastrampoline

Using PLT trampolines to provide a BLAS and LAPACK demuxing library.
MIT License
66 stars 17 forks source link

Add support for ILP64 Accelerate #113

Closed staticfloat closed 1 year ago

staticfloat commented 1 year ago

The new Accelerate released in macOS v13.3 provides two new interfaces; an upgraded LAPACK for the LP64 interface, and an all-new ILP64 interface (that uses the same upgraded LAPACK). These symbols are available from Accelerate with the suffix $NEWLAPACK and $NEWLAPACK$ILP64, respectively.

Unfortunately, this is not a "true" suffix, as Apple has decided to drop the trailing underscore from the typical F77 names, meaning that a symbol such as dgemm_ gets mangled to dgemm$NEWLAPACK, whereas a CBLAS symbol such as cblas_zdotc_sub gets mangled to cblas_zdotc_sub$NEWLAPACK. This means that we need to selectively erase the trailing underscore from some symbols when applying this Accelerate suffix.

To do this, we add a new feature, enabled by default only on Apple builds, called SYMBOL_TRIMMING, which allows a suffix_hint to contain the ASCII "substitution character" 0x1a as the first character of the suffix hint to mean "remove a trailing underscore when applying this suffix".

To make dealing with suffix hints easier for command-line users, these suffix hints are available for use in LBT_BACKING_LIBS by listing libraries separated by suffix hints with an exclamation point, e.g. libname!suffix.

staticfloat commented 1 year ago

To test this branch out on your local Julia, run the following script:

JULIA_PREFIX=/path/to/julia
# Back up normal LBT
for f in ${JULIA_PREFIX}/lib/julia/libblastrampoline*; do
    cp -v ${f} ${f}.backup
done

# Build new LBT
cd src
make

# Install new LBT into Julia's lib/julia directory
cp build/libblastrampoline.5.dylib ${JULIA_PREFIX}/lib/julia/libblastrampoline.5.dylib
cp build/libblastrampoline.5.dylib ${JULIA_PREFIX}/lib/julia/libblastrampoline.dylib

Then, when you launch Julia, load Accelerate via:

using LinearAlgebra

# Load LP64 interface first
BLAS.lbt_forward("/System/Library/Frameworks/Accelerate.framework/Accelerate"; suffix_hint="\x1a\$NEWLAPACK", verbose=true, clear=true)

# Load ILP64 interface next
BLAS.lbt_forward("/System/Library/Frameworks/Accelerate.framework/Accelerate"; suffix_hint="\x1a\$NEWLAPACK\$ILP64", verbose=true)

# Ensure that an ILP64 interface was actually loaded:
config = BLAS.get_config()
if !any(lib.interface == :ilp64 for lib in config.loaded_libs)
    @error("No ILP64 interfaces found; are you sure you're on macOS 13.3 or higher?!")
end

# Show LBT config:
display(BLAS.get_config())
ViralBShah commented 1 year ago

Verified that this works for me, and for peakflops() gives twice the flop rate.

staticfloat commented 1 year ago

Can someone try running the LinearAlgebra tests with Accelerate loaded?

ViralBShah commented 1 year ago

Results of the LinearAlgebra tests. Looks pretty good. The get_num_threads needs adjusting for Accelerate, and very few tests are failing.

➜  julia git:(vs/accelerate64) ✗ ./julia test/runtests.jl LinearAlgebra         
Running parallel tests with:
  nworkers() = 8
  nthreads() = 1
  Sys.CPU_THREADS = 8
  Sys.total_memory() = 32.000 GiB
  Sys.free_memory() = 12.930 GiB

Test                          (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB)
LinearAlgebra/matmul               (4) |        started at 2023-04-01T23:41:15.776
LinearAlgebra/diagonal             (7) |        started at 2023-04-01T23:41:15.820
LinearAlgebra/addmul               (2) |        started at 2023-04-01T23:41:15.861
LinearAlgebra/special              (8) |        started at 2023-04-01T23:41:15.861
LinearAlgebra/dense                (5) |        started at 2023-04-01T23:41:15.862
LinearAlgebra/triangular           (3) |        started at 2023-04-01T23:41:15.862
LinearAlgebra/symmetric            (6) |        started at 2023-04-01T23:41:15.862
LinearAlgebra/bidiag               (9) |        started at 2023-04-01T23:41:15.862
LinearAlgebra/special              (8) |    83.84 |   2.54 |  3.0 |   13877.34 |  1423.22
LinearAlgebra/qr                   (8) |        started at 2023-04-01T23:42:39.839
LinearAlgebra/bidiag               (9) |    90.00 |   2.85 |  3.2 |   13317.18 |  1635.91
LinearAlgebra/cholesky             (9) |        started at 2023-04-01T23:42:45.923
LinearAlgebra/dense                (5) |   115.04 |   4.13 |  3.6 |   17711.68 |  1667.75
LinearAlgebra/blas                 (5) |        started at 2023-04-01T23:43:10.965
LinearAlgebra/diagonal             (7) |   119.24 |   4.46 |  3.7 |   18937.22 |  1695.06
LinearAlgebra/lu                   (7) |        started at 2023-04-01T23:43:15.170
LinearAlgebra/qr                   (8) |    42.00 |   1.19 |  2.8 |    6571.17 |  2135.47
LinearAlgebra/uniformscaling       (8) |        started at 2023-04-01T23:43:21.846
LinearAlgebra/cholesky             (9) |         failed at 2023-04-01T23:43:26.173
Test Failed at /Users/viral/julia/usr/share/julia/stdlib/v1.10/LinearAlgebra/test/cholesky.jl:149
  Expression: all(diff(diag(real(cpapd.factors))) .<= 0.0)

Test Failed at /Users/viral/julia/usr/share/julia/stdlib/v1.10/LinearAlgebra/test/cholesky.jl:149
  Expression: all(diff(diag(real(cpapd.factors))) .<= 0.0)

Test Failed at /Users/viral/julia/usr/share/julia/stdlib/v1.10/LinearAlgebra/test/cholesky.jl:149
  Expression: all(diff(diag(real(cpapd.factors))) .<= 0.0)

Test Failed at /Users/viral/julia/usr/share/julia/stdlib/v1.10/LinearAlgebra/test/cholesky.jl:149
  Expression: all(diff(diag(real(cpapd.factors))) .<= 0.0)

Test Failed at /Users/viral/julia/usr/share/julia/stdlib/v1.10/LinearAlgebra/test/cholesky.jl:149
  Expression: all(diff(diag(real(cpapd.factors))) .<= 0.0)

LinearAlgebra/structuredbroadcast (10) |        started at 2023-04-01T23:43:28.313
LinearAlgebra/blas                 (5) |         failed at 2023-04-01T23:43:30.921
Test Failed at /Users/viral/julia/usr/share/julia/stdlib/v1.10/LinearAlgebra/test/blas.jl:678
  Expression: default > 0
   Evaluated: 0 > 0

Test Failed at /Users/viral/julia/usr/share/julia/stdlib/v1.10/LinearAlgebra/test/blas.jl:680
  Expression: BLAS.get_num_threads() === 1
   Evaluated: 0 === 1

LinearAlgebra/symmetric            (6) |   136.01 |   3.62 |  2.7 |   19231.91 |  1740.31
LinearAlgebra/hessenberg           (6) |        started at 2023-04-01T23:43:31.953
LinearAlgebra/svd                 (11) |        started at 2023-04-01T23:43:32.994
LinearAlgebra/matmul               (4) |   140.62 |   4.53 |  3.2 |   21742.65 |  1389.61
LinearAlgebra/eigen                (4) |        started at 2023-04-01T23:43:36.533
LinearAlgebra/structuredbroadcast (10) |    17.92 |   0.77 |  4.3 |    3234.67 |   613.28
LinearAlgebra/tridiag             (10) |        started at 2023-04-01T23:43:46.360
LinearAlgebra/uniformscaling       (8) |    25.05 |   1.04 |  4.2 |    3629.15 |  2428.69
LinearAlgebra/lapack               (8) |        started at 2023-04-01T23:43:46.899
LinearAlgebra/hessenberg           (6) |    16.74 |   0.53 |  3.2 |    2489.17 |  1923.83
LinearAlgebra/lq                   (6) |        started at 2023-04-01T23:43:48.694
LinearAlgebra/lu                   (7) |    41.35 |   1.17 |  2.8 |    6116.89 |  2578.94
LinearAlgebra/adjtrans             (7) |        started at 2023-04-01T23:43:56.532
LinearAlgebra/lapack               (8) |    13.14 |   0.34 |  2.6 |    1572.33 |  2486.33
LinearAlgebra/generic              (8) |        started at 2023-04-01T23:44:00.046
LinearAlgebra/lq                   (6) |    16.15 |   0.49 |  3.0 |    2499.47 |  2104.75
LinearAlgebra/schur                (6) |        started at 2023-04-01T23:44:04.843
LinearAlgebra/eigen                (4) |    28.47 |   1.04 |  3.7 |    4258.50 |  1654.45
LinearAlgebra/bunchkaufman         (4) |        started at 2023-04-01T23:44:05.008
LinearAlgebra/svd                 (11) |    32.77 |   1.21 |  3.7 |    5443.56 |   807.22
LinearAlgebra/givens              (11) |        started at 2023-04-01T23:44:05.893
LinearAlgebra/adjtrans             (7) |    11.40 |   0.70 |  6.1 |    2016.61 |  2931.66
LinearAlgebra/pinv                 (7) |        started at 2023-04-01T23:44:07.945
LinearAlgebra/givens              (11) |     3.25 |   0.15 |  4.5 |     480.30 |  1011.66
LinearAlgebra/factorization       (11) |        started at 2023-04-01T23:44:09.141
LinearAlgebra/factorization       (11) |     1.58 |   0.03 |  1.9 |     250.17 |  1033.64
LinearAlgebra/abstractq           (11) |        started at 2023-04-01T23:44:10.736
LinearAlgebra/pinv                 (7) |     3.67 |   0.24 |  6.4 |     817.49 |  3158.14
LinearAlgebra/ldlt                 (7) |        started at 2023-04-01T23:44:11.615
LinearAlgebra/ldlt                 (7) |     0.31 |   0.00 |  0.0 |      47.32 |  3160.14
LinearAlgebra/abstractq           (11) |         failed at 2023-04-01T23:44:12.844
Test Failed at /Users/viral/julia/usr/share/julia/stdlib/v1.10/LinearAlgebra/test/abstractq.jl:59
  Expression: Q[:, 1:3] == Q.Q[:, 1:3] == (Matrix(Q))[:, 1:3]
   Evaluated: [-0.35953221510956435 0.1965454404197131 0.519592431137261; -0.22582192557737624 -0.6725058409923694 0.6076020971223726; … ; -0.09448452509982379 -0.655284066450569 -0.5071271232591662; -0.5324407150637704 0.28215697975293297 -0.10211308565740215] == [-0.35953221510956435 0.1965454404197131 0.519592431137261; -0.22582192557737624 -0.6725058409923694 0.6076020971223726; … ; -0.09448452509982379 -0.655284066450569 -0.5071271232591662; -0.5324407150637704 0.28215697975293297 -0.10211308565740215] == [-0.35953221510956435 0.1965454404197131 0.5195924311372611; -0.22582192557737624 -0.6725058409923694 0.6076020971223726; … ; -0.09448452509982379 -0.655284066450569 -0.5071271232591662; -0.5324407150637704 0.28215697975293297 -0.10211308565740214]

Test Failed at /Users/viral/julia/usr/share/julia/stdlib/v1.10/LinearAlgebra/test/abstractq.jl:59
  Expression: Q[:, 1:3] == Q.Q[:, 1:3] == (Matrix(Q))[:, 1:3]
   Evaluated: ComplexF64[-0.49837132037257037 - 0.4782120069360776im -0.150972588569905 - 0.28336417274098175im 0.11710459089697636 + 0.24599737894264181im; -0.5297065808419519 - 0.2180434544860795im -0.258256476833034 + 0.09563123480722824im -0.269175148885196 - 0.16302309774541907im; … ; -0.27303350965655987 - 0.2065902594255449im 0.27335708885184995 + 0.3499871631428588im 0.4250421451010471 - 0.06661317167629388im; -0.05787258293172755 - 0.2426695859781627im 0.5018747549963603 + 0.4058519997055557im -0.016960609747618538 - 0.398499749909367im] == ComplexF64[-0.49837132037257037 - 0.4782120069360776im -0.150972588569905 - 0.28336417274098175im 0.11710459089697636 + 0.24599737894264181im; -0.5297065808419519 - 0.2180434544860795im -0.258256476833034 + 0.09563123480722824im -0.269175148885196 - 0.16302309774541907im; … ; -0.27303350965655987 - 0.2065902594255449im 0.27335708885184995 + 0.3499871631428588im 0.4250421451010471 - 0.06661317167629388im; -0.05787258293172755 - 0.2426695859781627im 0.5018747549963603 + 0.4058519997055557im -0.016960609747618538 - 0.398499749909367im] == ComplexF64[-0.49837132037257037 - 0.4782120069360776im -0.150972588569905 - 0.28336417274098175im 0.11710459089697636 + 0.24599737894264181im; -0.5297065808419519 - 0.2180434544860795im -0.258256476833034 + 0.09563123480722824im -0.269175148885196 - 0.16302309774541907im; … ; -0.27303350965655987 - 0.2065902594255449im 0.27335708885184995 + 0.3499871631428588im 0.4250421451010471 - 0.0666131716762939im; -0.05787258293172755 - 0.2426695859781627im 0.5018747549963603 + 0.4058519997055557im -0.016960609747618538 - 0.3984997499093669im]

LinearAlgebra/tridiag             (10) |    28.33 |   0.88 |  3.1 |    4643.44 |  1031.50
LinearAlgebra/generic              (8) |    17.30 |   0.60 |  3.5 |    2544.61 |  2719.42
LinearAlgebra/bunchkaufman         (4) |    12.82 |   0.54 |  4.2 |    1991.96 |  1759.48
LinearAlgebra/addmul               (2) |   191.61 |   4.16 |  2.2 |   22170.09 |  1365.72
LinearAlgebra/triangular           (3) |   220.06 |   7.61 |  3.5 |   33463.90 |  3482.58
LinearAlgebra/schur                (6) |    65.27 |   0.25 |  0.4 |    1307.90 |  2300.78

Test Summary:                       |  Pass  Fail  Broken  Total     Time
  Overall                           | 97516     9      17  97542  3m55.4s
    LinearAlgebra/special           |  3675                 3675  1m24.7s
    LinearAlgebra/bidiag            |  4828                 4828  1m30.8s
    LinearAlgebra/dense             |  8561                 8561  1m55.9s
    LinearAlgebra/diagonal          |  2945                 2945  2m00.1s
    LinearAlgebra/qr                |  4782                 4782    42.0s
    LinearAlgebra/cholesky          |  2506     5           2511    40.3s
    LinearAlgebra/blas              |  1628     2           1630    20.0s
    LinearAlgebra/symmetric         |  3000                 3000  2m16.8s
    LinearAlgebra/matmul            |  1496                 1496  2m21.4s
    LinearAlgebra/structuredbroadcast |   672                  672    18.4s
    LinearAlgebra/uniformscaling    |   446                  446    25.0s
    LinearAlgebra/hessenberg        |   667                  667    16.7s
    LinearAlgebra/lu                |  1380                 1380    41.4s
    LinearAlgebra/lapack            |   818                  818    13.2s
    LinearAlgebra/lq                |  2898                 2898    16.1s
    LinearAlgebra/eigen             |   951                  951    28.5s
    LinearAlgebra/svd               |   606                  606    33.2s
    LinearAlgebra/adjtrans          |   479                  479    11.4s
    LinearAlgebra/givens            |  1991                 1991     3.3s
    LinearAlgebra/factorization     |    85            16    101     1.6s
    LinearAlgebra/pinv              |   500                  500     3.7s
    LinearAlgebra/ldlt              |     8                    8     0.3s
    LinearAlgebra/abstractq         |   104     2            106     2.1s
    LinearAlgebra/tridiag           |  1621                 1621    28.4s
    LinearAlgebra/generic           |   720             1    721    17.3s
    LinearAlgebra/bunchkaufman      |  5689                 5689    12.8s
    LinearAlgebra/addmul            |  5688                 5688  3m12.5s
    LinearAlgebra/triangular        | 38272                38272  3m40.9s
    LinearAlgebra/schur             |   500                  500  1m05.3s
    FAILURE

The global RNG seed was 0x2d0c35f148fbf575674e2e79e883fe44.

Error in testset LinearAlgebra/cholesky:
Test Failed at /Users/viral/julia/usr/share/julia/stdlib/v1.10/LinearAlgebra/test/cholesky.jl:149
  Expression: all(diff(diag(real(cpapd.factors))) .<= 0.0)

Error in testset LinearAlgebra/cholesky:
Test Failed at /Users/viral/julia/usr/share/julia/stdlib/v1.10/LinearAlgebra/test/cholesky.jl:149
  Expression: all(diff(diag(real(cpapd.factors))) .<= 0.0)

Error in testset LinearAlgebra/cholesky:
Test Failed at /Users/viral/julia/usr/share/julia/stdlib/v1.10/LinearAlgebra/test/cholesky.jl:149
  Expression: all(diff(diag(real(cpapd.factors))) .<= 0.0)

Error in testset LinearAlgebra/cholesky:
Test Failed at /Users/viral/julia/usr/share/julia/stdlib/v1.10/LinearAlgebra/test/cholesky.jl:149
  Expression: all(diff(diag(real(cpapd.factors))) .<= 0.0)

Error in testset LinearAlgebra/cholesky:
Test Failed at /Users/viral/julia/usr/share/julia/stdlib/v1.10/LinearAlgebra/test/cholesky.jl:149
  Expression: all(diff(diag(real(cpapd.factors))) .<= 0.0)

Error in testset LinearAlgebra/blas:
Test Failed at /Users/viral/julia/usr/share/julia/stdlib/v1.10/LinearAlgebra/test/blas.jl:678
  Expression: default > 0
   Evaluated: 0 > 0

Error in testset LinearAlgebra/blas:
Test Failed at /Users/viral/julia/usr/share/julia/stdlib/v1.10/LinearAlgebra/test/blas.jl:680
  Expression: BLAS.get_num_threads() === 1
   Evaluated: 0 === 1

Error in testset LinearAlgebra/abstractq:
Test Failed at /Users/viral/julia/usr/share/julia/stdlib/v1.10/LinearAlgebra/test/abstractq.jl:59
  Expression: Q[:, 1:3] == Q.Q[:, 1:3] == (Matrix(Q))[:, 1:3]
   Evaluated: [-0.35953221510956435 0.1965454404197131 0.519592431137261; -0.22582192557737624 -0.6725058409923694 0.6076020971223726; … ; -0.09448452509982379 -0.655284066450569 -0.5071271232591662; -0.5324407150637704 0.28215697975293297 -0.10211308565740215] == [-0.35953221510956435 0.1965454404197131 0.519592431137261; -0.22582192557737624 -0.6725058409923694 0.6076020971223726; … ; -0.09448452509982379 -0.655284066450569 -0.5071271232591662; -0.5324407150637704 0.28215697975293297 -0.10211308565740215] == [-0.35953221510956435 0.1965454404197131 0.5195924311372611; -0.22582192557737624 -0.6725058409923694 0.6076020971223726; … ; -0.09448452509982379 -0.655284066450569 -0.5071271232591662; -0.5324407150637704 0.28215697975293297 -0.10211308565740214]

Error in testset LinearAlgebra/abstractq:
Test Failed at /Users/viral/julia/usr/share/julia/stdlib/v1.10/LinearAlgebra/test/abstractq.jl:59
  Expression: Q[:, 1:3] == Q.Q[:, 1:3] == (Matrix(Q))[:, 1:3]
   Evaluated: ComplexF64[-0.49837132037257037 - 0.4782120069360776im -0.150972588569905 - 0.28336417274098175im 0.11710459089697636 + 0.24599737894264181im; -0.5297065808419519 - 0.2180434544860795im -0.258256476833034 + 0.09563123480722824im -0.269175148885196 - 0.16302309774541907im; … ; -0.27303350965655987 - 0.2065902594255449im 0.27335708885184995 + 0.3499871631428588im 0.4250421451010471 - 0.06661317167629388im; -0.05787258293172755 - 0.2426695859781627im 0.5018747549963603 + 0.4058519997055557im -0.016960609747618538 - 0.398499749909367im] == ComplexF64[-0.49837132037257037 - 0.4782120069360776im -0.150972588569905 - 0.28336417274098175im 0.11710459089697636 + 0.24599737894264181im; -0.5297065808419519 - 0.2180434544860795im -0.258256476833034 + 0.09563123480722824im -0.269175148885196 - 0.16302309774541907im; … ; -0.27303350965655987 - 0.2065902594255449im 0.27335708885184995 + 0.3499871631428588im 0.4250421451010471 - 0.06661317167629388im; -0.05787258293172755 - 0.2426695859781627im 0.5018747549963603 + 0.4058519997055557im -0.016960609747618538 - 0.398499749909367im] == ComplexF64[-0.49837132037257037 - 0.4782120069360776im -0.150972588569905 - 0.28336417274098175im 0.11710459089697636 + 0.24599737894264181im; -0.5297065808419519 - 0.2180434544860795im -0.258256476833034 + 0.09563123480722824im -0.269175148885196 - 0.16302309774541907im; … ; -0.27303350965655987 - 0.2065902594255449im 0.27335708885184995 + 0.3499871631428588im 0.4250421451010471 - 0.0666131716762939im; -0.05787258293172755 - 0.2426695859781627im 0.5018747549963603 + 0.4058519997055557im -0.016960609747618538 - 0.3984997499093669im]

ERROR: LoadError: Test run finished with errors
in expression starting at /Users/viral/julia/test/runtests.jl:93
ViralBShah commented 1 year ago

In the QR tests, it is a difference on the order of eps.

julia> Q[:,1:3] - Matrix(Q)[:,1:3]
5×3 Matrix{Float64}:
 0.0  0.0  -5.55112e-17
 0.0  0.0  -1.38778e-17
 0.0  0.0   0.0
 0.0  0.0   0.0
 0.0  0.0   0.0
ViralBShah commented 1 year ago

The pivoted cholesky failure is reproduced as follows:

julia> using LinearAlgebra, Test

julia> apdh = Float32[3.4291134 -0.61112815 0.8155207 -0.9183448 -0.61426914 1.021518 -0.07258626 -1.9560734 -1.6641214 0.4579194; -0.61112815 3.1250076 -0.44400603 0.97749346 -0.20464422 0.4505959 -1.1918031 -1.0662653 -0.5070573 -0.06222758; 0.8155207 -0.44400603 0.5413656 0.53000593 -0.4061885 0.028969476 -0.45150727 -0.00913016 -0.61976415 -0.17616473; -0.9183448 0.97749346 0.53000593 5.108155 -0.44426316 -0.5661762 -1.5497217 0.012556124 -1.0134082 0.52672774; -0.61426914 -0.20464422 -0.4061885 -0.44426316 2.1013746 -0.0008714596 1.9995844 -0.80452967 0.81173474 1.0536224; 1.021518 0.4505959 0.028969476 -0.5661762 -0.0008714596 2.52233 0.46488014 -0.90563935 -0.44942248 -0.03283302; -0.07258626 -1.1918031 -0.45150727 -1.5497217 1.9995844 0.46488014 3.378583 -0.98002505 0.58866936 0.6275643; -1.9560734 -1.0662653 -0.00913016 0.012556124 -0.80452967 -0.90563935 -0.98002505 3.117008 1.2102712 -1.4144715; -1.6641214 -0.5070573 -0.61976415 -1.0134082 0.81173474 -0.44942248 0.58866936 1.2102712 1.9170706 0.44138947; 0.4579194 -0.06222758 -0.17616473 0.52672774 1.0536224 -0.03283302 0.6275643 -1.4144715 0.44138947 2.4111974]
10×10 Matrix{Float32}:
  3.42911    -0.611128    0.815521    -0.918345   -0.614269     1.02152     -0.0725863  -1.95607     -1.66412    0.457919
 -0.611128    3.12501    -0.444006     0.977493   -0.204644     0.450596    -1.1918     -1.06627     -0.507057  -0.0622276
  0.815521   -0.444006    0.541366     0.530006   -0.406188     0.0289695   -0.451507   -0.00913016  -0.619764  -0.176165
 -0.918345    0.977493    0.530006     5.10815    -0.444263    -0.566176    -1.54972     0.0125561   -1.01341    0.526728
 -0.614269   -0.204644   -0.406188    -0.444263    2.10137     -0.00087146   1.99958    -0.80453      0.811735   1.05362
  1.02152     0.450596    0.0289695   -0.566176   -0.00087146   2.52233      0.46488    -0.905639    -0.449422  -0.032833
 -0.0725863  -1.1918     -0.451507    -1.54972     1.99958      0.46488      3.37858    -0.980025     0.588669   0.627564
 -1.95607    -1.06627    -0.00913016   0.0125561  -0.80453     -0.905639    -0.980025    3.11701      1.21027   -1.41447
 -1.66412    -0.507057   -0.619764    -1.01341     0.811735    -0.449422     0.588669    1.21027      1.91707    0.441389
  0.457919   -0.0622276  -0.176165     0.526728    1.05362     -0.032833     0.627564   -1.41447      0.441389   2.4112

julia> cpapd = cholesky(apdh, RowMaximum())
CholeskyPivoted{Float32, Matrix{Float32}, Vector{Int64}}
U factor with rank 10:
10×10 UpperTriangular{Float32, Matrix{Float32}}:
 2.26012  -0.406325   0.432496  -0.68568   -0.250507  -0.196566  -0.448386    0.0055555   0.234503     0.233053
  ⋅        1.80666   -0.240994  -0.19439    0.509079  -0.384212  -1.02195    -1.08145     0.504138     0.305877
  ⋅         ⋅         1.69702   -0.555147   0.401659  -0.125056  -0.329646   -0.78331    -0.24981     -0.052626
  ⋅         ⋅          ⋅         1.60077    0.384224   1.07492   -0.0627441  -1.01282    -0.207023     0.510761
  ⋅         ⋅          ⋅          ⋅         1.3753    -0.158      0.0836318   0.254541    0.00796201  -0.22197
  ⋅         ⋅          ⋅          ⋅          ⋅         0.847978   0.436789   -0.221684    0.0308417    0.738552
  ⋅         ⋅          ⋅          ⋅          ⋅          ⋅         0.601048   -0.230679   -0.181164     0.946928
  ⋅         ⋅          ⋅          ⋅          ⋅          ⋅          ⋅          0.375373    0.24624     -0.453482
  ⋅         ⋅          ⋅          ⋅          ⋅          ⋅          ⋅           ⋅          0.180238    -0.167688
  ⋅         ⋅          ⋅          ⋅          ⋅          ⋅          ⋅           ⋅           ⋅           0.523937
permutation:
10-element Vector{Int64}:
  4
  1
  2
  7
  6
  5
  9
  8
  3
 10

julia> @test all(diff(diag(real(cpapd.factors))).<=0.) # diagonal should be non-increasing
Test Failed at REPL[110]:1
  Expression: all(diff(diag(real(cpapd.factors))) .<= 0.0)

ERROR: There was an error during testing

Any ideas? @amontoison @dkarrasch

ViralBShah commented 1 year ago

@staticfloat I believe it is fine to merge this PR.

amontoison commented 1 year ago

Any ideas? @amontoison @dkarrasch

I highly suspect a bug in the new BLAS / LAPACK routines of Apple Accelerate. I have a different result with OpenBLAS and Intel MKL.

julia> cpapd = cholesky(apdh, RowMaximum())
CholeskyPivoted{Float32, Matrix{Float32}, Vector{Int64}}
U factor with rank 10:
10×10 UpperTriangular{Float32, Matrix{Float32}}:
 2.26012  -0.406325   0.432496  -0.68568    0.233053  -0.250507  -0.196566   -0.448386    0.0055555   0.234503
  ⋅        1.80666   -0.240994  -0.19439    0.305877   0.509079  -0.384212   -1.02195    -1.08145     0.504138
  ⋅         ⋅         1.69702   -0.555147  -0.052626   0.401659  -0.125056   -0.329646   -0.78331    -0.24981
  ⋅         ⋅          ⋅         1.60077    0.510761   0.384224   1.07492    -0.0627441  -1.01282    -0.207023
  ⋅         ⋅          ⋅          ⋅         1.4141    -0.21588    0.467681    0.617479   -0.430583   -0.206794
  ⋅         ⋅          ⋅          ⋅          ⋅         1.35825   -0.0856505   0.182823    0.1893     -0.0248058
  ⋅         ⋅          ⋅          ⋅          ⋅          ⋅         0.7197      0.116784   -0.014745    0.166019
  ⋅         ⋅          ⋅          ⋅          ⋅          ⋅          ⋅          0.361517    0.0519991   0.0500296
  ⋅         ⋅          ⋅          ⋅          ⋅          ⋅          ⋅           ⋅          0.289647    0.155059
  ⋅         ⋅          ⋅          ⋅          ⋅          ⋅          ⋅           ⋅           ⋅          0.17166
permutation:
10-element Vector{Int64}:
  4
  1
  2
  7
 10
  6
  5
  9
  8
  3
ViralBShah commented 1 year ago

@amontoison That is what I observed too with openblas, and I was just wondering if we are doing something non-standard or making assumptions in the way we call those routines right now. Great idea to check with MKL, because that suggests that it really is a bug.

staticfloat commented 1 year ago

The one thing I want to do before merging this PR is to provide sane fallback behavior for the set/get threads API when the underlying library has no concept of threads. My plan is to make set threads do nothing, and get threads always return 1.

ViralBShah commented 1 year ago

Does someone here know what LAPACK call is the culprit for the wrong pivoted cholesky results? Is it xPSTRF?

amontoison commented 1 year ago

Yes, it seems that it's pstrf.

ViralBShah commented 1 year ago

Do the Intel macs also have the updated ILP64 BLAS and LAPACK?

ctkelley commented 1 year ago

Yes. @staticfloat 's instructions worked fine on my 1999 Intel Mac.

staticfloat commented 1 year ago

I'm reporting the dpstrf bug to Apple filed as FB 12098971, easy reproducer available here: https://gist.github.com/staticfloat/494272e86b44a3e2d31de898473b3c64

ViralBShah commented 1 year ago

I suspect that the old LAPACK release that Apple used to ship (LAPACK 3.1 IIRC) didn't even have dpstrf. So may be ok to not worry about it.

I assume the plan for now should be for LBT to support all the new stuff, but Julia to default to openblas, with users opting into using the new capabilities through AppleAccelerate.jl (like we do MKL.jl).

staticfloat commented 1 year ago

I suspect that the old LAPACK release that Apple used to ship (LAPACK 3.1 IIRC) didn't even have dpstrf. So may be ok to not worry about it.

The old LAPACK release did have dsptrf_, and it calculated it correctly. So it appears to be a regression in the new LAPACK.

I assume the plan for now should be for LBT to support all the new stuff, but Julia to default to openblas, with users opting into using the new capabilities through AppleAccelerate.jl (like we do MKL.jl).

Yes. I think AppleAccelerate.jl should incorporate some LBT invocations similar to what I posted in this thread, so that it can load the ILP64 and new LP64-LAPACK bindings instead of the old ones. But that can only happen after this new release is cut.

amontoison commented 1 year ago

Will we have the release 5.6.0 of LBT with Julia 1.9?

ViralBShah commented 1 year ago

We should open the PR on Julia master and mark it for backporting. Even if not 1.9, I think we can have it in 1.9.1.

staticfloat commented 1 year ago

Apple got back to me; they have identified the issue and are trying to get a fix into a future macOS version.

amontoison commented 1 year ago

We should open the PR on Julia master and mark it for backporting. Even if not 1.9, I think we can have it in 1.9.1.

Thanks @ViralBShah! I'm working with the HSL team to release of JuliaHSL package with a precompiled HSL_jll inside (we can't precompile it with Yggdrasil because of the academic license). It convinces us to precompile HSL_jll with LBT for Julia ≥ 1.9.

ViralBShah commented 1 year ago

Wouldn't the licensing issue be the same if inside HSL.jl vs BB? Perhaps the right thing is for them to serve the JLL through a separate registry.