QuantumKitHub / MPSKit.jl

A Julia package dedicated to simulating quantum many-body systems using Matrix Product States (MPS)
MIT License

SVD using the divide and conquer algorithm might fail for certain tensors. #109

Closed Gertian closed 1 month ago

Gertian commented 8 months ago

Currently the default algorithm for singular value decomposition is a divide-and-conquer algorithm (LAPACK's GESDD).

From an optimization perspective this is great: for large tensors such algorithms outperform the standard QR-iteration-based method (LAPACK's GESVD). However, they are known to be less stable and prone to crashing (see e.g. https://discourse.julialang.org/t/lapackexception-1-while-svd-but-not-svdvals/23787).
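
The same trade-off is exposed at the plain-matrix level by Julia's LinearAlgebra standard library, where the algorithm can be selected explicitly; a minimal illustration (the matrix here is arbitrary):

using LinearAlgebra

A = randn(200, 100)

# divide and conquer (LAPACK gesdd): fast, but can throw a LAPACKException
F_fast = svd(A; alg = LinearAlgebra.DivideAndConquer())

# QR iteration (LAPACK gesvd): slower, but more robust
F_safe = svd(A; alg = LinearAlgebra.QRIteration())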

One specific stack trace for such a crash in MPSKit is shown in this screenshot: https://github.com/maartenvd/MPSKit.jl/assets/11559970/f465191f-b4d7-420d-8cf7-2185a37a836c. I suspect there are multiple ways to reproduce this crash.

In my particular case the issue was patched by changing projectisometric! from TensorKitManifolds.jl into:

function projectisometric!(W::AbstractTensorMap; alg=Polar())
    if alg isa TensorKit.Polar || alg isa TensorKit.SDD
        foreach(blocks(W)) do (c, b)
            # sometimes the divide-and-conquer algorithm used in _polarsdd! fails and gives LAPACK error 1;
            # in that case we retry with the standard SVD algorithm, which is slightly more expensive but more robust.
            # in case that fails I'll be a sad camper... (spoiler, it doesn't)
            b_backup = copy(b)   # GESDD may overwrite b before failing, so keep a backup
            try
                return _polarsdd!(b)
            catch
                copyto!(b, b_backup)   # restore the block in place before retrying
                return _polarsvd!(b)
            end
        end
    elseif alg isa TensorKit.SVD
        foreach(blocks(W)) do (c, b)
            return _polarsvd!(b)
        end
    elseif alg isa PolarNewton
        foreach(blocks(W)) do (c, b)
            return _polarnewton!(b)
        end
    else
        throw(ArgumentError("unknown algorithm for projectisometric!: alg = $alg"))
    end
    return W
end

so that the slower, more stable method is used whenever the initial attempt fails.

Clearly this solution is quite (one could even say very) messy and slow (since it has to make a backup of the to-be-SVD'd tensor), so there is still much room for improvement. In particular, MPSKit should offer the option to choose a different SVD decomposition by default. And perhaps there is a smarter way to implement the try/catch without needing a backup?

maartenvd commented 8 months ago

This has been a real pain in the past, so at some point I decided to switch over to the slower-but-stable method in MPSKit (at some point we should probably make this configurable on the algorithm level; you could then even provide a fallback that tries the fast method first and only falls back once it crashes). I think the crash you see is not MPSKit's fault, but rather TensorKitManifolds'! https://github.com/Jutho/TensorKitManifolds.jl/blob/7391fe01a4c60eb06ef8fcf2a3b4849bffdbce1e/src/grassmann.jl#L172

Out of curiosity, what happens if you continue with polarsvd! on a failed polarsdd! block, without the copy?

Jutho commented 8 months ago

I suggested that Gertian post this here rather than in TensorKitManifolds, as I assumed it would also affect other algorithms. I did not know that you had opted for GESVD globally throughout MPSKit.

Anyway, I think we indeed want to be able to configure this, which requires changes both here and in TensorKitManifolds.jl: there I started supporting both algorithms with the _polar... methods, but this is not carried through to the higher-level retract functions.

Jutho commented 8 months ago

The problem with the fallback strategy is that GESDD might already ruin your matrix before crashing, so you need to make a copy of the original matrix just in case.
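
A minimal sketch of such a guarded fallback at the plain-matrix level, assuming one is willing to pay for the copy (the function name and structure are illustrative, not the actual TensorKitManifolds code):

using LinearAlgebra

# try the fast divide-and-conquer SVD (GESDD) first; if LAPACK fails,
# restore the input from the backup and retry with QR iteration (GESVD)
function robust_svd!(b::AbstractMatrix)
    b_backup = copy(b)                 # GESDD may overwrite b before failing
    try
        return svd!(b; alg = LinearAlgebra.DivideAndConquer())
    catch err
        err isa LinearAlgebra.LAPACKException || rethrow()
        copyto!(b, b_backup)           # put the original data back in place
        return svd!(b; alg = LinearAlgebra.QRIteration())
    end
end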

maartenvd commented 8 months ago

I had hoped that GESDD simply fails to converge and that the data would still be salvageable. If it isn't, then the overhead of a copy, combined with the slow garbage collector, might make trying the fast method first not worth it...

I agree that we want to make it configurable. I stopped using polar decompositions because of this instability, so the only places where MPSKit still does SVDs should be the two-site schemes and changebonds; in both places the SVD algorithm can become a field in the algorithm struct. If TensorKitManifolds provides some way of specifying the polar/SVD algorithm, then fixing this crash will be a small change to the Grassmann code.
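
As a rough sketch of that idea (the struct and field names here are hypothetical, not the actual MPSKit types), such an algorithm struct could carry its SVD backend as a field, with the stable algorithm as the default:

using TensorKit

# hypothetical sketch of an algorithm struct carrying its SVD backend,
# defaulting to the stable (QR-iteration based) algorithm
Base.@kwdef struct MyTwoSiteScheme{A,T}
    trscheme::T = truncerr(1e-10)    # truncation applied after the two-site SVD
    alg_svd::A = TensorKit.SVD()     # switch to TensorKit.SDD() for speed
end

# inside a sweep the field would then simply be forwarded, e.g.
# U, S, V, = tsvd(AC2; trunc = alg.trscheme, alg = alg.alg_svd)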

Has there been any progress on OptimKit's linesearch? We might combine that update with rewriting the Grassmann code again :)

Gertian commented 8 months ago

I quickly tested the overhead, and it seems that my first bug fix is quite a bit slower than simply defaulting to the stable version.

Personally, I think it is more logical to have the stable version as the default, since end users who are less familiar with numerics might be scared away by a random crash like the one I experienced.

lkdvos commented 8 months ago

I agree with setting the defaults as stable as possible, and afterwards adding the option of using the faster version. The time gained by a faster algorithm is easily lost to dealing with crashes, especially if you have no idea what is going on or why.

maartenvd commented 8 months ago

So, plan of action: we first change TensorKitManifolds to allow passing in the SVD algorithm, and then change MPSKit to pick the stable one?

Jutho commented 8 months ago

So the retract methods of TensorKitManifolds do actually have an alg keyword, but it is unused for Grassmann and Unitary. This was supposed to be used for different retraction schemes (it is used for Stiefel).

What confuses me, though, is that I thought the retraction schemes already produce a new isometric matrix automatically, and that projectisometric!, which uses the polar decomposition, was only there to get rid of rounding errors. That would mean the matrix is already close to isometric, and therefore very well conditioned.

Of course, this also means that its singular values would all be very close to one, and maybe this is what is giving the DivideAndConquer scheme a hard time.

Gertian commented 2 months ago

We should just put U, S, V, = tsvd(Δ.Z, alg = SVD()) in https://github.com/Jutho/TensorKitManifolds.jl/blob/7391fe01a4c60eb06ef8fcf2a3b4849bffdbce1e/src/grassmann.jl#L59 to fix this issue (at least in my particular case).
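
For reference, TensorKit's tsvd takes the algorithm as a keyword argument, so the choice boils down to the following (a standalone illustration with an arbitrary tensor, not the grassmann.jl code itself):

using TensorKit

t = TensorMap(randn, ℂ^20 ⊗ ℂ^20, ℂ^20)    # an arbitrary tensor for illustration

U, S, V, = tsvd(t; alg = TensorKit.SDD())  # divide and conquer: fast, occasionally unstable
U, S, V, = tsvd(t; alg = TensorKit.SVD())  # standard SVD: slower but robust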

Gertian commented 1 month ago

I suggested that Gertian post this here rather than in TensorKitManifolds, as I assumed it would also affect other algorithms. I did not know that you had opted for GESVD globally throughout MPSKit.

Anyway, I think we indeed want to be able to configure this, which requires changes both here and in TensorKitManifolds.jl: there I started supporting both algorithms with the _polar... methods, but this is not carried through to the higher-level retract functions.

Actually, it seems you already partially started supporting it (see the attached screenshot). I'll try to complete what you started here. I'm not familiar with the entire codebase, so I'll just follow the instances of the above stack trace and make a PR. If there are more places that need changing, we can then extend the PR.

Gertian commented 1 month ago

OK, I think I made the necessary changes for this to work in https://github.com/QuantumKitHub/MPSKit.jl/tree/SVD/SDD_fix and https://github.com/Gertian/TensorKitManifolds.jl.

In TensorKitManifolds I simply made it so that the alg keyword actually does something; @Jutho had already started this but didn't finish it.

In MPSKit, I added an additional field svd_arg to the GradientGrassmann struct, which can be either SVD or SDD (I didn't restrict it to only these two options, because I'm not sure what the existing machinery for this is). This argument is then passed on to TensorKitManifolds whenever an SVD is performed.
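
A usage sketch of what this could look like from the user side (the keyword name follows the description above; the exact constructor interface in the branch may differ):

using MPSKit, TensorKit

# hypothetical sketch: pick the robust SVD algorithm for the Grassmann gradient optimization
alg = GradientGrassmann(; svd_arg = TensorKit.SVD(), tol = 1e-8, maxiter = 200)

# ψ₀ and H would be an initial MPS and a Hamiltonian:
# ψ, envs, δ = find_groundstate(ψ₀, H, alg)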

lkdvos commented 1 month ago

So, in conclusion: the new approach sets the default algorithm to SVD (stable but slower) already in TensorKitManifolds, so that nothing has to change on the MPSKit side.