ITensor / NDTensors.jl

A Julia package for n-dimensional sparse tensors.
Apache License 2.0

Use dictionary of keys sparse storage #54

Closed mtfishman closed 3 years ago

mtfishman commented 3 years ago

This PR is a work in progress for changing the block-offset list from a sorted Vector to a Dict, similar to https://github.com/ITensor/ITensor/pull/336. The goal is to speed up a variety of block sparse operations, since a Dict has average O(1) lookup, compared to the O(log(N)) lookup of a sorted Vector of N block offsets. This was shown to make block sparse operations in the C++ version of ITensor about 10% faster, and we should probably just merge that PR.
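As a rough illustration of the design change (a hypothetical sketch with made-up types, not the actual NDTensors.jl data structures), the lookup goes from a binary search over a sorted Vector of block-offset pairs to a direct Dict access:

```julia
# Illustrative sketch only: hypothetical `SortedBlockOffsets` and
# `DictBlockOffsets` types, not the NDTensors.jl API. A "block" is
# identified by a tuple of block indices; the offset is the position of
# that block's data within the flat storage vector.

const Block = NTuple{2,Int}

# Old design: sorted Vector of (block => offset) pairs, O(log N) lookup.
struct SortedBlockOffsets
    pairs::Vector{Pair{Block,Int}}  # kept sorted by block
end

function offset(bo::SortedBlockOffsets, b::Block)
    # Binary search on the block key (the value in `b => 0` is ignored).
    r = searchsorted(bo.pairs, b => 0; by=first)
    return isempty(r) ? nothing : last(bo.pairs[first(r)])
end

# New design: Dict from block to offset, O(1) average-case lookup.
struct DictBlockOffsets
    offsets::Dict{Block,Int}
end

offset(bo::DictBlockOffsets, b::Block) = get(bo.offsets, b, nothing)

blocks = [(1, 1) => 0, (2, 2) => 16, (3, 3) => 48]
sorted_bo = SortedBlockOffsets(sort(blocks; by=first))
dict_bo = DictBlockOffsets(Dict(blocks))

@assert offset(sorted_bo, (2, 2)) == offset(dict_bo, (2, 2)) == 16
@assert offset(sorted_bo, (1, 2)) === offset(dict_bo, (1, 2)) === nothing
```

Both lookups return the same offsets; the Dict version simply avoids the binary search on every block access.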

Most tests are passing; there are a few more bugs to work out, but overall this was a pretty easy change to make. I had tried to design NDTensors to be pretty agnostic about the storage, but there were certain assumptions baked into the code about using a Vector, like linear indexing. However, overall the code has become simpler, and hopefully it will be faster as well!

emstoudenmire commented 3 years ago

I didn't know about Dictionaries.jl. Sounds like a good thing to use here.

mtfishman commented 3 years ago

Alright, this PR is ready to go. All tests are passing with the new dictionary block-offset design.

Here is a timing (using 4 BLAS threads) for a single sweep of the 1D S=1 Heisenberg model for N=100 sites:

After sweep 1 energy=-138.940086143521 maxlinkdim=600 time=37.533

while on master:

After sweep 1 energy=-138.940086143520 maxlinkdim=600 time=41.112

Not a huge speedup (around 9%), though I wasn't expecting anything major for this example since it was already mostly dominated by permutations, BLAS, and LAPACK.

Here are timings for hybrid real-space/momentum-space Hubbard model calculations on a 6 x 3 cylinder (also with 4 BLAS threads):

After sweep 1 energy=-5.839513403967 maxlinkdim=91 time=0.948
After sweep 2 energy=-12.435902968857 maxlinkdim=200 time=8.676
After sweep 3 energy=-13.313706521961 maxlinkdim=400 time=16.520
After sweep 4 energy=-13.520903592397 maxlinkdim=800 time=13.219
After sweep 5 energy=-13.633916035656 maxlinkdim=2000 time=20.778
After sweep 6 energy=-13.693184708837 maxlinkdim=2999 time=30.224
After sweep 7 energy=-13.711966030239 maxlinkdim=3000 time=34.462
After sweep 8 energy=-13.715767489572 maxlinkdim=3000 time=36.361
After sweep 9 energy=-13.716428672323 maxlinkdim=3000 time=36.913
After sweep 10 energy=-13.716542905483 maxlinkdim=2999 time=35.022

while on master:

After sweep 1 energy=-5.839513403963 maxlinkdim=91 time=1.174
After sweep 2 energy=-12.435902968869 maxlinkdim=200 time=9.990
After sweep 3 energy=-13.313706521970 maxlinkdim=400 time=18.359
After sweep 4 energy=-13.520903592410 maxlinkdim=800 time=15.075
After sweep 5 energy=-13.633916035668 maxlinkdim=2000 time=22.664
After sweep 6 energy=-13.693184708844 maxlinkdim=2999 time=33.091
After sweep 7 energy=-13.711966030241 maxlinkdim=3000 time=37.860
After sweep 8 energy=-13.715767489573 maxlinkdim=3000 time=38.694
After sweep 9 energy=-13.716428672323 maxlinkdim=3000 time=40.026
After sweep 10 energy=-13.716542905483 maxlinkdim=2999 time=38.636

so again around a 9% speedup. These speedups are very similar to the ones seen with the same design change in C++ ITensor: https://github.com/ITensor/ITensor/pull/336.

emstoudenmire commented 3 years ago

Good to see those timings: 9% seems like a lot to me for a code that's already so optimized to begin with.

mtfishman commented 3 years ago

To be fair, some of the block sparse code was not as optimized as it could have been (for example, the block sparse addition was pretty simplistic). But I saw large speedups with this design for very sparse contractions, so I think the speedups here go beyond what would have been possible with the previous design.
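To illustrate why very sparse operations benefit (again a hypothetical sketch, not the NDTensors.jl implementation), with a Dict-keyed block layout, block sparse addition reduces to merging two block dictionaries, where each block lookup is a single O(1) hash access rather than a binary search:

```julia
# Illustrative sketch: block sparse addition with Dict-keyed block storage.
# Hypothetical representation, not the actual NDTensors.jl data structures:
# each tensor is just a Dict mapping a block label to its dense data.

const Block = NTuple{2,Int}

function blocksparse_add(A::Dict{Block,Matrix{Float64}},
                         B::Dict{Block,Matrix{Float64}})
    C = Dict{Block,Matrix{Float64}}()
    # Copy the blocks of A, then accumulate the blocks of B. Only blocks
    # that actually exist are ever touched, and each `haskey`/indexing
    # operation is an O(1) Dict access.
    for (b, data) in A
        C[b] = copy(data)
    end
    for (b, data) in B
        if haskey(C, b)
            C[b] .+= data
        else
            C[b] = copy(data)
        end
    end
    return C
end

A = Dict{Block,Matrix{Float64}}((1, 1) => ones(2, 2))
B = Dict{Block,Matrix{Float64}}((1, 1) => ones(2, 2), (2, 2) => 2ones(2, 2))
C = blocksparse_add(A, B)

@assert C[(1, 1)] == fill(2.0, 2, 2)  # overlapping block: summed
@assert C[(2, 2)] == fill(2.0, 2, 2)  # block only in B: copied
```

With the previous sorted-Vector design, each of those per-block lookups would instead be a binary search, which adds up when the tensors have many small blocks.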