cyclops-community / ctf

Cyclops Tensor Framework: parallel arithmetic on multidimensional arrays
Other
199 stars 53 forks source link

Efficiency comparisons with einsum and opt-einsum and how to utilize symmetry? #126

Closed ghost closed 2 years ago

ghost commented 3 years ago

I have the following code

import numpy as np
import time
import ctf
import opt_einsum

N = 30

T1 = np.random.rand(N, N, N, N)
T2 = np.random.rand(N, N, N, N) 
T3 = np.zeros((N,N,N,N))  

# symmetrize T1/T2
for a in range(N):
  for b in range(a+1):     #include a
    for c in range(N):
      for d in range(N):
        T1[b,a,d,c] = T1[a,b,c,d]
        T2[b,a,d,c] = T2[a,b,c,d]        

# exam symmetry        
for a in range(N):
  for b in range(N):
    for c in range(N):
      for d in range(N):
        if abs(T1[b,a,d,c] - T1[a,b,c,d] ) > 1.e-5:  
          print('T1', a,b,c,d, T1[b,a,d,c], T1[a,b,c,d])
        if abs(T2[b,a,d,c] - T2[a,b,c,d] ) > 1.e-5:  
          print('T2', a,b,c,d, T2[b,a,d,c], T2[a,b,c,d])

start = time.time()
out = opt_einsum.contract('ABab,abCD->ABCD', T1, T2)#, optimize="dynamic-programming")
end = time.time()

print('time0',end - start)
print(out[0,1,2,3],out[1,0,3,2])

start = time.time()
out = np.einsum('ABab,abCD->ABCD', T1, T2)
end = time.time()

print('time1',end - start)
print(out[0,1,2,3],out[1,0,3,2])

start = time.time()
out = np.einsum('ABab,abCD->ABCD', T1, T2, optimize=True)
end = time.time()

print('time2',end - start)
print(out[0,1,2,3],out[1,0,3,2])

start = time.time()
out3 = ctf.einsum('ABab,abCD->ABCD', T1, T2)
end = time.time()

print('time3',end - start)
print(out[0,1,2,3],out[1,0,3,2])

(1) it seems for various N, einsum with optimize=True is always faster than ctf, is there anything like optimize=True in ctf?

E.g., N=35,

time opt_einsum 2.7050588130950928 time einsum 14.024635314941406 time einsum optimize 1.4009654521942139 time ctf 5.522147178649902

(2) by T1[a,b,c,d] = T1[b,a,d,c] and T2[a,b,c,d] = T2[b,a,d,c], if T3[ABCD] = \sum_ab T1[A,B,a,b] * T2[a,b,C,D], then T3[A,B,C,D] = T3[B,A,D,C]. How to utilize the symmetry here? I found some entries about symmetry in https://solomon2.web.engr.illinois.edu/ctf/index.html, but not in python.

solomonik commented 3 years ago

@Montagna2023 for numpy 'optimize=True' toggles the choice of algorithm to find the best order of contractions, so it should not really affect things for the contraction of two tensors. If you see a difference between the first two numpy execution times, I would wonder if its due to warm-up/cache-effects. CTF would need to convert from numpy array to CTF tensor, which also at least requires an extra copy. While CTF can also have other overheads relative to numpy in sequential/threaded-MKL execution (also needs to be correctly linked to MKL and not reference BLAS), once the tensors are reasonably large it should be competitive.

To utilize symmetry, in python you can define

B = ctf.tensor((4,4,4,4),sym=[ctf.SYM.NS,ctf.SYM.NS,ctf.SYM.SY,ctf.SYM.NS])

or ctf.SYM.AS for skew(anti)-symmetric. Some python end-functions are not being tested with symmetric tensors though, so bugs could arise (would recommend to get any code working without symmetries first, then add them in subsequently). Once a tensor is defined to be symmetric, you need only input the unique entries, e.g., via write(), and if you do B.i("ij")<<A.i("ij") with B symmetric and A nonsymmetric (or any einsum where output is symmetric and symmetry is not already present in one of the tensors in the input expression), B = A + A.T will be computed. For the contraction you are doing, if symmetries are constrained to (a,b), (c,d), and (C,D), CTF should be able to save the factor of 8 in flops (may not be quite that much faster though) and will not do symmetrization since it is preserved in the expression. To make sure the output symmetry is captured, you would also want to use either the A.i() syntax, or ctf.einsum("...",...,out=output_tensor).

Hope this is helpful.

ghost commented 3 years ago

@solomonik , Thank you so much! I am 200% appreciate your help. For N=30 or 50 in the above code, time for ctf is similar to einsum with optimize =True and opt-einsum,

But I am too stupid to understand a couple of points.

  1. From https://solomon2.web.engr.illinois.edu/ctf/,

A CTF::Tensor is an arbitrary-dimensional distributed array, which can be defined as usual on any CTF::World. The symmetry is specified via an array of integers (elements of enum {NS–nonsymmetric, SY–symmetric, AS–antisymmetric, and SH–symmetric hollow}) of length equal to the number of dimensions, with the entry i of the symmetric array specifying the symmetric relation between index i and index i+1.

If the symmetry denotes to between index i and i+1, for a rank-4 tensor ctf.tensor((4,4,4,4),sym=[ctf.SYM.NS,ctf.SYM.NS,ctf.SYM.SY,ctf.SYM.NS]), I can imagine the first ctf.SYM.NS, for indices 01, then 12, 23, but what is the role for the 4th one, ctf.SYM.NS?

  1. If I would like to realize a symmetry, only for a permutation T[A,B,C,D] = T[B,A,D,C], not T[A,B,C,D] and T[B,A,C,D] (suppose no that symmetry between only permute the first two or the last two), what would be the input? I tried to implement your suggestion by using
A = ctf.tensor([N,N,N,N],sym=[ctf.SYM.NS,ctf.SYM.NS,ctf.SYM.SY,ctf.SYM.NS])
B = ctf.tensor([N,N,N,N],sym=[ctf.SYM.NS,ctf.SYM.NS,ctf.SYM.SY,ctf.SYM.NS])
C = ctf.tensor([N,N,N,N],sym=[ctf.SYM.NS,ctf.SYM.NS,ctf.SYM.SY,ctf.SYM.NS])

A = T1
B = T2

start = time.time()
C = ctf.einsum('ABab,abCD->ABCD', A, B)
end = time.time()

print('time ctf symmetrize',end - start)
print(C[0,1,2,3],C[1,0,3,2]) 

and got the same result. (setting out = C leads to other numerical results)

But concerning the timing, it is similar to directly use ctf.einsum without introducing ctf.tensor with the symmetries. Maybe there is not enough symmetries to see the speed up. I hope using symmetry could be a part of https://solomonik.cs.illinois.edu/demos/CTF_introductory_demo.html

Thank you once again for your time.

solomonik commented 3 years ago

Regarding (1), the last symmetry label should simply always be NS for CTF, even if the tensor is fully symmetric.

Regarding (2), I am aware that this symmetry appears in the 2-electron integral tensor, but there is no explicit way to express that coupled symmetry in CTF. One could symmetrize and antisymmetrize T to break it up into two tensors that sum to T and both have independent symmetries among each pair of indices.

Indeed you may not see significant speed-up from using symmetries in CTF. Not sure for what size tensors you test, but the larger the more likely you would actually see the 2X speed-up. In some cases, symmetries may require unpacking/packing the data and actually cause overhead. One thing using symmetries does save is the amount of memory needed to store the tensor (additional memory will be used at contraction-time for the tensors that are being contracted though).

ghost commented 2 years ago

For N=50,

with

A = ctf.tensor([N,N,N,N],sym=[ctf.SYM.SY,ctf.SYM.NS,ctf.SYM.SY,ctf.SYM.NS])
B = ctf.tensor([N,N,N,N],sym=[ctf.SYM.SY,ctf.SYM.NS,ctf.SYM.NS,ctf.SYM.NS])
C = ctf.tensor([N,N,N,N],sym=[ctf.SYM.SY,ctf.SYM.NS,ctf.SYM.NS,ctf.SYM.NS])

A = T1
B = T2

start = time.time()
C = ctf.einsum('ABab,abCD->ABCD', A, B)
end = time.time()
print('time ctf symmetrize',round(end - start,2) )

the performance is so far

time opt_einsum 3 time einsum 26.8 time einsum optimize 2.83 time ctf 9.55 time ctf symmetrize 9.29.

Maybe I linked with blas dynamically. Is there any function like numpy.show_config() (in my case openblas) can show which BLAS has been linked with ctf? Hence, I can make sure if the BLAS is configurated with ctf properly?

EDITED: I found two ways to speed up the python-ctf calculations:

  1. Use static library in ./configure, in my PC, using ./configure LIB_PATH="-L/opt/OpenBLAS/lib" LIBS="-lopenblas", it gets faster than dynamic library in my case. But not much speed up by symmetry.
  2. use ctf-array than numpy-array, e.g., A = ctf.from_nparray(T1) B = ctf.from_nparray(T2) supplement the above code (sorry that spitted segmentally). Then, symmetry could help.

My whole test code is

import numpy as np
import time
import ctf
import opt_einsum
import statistics 
#import scipy.stats as st
import scipy.stats

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
# from https://stackoverflow.com/questions/15033511/compute-a-confidence-interval-from-sample-data
def mean_confidence_interval(data, var_name, unit, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)

    print(var_name, round(m, 2), "\u00B1",  round(h, 2), unit )

    #return round(m, 2), round(h, 2)

N = 50
n_times = 10
n_digit = 2

T1_0 = np.random.rand(N, N, N, N)
T2_0 = np.random.rand(N, N, N, N)

T1_1 = np.zeros((N, N, N, N))
T2_1 = np.zeros((N, N, N, N)) 

T1 = np.zeros((N, N, N, N))
T2 = np.zeros((N, N, N, N)) 

T3 = np.zeros((N,N,N,N))  

# symmetrize T1/T2, such that the T[a,b,c,d] = T[b,a,c,d] = T[a,b,d,c]
for a in range(N):
  for b in range(N):     #include a
    for c in range(N):
      for d in range(N):
        T1_1[a,b,c,d] = ( T1_0[a,b,c,d] + T1_0[b,a,c,d] ) * 0.5
        T2_1[a,b,c,d] = ( T2_0[a,b,c,d] + T2_0[b,a,c,d] ) * 0.5          

for a in range(N):
  for b in range(N):     #include a
    for c in range(N):
      for d in range(N):
        T1[a,b,c,d] = ( T1_1[a,b,c,d] + T1_1[a,b,d,c] ) * 0.5
        T2[a,b,c,d] = ( T2_1[a,b,c,d] + T2_1[a,b,d,c] ) * 0.5          

# exam symmetry        
for a in range(N):
  for b in range(N):
    for c in range(N):
      for d in range(N):
        if abs(T1[b,a,c,d] - T1[a,b,c,d] ) > 1.e-5:  
          print('T1', a,b,c,d, T1[b,a,c,d], T1[a,b,c,d])
        if abs(T2[b,a,c,d] - T2[a,b,c,d] ) > 1.e-5:  
          print('T2', a,b,c,d, T2[b,a,c,d], T2[a,b,c,d])

list_time = []

for i in range(n_times):
    start = time.time()
    out = opt_einsum.contract('ABab,abCD->ABCD', T1, T2)#, optimize="dynamic-programming")
    end = time.time()

   # print('time opt_einsum',round(end - start, n_digit))
    list_time.append(end - start)

#print('time opt_einsum', round(statistics.mean(list_time),2), "\u00B1", round(statistics.pstdev(list_time),2) )
#print(out[0,1,2,3],out[1,0,3,2])

#print( scipy.stats.t.interval(alpha=0.95, df=len(list_time)-1, loc=np.mean(list_time), scale=scipy.stats.sem(list_time))  )
#print( 
mean_confidence_interval( list_time, 'opt_einsum', 's' )

list_time = []
for i in range(n_times):
    start = time.time()
    out = np.einsum('ABab,abCD->ABCD', T1, T2)
    end = time.time()
    list_time.append(end - start)

#print('time einsum', round(end - start, n_digit) )
#print(out[0,1,2,3],out[1,0,3,2])
mean_confidence_interval( list_time, 'einsum', 's' )

list_time = []
for i in range(n_times):
    start = time.time()
    out = np.einsum('ABab,abCD->ABCD', T1, T2, optimize=True)
    end = time.time()
    list_time.append(end - start)

mean_confidence_interval( list_time, 'einsum, optimize', 's' )

#print('time einsum optimize', round(end - start, n_digit) )
#print(out[0,1,2,3],out[1,0,3,2])
list_time = []
for i in range(n_times):
    start = time.time()
    out3 = ctf.einsum('ABab,abCD->ABCD', T1, T2)
    end = time.time()
    list_time.append(end - start)

mean_confidence_interval( list_time, 'ctf', 's' )

#print('time ctf',round(end - start, n_digit) )
#print(out3[0,1,2,3],out3[1,0,3,2])

A = ctf.tensor([N,N,N,N],sym=[ctf.SYM.SY,ctf.SYM.NS,ctf.SYM.SY,ctf.SYM.NS])
B = ctf.tensor([N,N,N,N],sym=[ctf.SYM.SY,ctf.SYM.NS,ctf.SYM.SY,ctf.SYM.NS])
C = ctf.tensor([N,N,N,N],sym=[ctf.SYM.SY,ctf.SYM.NS,ctf.SYM.SY,ctf.SYM.NS])

#A.fill_random()
#B.fill_random()

#print(type(A))

A = ctf.from_nparray(T1)
B = ctf.from_nparray(T2)
#print(type(A))

list_time = []
for i in range(n_times):
    start = time.time()
    C = ctf.einsum('ABab,abCD->ABCD', A, B)
    end = time.time()
    list_time.append(end - start)

mean_confidence_interval( list_time, 'ctf-symmetrize', 's' )

#print('time ctf symmetrize',round(end - start, n_digit) )
#print(C[0,1,2,3],C[1,0,3,2])

the result is

opt_einsum 1.92 ± 0.14 s
einsum 14.42 ± 2.28 s
einsum, optimize 2.11 ± 0.56 s
ctf 2.15 ± 0.07 s
ctf-symmetrize 1.37 ± 0.01 s

For comparison purpose, my numpy.show_config() is (still don't know why use '/usr/local/lib' cannot work in ./configure)

>>> numpy.show_config()
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/usr/local/lib']
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/usr/local/lib']
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/usr/local/lib']
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/usr/local/lib']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
    not found = AVX512F,AVX512CD,AVX512_KNL,AVX512_KNM,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
>>>