Closed ghost closed 3 years ago
@Montagna2023 for numpy 'optimize=True' toggles the choice of algorithm to find the best order of contractions, so it should not really affect things for the contraction of two tensors. If you see a difference between the first two numpy execution times, I would wonder if its due to warm-up/cache-effects. CTF would need to convert from numpy array to CTF tensor, which also at least requires an extra copy. While CTF can also have other overheads relative to numpy in sequential/threaded-MKL execution (also needs to be correctly linked to MKL and not reference BLAS), once the tensors are reasonably large it should be competitive.
To utilize symmetry, in python you can define
B = ctf.tensor((4,4,4,4),sym=[ctf.SYM.NS,ctf.SYM.NS,ctf.SYM.SY,ctf.SYM.NS])
or ctf.SYM.AS for skew(anti)-symmetric. Some python end-functions are not being tested with symmetric tensors though, so bugs could arise (would recommend to get any code working without symmetries first, then add them in subsequently). Once a tensor is defined to be symmetric, you need only input the unique entries, e.g., via write()
, and if you do B.i("ij")<<A.i("ij") with B symmetric and A nonsymmetric (or any einsum where output is symmetric and symmetry is not already present in one of the tensors in the input expression), B = A + A.T will be computed. For the contraction you are doing, if symmetries are constrained to (a,b), (c,d), and (C,D), CTF should be able to save the factor of 8 in flops (may not be quite that much faster though) and will not do symmetrization since it is preserved in the expression. To make sure the output symmetry is captured, you would also want to use either the A.i() syntax, or ctf.einsum("...",...,out=output_tensor).
Hope this is helpful.
@solomonik , Thank you so much! I am 200% appreciate your help. For N=30 or 50
in the above code, time for ctf is similar to einsum with optimize =True
and opt-einsum
,
But I am too stupid to understand a couple of points.
https://solomon2.web.engr.illinois.edu/ctf/
,A CTF::Tensor is an arbitrary-dimensional distributed array, which can be defined as usual on any CTF::World. The symmetry is specified via an array of integers (elements of enum {NS–nonsymmetric, SY–symmetric, AS–antisymmetric, and SH–symmetric hollow}) of length equal to the number of dimensions, with the entry i of the symmetric array specifying the symmetric relation between index i and index i+1.
If the symmetry denotes to between index i and i+1, for a rank-4 tensor ctf.tensor((4,4,4,4),sym=[ctf.SYM.NS,ctf.SYM.NS,ctf.SYM.SY,ctf.SYM.NS])
, I can imagine the first ctf.SYM.NS,
for indices 01, then 12, 23, but what is the role for the 4th one, ctf.SYM.NS
?
T[A,B,C,D] = T[B,A,D,C]
, not T[A,B,C,D] and T[B,A,C,D]
(suppose no that symmetry between only permute the first two or the last two), what would be the input? I tried to implement your suggestion
by usingA = ctf.tensor([N,N,N,N],sym=[ctf.SYM.NS,ctf.SYM.NS,ctf.SYM.SY,ctf.SYM.NS])
B = ctf.tensor([N,N,N,N],sym=[ctf.SYM.NS,ctf.SYM.NS,ctf.SYM.SY,ctf.SYM.NS])
C = ctf.tensor([N,N,N,N],sym=[ctf.SYM.NS,ctf.SYM.NS,ctf.SYM.SY,ctf.SYM.NS])
A = T1
B = T2
start = time.time()
C = ctf.einsum('ABab,abCD->ABCD', A, B)
end = time.time()
print('time ctf symmetrize',end - start)
print(C[0,1,2,3],C[1,0,3,2])
and got the same result. (setting out = C
leads to other numerical results)
But concerning the timing, it is similar to directly use ctf.einsum
without introducing ctf.tensor
with the symmetries. Maybe there is not enough symmetries to see the speed up. I hope using symmetry could be a part of https://solomonik.cs.illinois.edu/demos/CTF_introductory_demo.html
Thank you once again for your time.
Regarding (1), the last symmetry label should simply always be NS for CTF, even if the tensor is fully symmetric.
Regarding (2), I am aware that this symmetry appears in the 2-electron integral tensor, but there is no explicit way to express that coupled symmetry in CTF. One could symmetrize and antisymmetrize T to break it up into two tensors that sum to T and both have independent symmetries among each pair of indices.
Indeed you may not see significant speed-up from using symmetries in CTF. Not sure for what size tensors you test, but the larger the more likely you would actually see the 2X speed-up. In some cases, symmetries may require unpacking/packing the data and actually cause overhead. One thing using symmetries does save is the amount of memory needed to store the tensor (additional memory will be used at contraction-time for the tensors that are being contracted though).
For N=50
,
with
A = ctf.tensor([N,N,N,N],sym=[ctf.SYM.SY,ctf.SYM.NS,ctf.SYM.SY,ctf.SYM.NS])
B = ctf.tensor([N,N,N,N],sym=[ctf.SYM.SY,ctf.SYM.NS,ctf.SYM.NS,ctf.SYM.NS])
C = ctf.tensor([N,N,N,N],sym=[ctf.SYM.SY,ctf.SYM.NS,ctf.SYM.NS,ctf.SYM.NS])
A = T1
B = T2
start = time.time()
C = ctf.einsum('ABab,abCD->ABCD', A, B)
end = time.time()
print('time ctf symmetrize',round(end - start,2) )
the performance is so far
time opt_einsum 3 time einsum 26.8 time einsum optimize 2.83 time ctf 9.55 time ctf symmetrize 9.29.
Maybe I linked with blas dynamically. Is there any function like numpy.show_config()
(in my case openblas) can show which BLAS has been linked with ctf
? Hence, I can make sure if the BLAS
is configurated with ctf
properly?
EDITED: I found two ways to speed up the python-ctf calculations:
./configure
, in my PC, using ./configure LIB_PATH="-L/opt/OpenBLAS/lib" LIBS="-lopenblas"
,
it gets faster than dynamic library in my case. But not much speed up by symmetry.ctf-array
than numpy-array
, e.g.,
A = ctf.from_nparray(T1) B = ctf.from_nparray(T2)
supplement the above code (sorry that spitted segmentally). Then, symmetry could help.My whole test code is
import numpy as np
import time
import ctf
import opt_einsum
import statistics
#import scipy.stats as st
import scipy.stats
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
# from https://stackoverflow.com/questions/15033511/compute-a-confidence-interval-from-sample-data
def mean_confidence_interval(data, var_name, unit, confidence=0.95):
a = 1.0 * np.array(data)
n = len(a)
m, se = np.mean(a), scipy.stats.sem(a)
h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)
print(var_name, round(m, 2), "\u00B1", round(h, 2), unit )
#return round(m, 2), round(h, 2)
N = 50
n_times = 10
n_digit = 2
T1_0 = np.random.rand(N, N, N, N)
T2_0 = np.random.rand(N, N, N, N)
T1_1 = np.zeros((N, N, N, N))
T2_1 = np.zeros((N, N, N, N))
T1 = np.zeros((N, N, N, N))
T2 = np.zeros((N, N, N, N))
T3 = np.zeros((N,N,N,N))
# symmetrize T1/T2, such that the T[a,b,c,d] = T[b,a,c,d] = T[a,b,d,c]
for a in range(N):
for b in range(N): #include a
for c in range(N):
for d in range(N):
T1_1[a,b,c,d] = ( T1_0[a,b,c,d] + T1_0[b,a,c,d] ) * 0.5
T2_1[a,b,c,d] = ( T2_0[a,b,c,d] + T2_0[b,a,c,d] ) * 0.5
for a in range(N):
for b in range(N): #include a
for c in range(N):
for d in range(N):
T1[a,b,c,d] = ( T1_1[a,b,c,d] + T1_1[a,b,d,c] ) * 0.5
T2[a,b,c,d] = ( T2_1[a,b,c,d] + T2_1[a,b,d,c] ) * 0.5
# exam symmetry
for a in range(N):
for b in range(N):
for c in range(N):
for d in range(N):
if abs(T1[b,a,c,d] - T1[a,b,c,d] ) > 1.e-5:
print('T1', a,b,c,d, T1[b,a,c,d], T1[a,b,c,d])
if abs(T2[b,a,c,d] - T2[a,b,c,d] ) > 1.e-5:
print('T2', a,b,c,d, T2[b,a,c,d], T2[a,b,c,d])
list_time = []
for i in range(n_times):
start = time.time()
out = opt_einsum.contract('ABab,abCD->ABCD', T1, T2)#, optimize="dynamic-programming")
end = time.time()
# print('time opt_einsum',round(end - start, n_digit))
list_time.append(end - start)
#print('time opt_einsum', round(statistics.mean(list_time),2), "\u00B1", round(statistics.pstdev(list_time),2) )
#print(out[0,1,2,3],out[1,0,3,2])
#print( scipy.stats.t.interval(alpha=0.95, df=len(list_time)-1, loc=np.mean(list_time), scale=scipy.stats.sem(list_time)) )
#print(
mean_confidence_interval( list_time, 'opt_einsum', 's' )
list_time = []
for i in range(n_times):
start = time.time()
out = np.einsum('ABab,abCD->ABCD', T1, T2)
end = time.time()
list_time.append(end - start)
#print('time einsum', round(end - start, n_digit) )
#print(out[0,1,2,3],out[1,0,3,2])
mean_confidence_interval( list_time, 'einsum', 's' )
list_time = []
for i in range(n_times):
start = time.time()
out = np.einsum('ABab,abCD->ABCD', T1, T2, optimize=True)
end = time.time()
list_time.append(end - start)
mean_confidence_interval( list_time, 'einsum, optimize', 's' )
#print('time einsum optimize', round(end - start, n_digit) )
#print(out[0,1,2,3],out[1,0,3,2])
list_time = []
for i in range(n_times):
start = time.time()
out3 = ctf.einsum('ABab,abCD->ABCD', T1, T2)
end = time.time()
list_time.append(end - start)
mean_confidence_interval( list_time, 'ctf', 's' )
#print('time ctf',round(end - start, n_digit) )
#print(out3[0,1,2,3],out3[1,0,3,2])
A = ctf.tensor([N,N,N,N],sym=[ctf.SYM.SY,ctf.SYM.NS,ctf.SYM.SY,ctf.SYM.NS])
B = ctf.tensor([N,N,N,N],sym=[ctf.SYM.SY,ctf.SYM.NS,ctf.SYM.SY,ctf.SYM.NS])
C = ctf.tensor([N,N,N,N],sym=[ctf.SYM.SY,ctf.SYM.NS,ctf.SYM.SY,ctf.SYM.NS])
#A.fill_random()
#B.fill_random()
#print(type(A))
A = ctf.from_nparray(T1)
B = ctf.from_nparray(T2)
#print(type(A))
list_time = []
for i in range(n_times):
start = time.time()
C = ctf.einsum('ABab,abCD->ABCD', A, B)
end = time.time()
list_time.append(end - start)
mean_confidence_interval( list_time, 'ctf-symmetrize', 's' )
#print('time ctf symmetrize',round(end - start, n_digit) )
#print(C[0,1,2,3],C[1,0,3,2])
the result is
opt_einsum 1.92 ± 0.14 s
einsum 14.42 ± 2.28 s
einsum, optimize 2.11 ± 0.56 s
ctf 2.15 ± 0.07 s
ctf-symmetrize 1.37 ± 0.01 s
For comparison purpose, my numpy.show_config()
is (still don't know why use '/usr/local/lib'
cannot work in ./configure)
>>> numpy.show_config()
blas_mkl_info:
NOT AVAILABLE
blis_info:
NOT AVAILABLE
openblas_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
runtime_library_dirs = ['/usr/local/lib']
blas_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
runtime_library_dirs = ['/usr/local/lib']
lapack_mkl_info:
NOT AVAILABLE
openblas_lapack_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
runtime_library_dirs = ['/usr/local/lib']
lapack_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
runtime_library_dirs = ['/usr/local/lib']
Supported SIMD extensions in this NumPy install:
baseline = SSE,SSE2,SSE3
found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
not found = AVX512F,AVX512CD,AVX512_KNL,AVX512_KNM,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
>>>
I have the following code
(1) it seems for various
N
,einsum
withoptimize=True
is always faster thanctf
, is there anything likeoptimize=True
inctf
?E.g.,
N=35
,time opt_einsum 2.7050588130950928 time einsum 14.024635314941406 time einsum optimize 1.4009654521942139 time ctf 5.522147178649902
(2) by
T1[a,b,c,d] = T1[b,a,d,c]
andT2[a,b,c,d] = T2[b,a,d,c]
, ifT3[ABCD] = \sum_ab T1[A,B,a,b] * T2[a,b,C,D]
, thenT3[A,B,C,D] = T3[B,A,D,C]
. How to utilize the symmetry here? I found some entries about symmetry in https://solomon2.web.engr.illinois.edu/ctf/index.html, but not in python.