NVIDIA / cuQuantum

Home for cuQuantum Python & NVIDIA cuQuantum SDK C++ samples
https://docs.nvidia.com/cuda/cuquantum/
BSD 3-Clause "New" or "Revised" License

Multithreaded cutn optimization issue #102

Closed sss441803 closed 6 months ago

sss441803 commented 7 months ago

I have the following code for contraction optimization using multiple threads, with the number of samples set to 64. What I observe is that with 1 thread the optimization takes 219 seconds, and with 64 threads it takes 89 seconds; the resulting path quality is not very different. I expected a much faster time to solution with 64 threads, and I can see that CPU utilization does go above 6000% for a substantial amount of time.

The machine has a single 32-core/64-thread AMD Zen 3 (Milan) CPU and an A100 GPU. Even with 32 threads, CPU utilization goes above 3100%, and the time for 64 samples is 65 seconds with similar path quality. With 8 threads, the time is 47 seconds.
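
For reference, the timings reported above work out to the following speedups and parallel efficiencies (a quick back-of-the-envelope check, not numbers from the issue itself):

```python
# Parallel speedup and efficiency from the reported timings.
# speedup = T(1 thread) / T(n threads); efficiency = speedup / n.
t1 = 219.0  # seconds with 1 thread
timings = {8: 47.0, 32: 65.0, 64: 89.0}  # seconds per thread count

for n, t in sorted(timings.items()):
    speedup = t1 / t
    efficiency = speedup / n
    print(f"{n:2d} threads: speedup {speedup:.2f}x, efficiency {efficiency:.1%}")
```

Note that 8 threads is both the fastest configuration and by far the most efficient; at 64 threads the parallel efficiency drops below 4%.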

Versions:

cuquantum-python-cu11     23.3.0                   pypi_0    pypi
custatevec-cu11           1.5.0                    pypi_0    pypi
cutensor-cu11             1.7.0                    pypi_0    pypi
cutensornet-cu11          2.3.0                    pypi_0    pypi

Code:

import cuquantum.cutensornet as cutn
from cuquantum.cutensornet import configuration
from cuquantum import Network
import numpy as np
import time

expression = 'a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,À,Á,Â,Ã,yÜ,rÕ,ÜÕ,Jç,mÐ,çÐ,Iæ,Pí,æí,qÔ,vÙ,ÔÙ,Mê,lÏ,êÏ,Àø,Rï,øï,Ãû,Hå,ûå,Tñ,AÞ,ñÞ,uØ,oÒ,ØÒ,jÍ,Lé,Íé,cÆ,Bß,Æß,zÝ,kÎ,ÝÎ,Áù,xÛ,ùÛ,Wô,iÌ,ôÌ,nÑ,Gä,Ñä,Eâ,fÉ,âÉ,Uò,wÚ,òÚ,eÈ,sÖ,ÈÖ,t×,Xõ,×õ,Z÷,pÓ,÷Ó,Yö,Kè,öè,aÄ,hË,ÄË,dÇ,bÅ,ÇÅ,Oì,Fã,ìã,Sð,Âú,ðú,gÊ,Cà,Êà,Dá,Vó,áó,Qî,Në,îë,éď,øĆ,ďĆ,Éě,êĄ,ěĄ,ßđ,ùĔ,đĔ,ÇĨ,ÑĘ,ĨĘ,ÈĞ,Åĩ,Ğĩ,Ùă,ôĖ,ăĖ,ûĈ,ðĬ,ĈĬ,úĭ,æĀ,ĭĀ,èĥ,õġ,ĥġ,ãī,åĉ,īĉ,Òč,âĚ,čĚ,Ìė,Ûĕ,ėĕ,Ëħ,ØČ,ħČ,ïć,óı,ćı,Ðÿ,äę,ÿę,Öğ,íā,ğā,Óģ,Îē,ģē,Üü,ñĊ,üĊ,ìĪ,ÆĐ,ĪĐ,Úĝ,Õý,ĝý,ÍĎ,áİ,Ďİ,çþ,ÄĦ,þĦ,Þċ,ëij,ċij,ÊĮ,àį,Įį,×Ġ,÷Ģ,ĠĢ,Ïą,îIJ,ąIJ,öĤ,ÝĒ,ĤĒ,ÔĂ,òĜ,ĂĜ,Ĉŀ,ÿŐ,ŀŐ,Ēũ,ĀŃ,ũŃ,Ĝū,ğŒ,ūŒ,ýś,ēŕ,śŕ,đĸ,ĔĹ,ĸĹ,ĮŢ,Ěʼn,Ţʼn,Ċŗ,ĂŪ,ŗŪ,ĎŜ,Đř,Ŝř,ĘĻ,ăľ,Ļľ,ıŏ,ĝŚ,ŏŚ,ęő,īņ,őņ,þŞ,Ĩĺ,Şĺ,ĉŇ,ėŊ,ŇŊ,İŝ,ćŎ,ŝŎ,ĭł,ĬŁ,łŁ,ěĶ,ċŠ,ĶŠ,ĕŋ,ĥń,ŋń,üŖ,Ħş,Ŗş,čň,ĤŨ,ňŨ,ġŅ,ďĴ,ŅĴ,ĩĽ,IJŧ,Ľŧ,Ćĵ,ąŦ,ĵŦ,ħŌ,Ģť,Ōť,āœ,ĖĿ,œĿ,ĪŘ,ģŔ,ŘŔ,Ğļ,ijš,ļš,Čō,ĠŤ,ōŤ,Ąķ,įţ,ķţ,ŘƜ,Ŋƅ,Ɯƅ,Ńů,şƏ,ůƏ,ŦƗ,ľŽ,ƗŽ,Ŝź,ŀŬ,źŬ,ŠƋ,ļƞ,Ƌƞ,ŗŸ,ūŰ,ŸŰ,ŇƄ,ŎƇ,ƄƇ,śŲ,ţƣ,Ųƣ,ņƁ,Ŀƛ,Ɓƛ,Ťơ,ŋƌ,ơƌ,ĴƓ,ōƠ,ƓƠ,œƚ,Śſ,ƚſ,ŖƎ,Ĺŵ,Ǝŵ,ʼnŷ,ĵƖ,ŷƖ,ńƍ,šƟ,ƍƟ,ĸŴ,łƈ,Ŵƈ,ŪŹ,ŔƝ,ŹƝ,ŁƉ,řŻ,ƉŻ,ŕų,ĶƊ,ųƊ,ŝƆ,őƀ,Ɔƀ,ĺƃ,ķƢ,ƃƢ,ũŮ,Œű,Ůű,Ļż,ŞƂ,żƂ,ŧƕ,ňƐ,ƕƐ,Őŭ,ŢŶ,ŭŶ,ŏž,ťƙ,žƙ,ŌƘ,Ņƒ,Ƙƒ,ŨƑ,ĽƔ,ƑƔ,ŹDŽ,Ɵǁ,DŽǁ,ƈǃ,ƝDž,ǃDž,ƛƵ,ųLj,ƵLj,Ŵǂ,ƀNj,ǂNj,ŲƲ,ŷƾ,Ʋƾ,ƇƱ,ƎƼ,ƱƼ,Ɗlj,űǏ,ljǏ,ƏƧ,ƣƳ,ƧƳ,Ơƹ,ƅƥ,ƹƥ,Ɩƿ,ƓƸ,ƿƸ,ŻLJ,ƙǗ,LJǗ,žǖ,ƌƷ,ǖƷ,ƐǓ,ŰƯ,ǓƯ,ƔǛ,Ůǎ,Ǜǎ,Ƒǚ,Ɨƨ,ǚƨ,Ƅư,ŸƮ,ưƮ,ơƶ,ƞƭ,ƶƭ,ŵƽ,ƍǀ,ƽǀ,Ŭƫ,żǐ,ƫǐ,ŶǕ,źƪ,Ǖƪ,ƚƺ,ƃnj,ƺnj,ŭǔ,ƢǍ,ǔǍ,ƒǙ,Ɓƴ,Ǚƴ,ƋƬ,Ƙǘ,Ƭǘ,ŽƩ,ſƻ,Ʃƻ,ƕǒ,ƆNJ,ǒNJ,ƂǑ,Ɖdž,Ǒdž,ůƦ,ƜƤ,ƦƤ,ƴȉ,ƿǮ,ȉǮ,Ǎȇ,ǒȎ,ȇȎ,ƶǼ,ƫȀ,ǼȀ,ǑȐ,Ǐǩ,Ȑǩ,ǗDZ,ƪȃ,DZȃ,ƵǠ,džȑ,Ǡȑ,ƹǬ,Ljǡ,Ǭǡ,ƨǹ,ƾǥ,ǹǥ,ƲǤ,ǚǸ,ǤǸ,ǛǶ,Ƹǯ,Ƕǯ,Ƥȓ,ǐȁ,ȓȁ,ǘȋ,ƺȄ,ȋȄ,Ƽǧ,Ƴǫ,ǧǫ,ưǺ,njȅ,Ǻȅ,ƬȊ,ǖDz,ȊDz,ǀǿ,ƧǪ,ǿǪ,Ưǵ,ljǨ,ǵǨ,Ʈǻ,ǕȂ,ǻȂ,ǔȆ,ǓǴ,ȆǴ,ƩȌ,ǁǝ,Ȍǝ,ƽǾ,ƻȍ,Ǿȍ,DŽǜ,Džǟ,ǜǟ,ǎǷ,ƱǦ,ǷǦ,ǂǢ,ǙȈ,ǢȈ,NJȏ,Ʒdz,ȏdz,ƭǽ,ǃǞ,ǽǞ,ƥǭ,Njǣ,ǭǣ,LJǰ,ƦȒ,ǰȒ,ǫȭ,Ǫȳ,ȭȳ,ǥȣ,ȓȨ,ȣȨ,ǼȘ,ǽɆ,ȘɆ,ǧȬ,ǩț,Ȭț,Ǩȵ,ǤȤ,ȵȤ,ȅȯ,ǣɉ,ȯɉ,DZȜ,ȍȽ,ȜȽ,ȇȖ,Ǣɂ,Ȗɂ,ǵȴ,Ǹȥ,ȴȥ,ǬȠ,ǰɊ,ȠɊ,ǺȮ,ȉȔ,ȮȔ,ǡȡ,dzɅ,ȡɅ,ȋȪ,Ǿȼ,Ȫȼ,ȃȝ,ǟȿ,ȝȿ,Ȏȗ,ȌȺ,ȗȺ,ǶȦ,ǠȞ,ȦȞ,ȁȩ,ǹȢ,ȩȢ,Ȃȷ,Ȇȸ,ȷȸ,ȏɄ,ǯȧ,Ʉȧ,ȊȰ,ǻȶ,Ȱȶ,ȑȟ,Ǯȕ,ȟȕ,Ȓɋ,ǝȻ,ɋȻ,ȐȚ,Ǵȹ,Țȹ,Dzȱ,ȈɃ,ȱɃ,ǦɁ,ǭɈ,ɁɈ,Ƿɀ,Ȁș,ɀș,Ǟɇ,ǿȲ,ɇȲ,Ȅȫ,ǜȾ,ȫȾ,ȗɨ,ȹɹ,ɨɹ,ȣɎ,Ȟɫ,Ɏɫ,Ȧɪ,Ʌɣ,ɪɣ,Ɉɽ,ȯɖ,ɽɖ,Ȯɠ,ȷɮ,ɠɮ,Țɸ,Ȱɲ,ɸɲ,Ȭɒ,Ȼɷ,ɒɷ,ȳɍ
,Ɇɑ,ɍɑ,Ƞɞ,ȥɝ,ɞɝ,ȭɌ,ɂɛ,Ɍɛ,Ⱦʃ,ȡɢ,ʃɢ,Ȩɏ,Ɂɼ,ɏɼ,Ȗɚ,ȴɜ,ɚɜ,ȧɱ,Ⱥɩ,ɱɩ,Ȝɘ,ɉɗ,ɘɗ,ȕɵ,ȸɯ,ɵɯ,Ȳʁ,Ʉɰ,ʁɰ,ɇʀ,Ȥɕ,ʀɕ,ȟɴ,ȫʂ,ɴʂ,ȶɳ,ȝɦ,ɳɦ,ɀɾ,Ȕɡ,ɾɡ,Ƚə,țɓ,əɓ,ȿɧ,Ȣɭ,ɧɭ,Șɐ,șɿ,ɐɿ,ȱɺ,ȵɔ,ɺɔ,Ɋɟ,ȼɥ,ɟɥ,Ȫɤ,ȩɬ,ɤɬ,ɋɶ,Ƀɻ,ɶɻ,ɾʬ,ɿʳ,ʬʳ,ɯʣ,ɟʶ,ʣʶ,ɧʰ,ɷʑ,ʰʑ,ɱʞ,ɗʡ,ʞʡ,ɦʫ,ɲʏ,ʫʏ,ɰʥ,ɑʓ,ʥʓ,ɖʋ,ɓʯ,ʋʯ,ɨʄ,ɣʉ,ʄʉ,ɽʊ,ɐʲ,ʊʲ,ɍʒ,ʁʤ,ʒʤ,Ɏʆ,ɢʙ,ʆʙ,əʮ,ɥʷ,ʮʷ,ɬʹ,ɞʔ,ʹʔ,ɤʸ,ɻʻ,ʸʻ,ʃʘ,ɵʢ,ʘʢ,ɴʨ,ɭʱ,ʨʱ,ɛʗ,ɮʍ,ʗʍ,ɏʚ,ɕʧ,ʚʧ,ɸʎ,ɚʜ,ʎʜ,ɒʐ,ɝʕ,ʐʕ,Ɍʖ,ɪʈ,ʖʈ,ɩʟ,ɹʅ,ʟʅ,ɫʇ,ɶʺ,ʇʺ,ɡʭ,ʂʩ,ʭʩ,ɔʵ,ɜʝ,ʵʝ,ɳʪ,ɺʴ,ʪʴ,ɼʛ,ɠʌ,ʛʌ,ɘʠ,ʀʦ,ʠʦ,ʒˎ,ʈ˥,ʶʿ,ʋˈ,ʓˇ,ʵˬ,ʍ˝,ʸ˖,ʊˌ,ʫ˄,ʙˑ,ʱ˛,ʣʾ,ʹ˔,ʅ˧,ʰˀ,ʘ˘,ʧ˟,ʯˉ,ʲˍ,ʔ˕,ʟ˦,ʮ˒,ʉˋ,ʐˢ,ʬʼ,ʏ˅,ʗ˜,ʖˤ,ʥˆ,ʆː,ʭ˪,ʡ˃,ʌ˱,ʺ˩,ʎˠ,ʤˏ,ʞ˂,ʜˡ,ʑˁ,ʚ˞,ʩ˫,ʄˊ,ʛ˰,ʨ˚,ʪˮ,ʝ˭,ʴ˯,ʇ˨,ʻ˗,ʕˣ,ʠ˲,ʢ˙,ʷ˓,ʦ˳,ʳʽ,ˎ,˥,ʿ,ˈ,ˇ,ˬ,˝,˖,ˌ,˄,ˑ,˛,ʾ,˔,˧,ˀ,˘,˟,ˉ,ˍ,˕,˦,˒,ˋ,ˢ,ʼ,˅,˜,ˤ,ˆ,ː,˪,˃,˱,˩,ˠ,ˏ,˂,ˡ,ˁ,˞,˫,ˊ,˰,˚,ˮ,˭,˯,˨,˗,ˣ,˲,˙,˓,˳,ʽ->'

tensor_list = expression.split(',')
tensor_list[-1] = tensor_list[-1][:-2]  # strip the trailing '->' from the last term
operands = [np.zeros([2] * len(tensor)) for tensor in tensor_list]

threads = 1 # This is the only difference when I change the number of threads
network = Network(expression, *operands)
# Note: optimizer_config_ptr and _set_opt_config_option are internal Network
# attributes, used here to set the SIMPLIFICATION_DISABLE_DR config attribute.
network.optimizer_config_ptr = cutn.create_contraction_optimizer_config(network.handle)
network._set_opt_config_option('SIMPLIFICATION_DISABLE_DR', cutn.ContractionOptimizerConfigAttribute.SIMPLIFICATION_DISABLE_DR, 1)
optimizer_options = configuration.OptimizerOptions(samples=64, threads=threads)
print('start')
start = time.time()
path, info = network.contract_path(optimize=optimizer_options)
print(f'Time: {time.time() - start}. Cost: {info.opt_cost}')

haidarazzam commented 7 months ago

Scalability over the number of threads depends on the network and on the number of samples. The time per sample varies a lot: when the number of samples is small (comparable to the number of threads), a particular thread might end up with 2-3 slow samples, which limits scalability. Small networks are so fast that most of the time is spent in threading lock/unlock. Overall, for networks smaller than ~1000, 4 to 8 threads provide the best timing; increasing the number of threads further can slow things down (a combination of lock/unlock overhead and an unlucky thread getting the slowest samples). It is on our TODO list to recheck this issue in depth and try to provide a more scalable solution.
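
The load-imbalance effect described here (one unlucky thread drawing a few slow samples) can be illustrated with a toy simulation. The heavy-tailed per-sample costs below are hypothetical, not measurements from cutensornet:

```python
import random

# Toy model: 64 hyperoptimizer samples with heavy-tailed per-sample cost,
# assigned round-robin to n threads; wall time is the busiest thread (makespan).
random.seed(0)
samples = [random.expovariate(1.0) ** 2 for _ in range(64)]  # heavy-tailed costs

def makespan(n_threads):
    buckets = [0.0] * n_threads
    for i, cost in enumerate(samples):
        buckets[i % n_threads] += cost
    return max(buckets)

serial = sum(samples)
for n in (1, 8, 32, 64):
    print(f"{n:2d} threads: ideal speedup {n:2d}x, actual {serial / makespan(n):.2f}x")
```

With 64 threads and 64 samples, each thread gets exactly one sample, so the wall time is pinned to the single slowest sample and the achievable speedup is far below 64x, matching the behavior reported in the issue.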