Open PabloAndresCQ opened 2 months ago
Hi, thank you very much for reporting your observations; we will be happy to work with you to resolve the issue. First, the logging prints every call to workspace_needed, and depending on where in the optimization process that call happens, the workspace can be decreasing or increasing (I would need the full logs, and even with them it might be hard to tell). Also note that a better path does not necessarily mean a less expensive workspace; it usually does, because cost and memory are correlated, but it is not a rule of thumb.
Second, let me explain some basics of the path optimizer.
The number of hyper-samples (CUTENSORNET_SAMPLER_CONFIG_NUM_HYPER_SAMPLES) tells the path optimizer how many candidate paths it "can" generate in order to pick the optimal one (optimal in terms of computation). However, the path optimizer also has a smart optimization, turned on by default, which compares the time spent in the path optimizer against the estimated computation time, and it can stop the path optimizer early if it finds that the contraction is cheap and there is no point in spending more time looking for a better path. For example, for circuits with a small cost (say, a circuit that needs microseconds or milliseconds of computation), it will automatically stop the path optimizer (despite CUTENSORNET_SAMPLER_CONFIG_NUM_HYPER_SAMPLES being set) once the path-optimizer time is greater than the estimated computation time (within a certain ratio). If the circuit cost is large enough, the smart optimization has no effect.
As a summary:
- Increasing CUTENSORNET_SAMPLER_CONFIG_NUM_HYPER_SAMPLES may not take full effect, but only for networks with low computational cost (because of the smart early stop described above).
- You can set CUTENSORNET_DUMPNET_PATH=/path/to/folder/and/filename to dump the network shape and send us the txt file so we can debug it using our internal API.

Looking forward to hearing from you soon,
Thanks for the details! I am attaching the full log of the case where no value for CONFIG_NUM_HYPER_SAMPLES was provided. I compressed it using 7zip; you might need 7zip to unzip it. The other log is too large to attach.
log.zip
My remaining questions are:
- What is the optimiser doing when CONFIG_NUM_HYPER_SAMPLES is left to its default value? In particular, how come the worksizeNeeded decreases monotonically, unlike when CONFIG_NUM_HYPER_SAMPLES=100?
- Can I extend the time I let the optimiser run for, while still using the same policy as when leaving CONFIG_NUM_HYPER_SAMPLES to default (assuming it's actually different)?
- What is the deal with the worksizeNeeded=0 lines in the log?

Your problem seems to be very large; I can see it requires workspace ranging in the exabytes.
What is the optimiser doing when CONFIG_NUM_HYPER_SAMPLES is left to its default value? In particular, how come the worksizeNeeded decreases monotonically, unlike when CONFIG_NUM_HYPER_SAMPLES=100?
Again, the workspace decreasing is unrelated to hyper_samples. Within one sample, if the workspace needed is larger than the available memory, then the pathfinder code will automatically try to slice the network to decrease the workspace, and thus you might see a monotonically decreasing workspace. Note that when a new hyper-sample starts, everything is restarted.
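As a side note, if you drive the lower-level pathfinder API directly instead of the high-level sampler, the slice count the optimizer settled on can be read back explicitly, which makes this slicing behaviour visible without parsing the log. A sketch, assuming `handle` and `optimizerInfo` already exist and cutensornetContractionOptimize has been called (error checking omitted):

```cpp
// Sketch: query slicing and cost estimates from the optimizer-info object.
#include <cstdint>
#include <cstdio>
#include <cutensornet.h>

void report_slicing(cutensornetHandle_t handle,
                    cutensornetContractionOptimizerInfo_t optimizerInfo)
{
    // Number of slices the pathfinder introduced to respect the memory limit.
    int64_t numSlices = 0;
    cutensornetContractionOptimizerInfoGetAttribute(
        handle, optimizerInfo,
        CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICES,
        &numSlices, sizeof(numSlices));

    // Estimated cost of the chosen contraction path.
    double flopCount = 0.0;
    cutensornetContractionOptimizerInfoGetAttribute(
        handle, optimizerInfo,
        CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT,
        &flopCount, sizeof(flopCount));

    // numSlices > 1 means the network was sliced so each piece fits in memory.
    std::printf("slices = %lld, estimated flops = %g\n",
                (long long)numSlices, flopCount);
}
```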
Can I extend the time I let the optimiser run for, while still using the same policy as when leaving CONFIG_NUM_HYPER_SAMPLES to default (assuming it's actually different)?
Increasing CONFIG_NUM_HYPER_SAMPLES will let the optimizer run longer.
What is the deal with the worksizeNeeded=0 lines in the log? This is just curiosity; if it's hard to interpret, I don't need to know.
If the contraction cannot be executed using cuTENSOR (there are many reasons this can happen, for example a tensor with a large number of modes, > 64), then the workspace returned is 0 and the optimizer code will iterate and slice, trying to decrease it.
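If you prefer not to parse the log, here is a sketch of querying the workspace the prepared sampler ends up asking for, assuming `workDesc` is the same workspace descriptor passed to the prepare call (error checking omitted):

```cpp
// Sketch: prepare the sampler under a workspace limit and read back the
// device scratch size the resulting plan requires.
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>
#include <cutensornet.h>

void prepare_and_report(cutensornetHandle_t handle,
                        cutensornetSampler_t sampler,
                        cutensornetWorkspaceDescriptor_t workDesc,
                        size_t maxDeviceWorkspace,
                        cudaStream_t stream)
{
    // Let the sampler find a contraction plan that respects the given limit.
    cutensornetSamplerPrepare(handle, sampler, maxDeviceWorkspace,
                              workDesc, stream);

    // Query the device scratch size the prepared plan needs.
    int64_t worksizeNeeded = 0;
    cutensornetWorkspaceGetMemorySize(handle, workDesc,
                                      CUTENSORNET_WORKSIZE_PREF_RECOMMENDED,
                                      CUTENSORNET_MEMSPACE_DEVICE,
                                      CUTENSORNET_WORKSPACE_SCRATCH,
                                      &worksizeNeeded);
    std::printf("device scratch workspace needed: %lld bytes\n",
                (long long)worksizeNeeded);
}
```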
The easy way to check further is to have the network pattern printed by setting CUTENSORNET_DUMPNET_PATH=/path/to/folder/and/filename.
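Both the log level and the dump path are ordinary environment variables that are normally exported in the shell before launching the application. Purely as a sketch (and assuming cuTensorNet reads them no earlier than the first library call), they can also be set from the program itself:

```cpp
// Sketch only: setting the environment variables programmatically before any
// cuTensorNet call; the usual route is exporting them in the shell instead.
#include <cstdlib>
#include <cutensornet.h>

int main()
{
    setenv("CUTENSORNET_LOG_LEVEL", "6", /*overwrite=*/1);                 // verbose logging
    setenv("CUTENSORNET_DUMPNET_PATH", "/path/to/folder/and/filename", 1); // placeholder path from above

    cutensornetHandle_t handle;
    cutensornetCreate(&handle);
    // ... build the tensor network, configure/prepare the sampler, sample ...
    cutensornetDestroy(handle);
    return 0;
}
```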
Thanks, that was helpful. I was not aware that cuTensorNet would do slicing even when using a single GPU (I thought this was only used when parallelisation was enabled). Just to confirm, you are referring to this notion of slicing, right?
And just to confirm, from your reply I am inferring that the default value of CONFIG_NUM_HYPER_SAMPLES is 1, correct?
Indeed, we know that our problem is very large; we were limit testing. Once we saw the logs, it was clear to us that these circuits were too large to be simulated with this approach, but we wanted to properly understand what the logs were displaying.
Yes, if the contraction doesn't fit in one GPU, then cuTensorNet will slice it to make it fit in one GPU.
Similarly, for multi-node multi-GPU runs, slicing is the technique used to distribute the workload as well as to make sure each piece of the workload fits into its GPU.
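For context, a sketch of how that distributed execution is switched on, assuming the library's MPI communicator wrapper is available (error checking omitted); once activated, the sliced contractions are spread over the participating processes:

```cpp
// Sketch: bind cuTensorNet's automatic parallelism to an MPI communicator.
// After this call, sliced work is distributed across the MPI processes.
#include <mpi.h>
#include <cutensornet.h>

void enable_distributed(cutensornetHandle_t handle)
{
    // Duplicate the communicator so the library can hold its own copy.
    MPI_Comm cutnComm;
    MPI_Comm_dup(MPI_COMM_WORLD, &cutnComm);
    cutensornetDistributedResetConfiguration(handle, &cutnComm, sizeof(cutnComm));
}
```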
Hi! I've been doing some experiments with some rather large circuits, trying to see how far we can push contraction-path optimisation. We are using the sampler_sample API, essentially reproducing this example. We are keeping track of the memory required by each contraction path by setting the environment variable CUTENSORNET_LOG_LEVEL=6 and having a look at the logs (particularly, the lines with worksizeNeeded).

At first, we tried setting no value for CONFIG_NUM_HYPER_SAMPLES and we saw that worksizeNeeded monotonically decreases until the optimisation decides to stop. We wanted to provide more time for the optimiser to try and find better contraction paths, so we set CONFIG_NUM_HYPER_SAMPLES=100, but then the worksizeNeeded reported no longer decreased monotonically, but fluctuated across the 100 samples. In the end, the CONFIG_NUM_HYPER_SAMPLES=100 run took way longer, but it did find a worksizeNeeded somewhat lower than the default (a bit smaller than a half).

I'm attaching the two logs, showing only lines with "worksizeNeeded" via grep "worksizeNeeded" log.txt. The _100 log corresponds to that number of samples, "_0" is for the default one. We're talking about petabytes of worksize needed here -- as I said, we are limit testing.
worksizeNeeded_0.log
worksizeNeeded_100.log

I would like to know a couple of things:
- What is the optimiser doing when CONFIG_NUM_HYPER_SAMPLES is left to its default value?
- Can I extend the time I let the optimiser run for, while still using the same policy as when leaving CONFIG_NUM_HYPER_SAMPLES to default (assuming it's actually different)?
- What is the deal with the worksizeNeeded=0 lines in the log? Are these samples that somehow failed and I should read that 0 as NaN?

Cheers!
EDIT: I forgot to mention, we were using cuQuantum 24.03 here.