[Question] How does cuTensorNet behave when `CONFIG_NUM_HYPER_SAMPLES` uses its default value (SamplerAttribute)?

PabloAndresCQ commented 3 months ago

Hi! I've been doing some experiments with some rather large circuits, trying to see how far we can push contraction-path optimisation. We are using the sampler_sample API, essentially reproducing this example. We are keeping track of the memory required by each contraction path by setting the environment value CUTENSORNET_LOG_LEVEL=6 and having a look at the logs (particularly, the lines with worksizeNeeded).

At first, we tried setting no value to CONFIG_NUM_HYPER_SAMPLES and we saw that worksizeNeeded monotonically decreases until the optimisation decides to stop. We wanted to provide more time for the optimiser to try and find better contraction paths, so we set CONFIG_NUM_HYPER_SAMPLES=100, but then the worksizeNeeded reported no longer decreased monotonically, but fluctuated across the 100 samples. In the end, the CONFIG_NUM_HYPER_SAMPLES=100 run took way longer, but it did find a worksizeNeeded somewhat lower than the default (a bit smaller than a half).

I'm attaching the two logs, showing only lineas with "worksizeNeeded" via grep "worksizeNeeded" log.txt. The _100 log corresponds to that number of samples, "_0" is for the default one. We're talking about petabytes of worksize needed here -- as I said, we are limit testing. worksizeNeeded_0.log worksizeNeeded_100.log

I would like to know a couple of things:

What is the optimiser doing when CONFIG_NUM_HYPER_SAMPLES is left to its default value.
- In particular, how do you decide to stop?
- Is the monotonic decrease shown in the logs just because you do not report samples that increase the worksizeNeeded, or is it using an optimisation algorithm that guarantees no sample with larger worksizeNeeded is explored?
Can I extend the time I leave the optimising runner for, while still using the same policy as when leaving CONFIG_NUM_HYPER_SAMPLES to default (assuming it's actually different)?
What is the deal with the worksizeNeeded=0 lines in the log? Are these samples that somehow failed and I should read that 0 as NaN?

Cheers!

EDIT: I forgot to mention, we were using cuQuantum 24.03 here.

haidarazzam commented 3 months ago

Hi, Thank you very much for reporting your observations, we will be happy to work with you resolving the issue. First, the LOGGING print any call to workspace_needed and it depends where in the optimization process this call happen the workspace can be decreasing or increasing ( I need to the full LOGs and even with it it might be hard to know). However a better path doesn't means a less expensive workspace (usually it is the case because cost and memory are correlated) but it is not a rule of thumb.

Second, let me explain some basic of the path optimizer. The number of hypersamples ( CUTENSORNET_SAMPLER_CONFIG_NUM_HYPER_SAMPLES ) specify to the path_optimizer how many paths it "can" find in order to pick the optimal one (optimal in term of computation). However, the path_optimizer has also a smart optimization turned ON by default which check the time spent in the path_optimizer and compare it to the time estimation for the computation and can stop the path_optimizer if it find that the contraction is cheap and there is no meaning to spend more time finding a better path. For example for circuits of small cost (let say a circuit that need microseconds or millisecond of computation) it will automatically stop the path_optimizer (despite setting CUTENSORNET_SAMPLER_CONFIG_NUM_HYPER_SAMPLES) if the path_optimizer time is grater than the estimate computation time (+/- certain ratio). if the circuit cost is large enough, the smart optimization has no effect.

As a summary:

More hypersamples lead in general to better path or more optimal path in particular if network is large. We have other parameters that could affect the quality of the path but it will complicate the discussion to explain all of them and also some are not directly exposed to users.
smart optimization could stop the pathfinder earlier before completing all the CUTENSORNET_SAMPLER_CONFIG_NUM_HYPER_SAMPLES but only for network with low computational cost.
a more optimal path doesn't means workspace is less because it might have different contraction shape and thus might need different storage. Most of the time cost and workspace are correlated thus cheaper path requires less memory but again it is not a rule guaranteed. Thus workspace requirement can vary from path to another up or down. If you are limited by memory, then you might set a memory limit.
In order to give you more insight, I would like to have the full LOGs
Most important, I would encourage you to dump the network pattern using the environment variable CUTENSORNET_DUMPNET_PATH=/path/to/folder/and/filename to dump the network shape and send us the txt file so we can debug it using our internal API
worksizeNeeded=0 doesn't means much with the provided logs, again it depends where in the log it has been printed, thus the need of the full log.

Looking forward to hearing from you soon,

PabloAndresCQ commented 3 months ago

Thanks for the details! I am attaching the full log of the case where no value for CONFIG_NUM_HYPER_SAMPLES was provided. I compressed it using 7zip, you might need 7zip to unzip it. The other log is too large to attach. log.zip

My remaining questions are:

What is the optimiser doing when CONFIG_NUM_HYPER_SAMPLES is left to its default value?
- In particular, how come the worksizeNeeded decreases monotonically, unlike when CONFIG_NUM_HYPER_SAMPLES=100?
Can I extend the time I let the optimiser run for, while still using the same policy as when leaving CONFIG_NUM_HYPER_SAMPLES to default (assuming it's actually different)?
What is the deal with the worksizeNeeded=0 lines in the log? This is just curiosity, if it's hard to interpret, I don't need to know.

haidarazzam commented 3 months ago

your problem seems to be very large, I can see it requires workspace ranging in the Exa-bytes.

What is the optimiser doing when CONFIG_NUM_HYPER_SAMPLES is left to its default value? In particular, how come the worksizeNeeded decreases monotonically, unlike when CONFIG_NUM_HYPER_SAMPLES=100?

Again the workspace decreasing is unrelated to hyper_samples. Within one sample, if workspace needed is larger than the available memory, then the pathfinder code will automatically try to slice the network to decrease workspace and thus you might see a monotonically decreasing workspace. Note that, when a new hyper sample start, everything is restarted.

Can I extend the time I let the optimiser run for, while still using the same policy as when leaving CONFIG_NUM_HYPER_SAMPLES to default (assuming it's actually different)?

increasing CONFIG_NUM_HYPER_SAMPLES will let the optimizer run longer

What is the deal with the worksizeNeeded=0 lines in the log? This is just curiosity, if it's hard to interpret, I don't need to know.

if the contraction cannot be executed using cuTENSOR (there is many reasons this can happen, for example due to a tensor with large number of modes > 64 ), then the workspace returned is 0 and the optimizer code will iterate and slice trying to decrease it.

The easy way to check further is to have the network pattern printed using the CUTENSORNET_DUMPNET_PATH=/path/to/folder/and/filename

PabloAndresCQ commented 3 months ago

Thanks, that was helpful. I was not aware that cuTensorNet would do slicing even when using a single GPU (I thought this was only used when parallelisation was enabled). Just to confirm, you are referring to this notion of slicing, right?

And just to confirm, from your reply I am inferring that the default value of CONFIG_NUM_HYPER_SAMPLES is 1, correct?

Indeed, we know that our problem is very large, we were limit testing. Once we saw the logs it was clear to us that these circuits were too large to be simulated with this approach, but we wanted to properly understand what the logs were displaying.

haidarazzam commented 3 months ago

yes, if contraction doesn't fit one GPU, then cuTensorNet will slice it to make it fit in 1 GPU.

Similarly for multi node multi GPU, the slicing is the techniques used to distributed workload as well as to be sure each workload fit into the GPU.

NVIDIA / cuQuantum

[Question] How does cuTensorNet behave when `CONFIG_NUM_HYPER_SAMPLES` uses its default value (SamplerAttribute)? #153