`state_compute()` causes the kernel to die

Hi,

I am trying to use the high-level state API to compute a quantum state. I started from the code in https://github.com/NVIDIA/cuQuantum/blob/main/python/samples/cutensornet/high_level/expectation_example.py, but instead of building an operator and computing an expectation value, I try to compute the state itself. Every time I run the following script, the kernel dies (this is on Perlmutter, and the same setup works for other cuTensorNet tasks):

import cupy as cp
import numpy as np

import cuquantum
from cuquantum import cutensornet as cutn

dev = cp.cuda.Device()  # get current device

num_qubits = 16
dim = 2
qubits_dims = (dim, ) * num_qubits

handle = cutn.create()
stream = cp.cuda.Stream()
data_type = cuquantum.cudaDataType.CUDA_C_64F

# Define quantum gate tensors on device
gate_h = 2**-0.5 * cp.asarray([[1,1], [1,-1]], dtype='complex128', order='F')
gate_h_strides = 0

gate_cx = cp.asarray([[1, 0, 0, 0],
                      [0, 1, 0, 0],
                      [0, 0, 0, 1],
                      [0, 0, 1, 0]], dtype='complex128').reshape(2,2,2,2, order='F')
gate_cx_strides = 0

free_mem = dev.mem_info[0]
scratch_size = free_mem // 2
scratch_space = cp.cuda.alloc(scratch_size)

# Create the initial quantum state
quantum_state = cutn.create_state(handle, cutn.StatePurity.PURE, num_qubits, qubits_dims, data_type)
print("Created the initial quantum state")

# Construct the quantum circuit state with gate application
tensor_id = cutn.state_apply_tensor(
        handle, quantum_state, 1, (0, ), 
        gate_h.data.ptr, gate_h_strides, 1, 0, 1)

for i in range(1, num_qubits):
    tensor_id = cutn.state_apply_tensor(
        handle, quantum_state, 2, (i-1, i),  # target on i-1 while control on i
        gate_cx.data.ptr, gate_cx_strides, 1, 0, 1)
print("Quantum gates applied")

# Configure the quantum circuit state computation
num_hyper_samples_dtype = cutn.state_get_attribute_dtype(cutn.StateAttribute.NUM_HYPER_SAMPLES)
num_hyper_samples = np.asarray(8, dtype=num_hyper_samples_dtype)
cutn.state_configure(handle, quantum_state,
    cutn.StateAttribute.NUM_HYPER_SAMPLES,
    num_hyper_samples.ctypes.data, num_hyper_samples.dtype.itemsize)

# Prepare the computation of the specified quantum circuit state
work_desc = cutn.create_workspace_descriptor(handle)
cutn.state_prepare(handle, quantum_state, scratch_size, work_desc, stream.ptr)
print("Prepared the computation of the specified quantum circuit state")

workspace_size_d = cutn.workspace_get_memory_size(handle, 
    work_desc, cutn.WorksizePref.RECOMMENDED, cutn.Memspace.DEVICE, cutn.WorkspaceKind.SCRATCH)

if workspace_size_d <= scratch_size:
    cutn.workspace_set_memory(handle, work_desc, cutn.Memspace.DEVICE, cutn.WorkspaceKind.SCRATCH, scratch_space.ptr, workspace_size_d)
else:
    print("Error:Insufficient workspace size on Device")
    cutn.destroy_workspace_descriptor(work_desc)
    cutn.destroy_state(quantum_state)
    cutn.destroy(handle)
    del scratch_space  # the buffer allocated above is named scratch_space
    print("Free resource and exit.")
    raise SystemExit(1)

state_vector = np.empty(pow(16, 2), dtype="complex128")
cutn.state_compute(
            handle,
            quantum_state,
            work_desc,
            state_vector.ctypes.data,
            stream.ptr,
        )

The only parts that differ from the linked example are the last two statements: the `state_vector` allocation and the `state_compute()` call. Is something wrong with the `state_vector` allocation?
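One thing that can be checked without a GPU is the size of the output buffer. The sketch below assumes that the output of `state_compute()` is the full state vector, with one `complex128` amplitude per computational basis state. That assumption, and the names `num_amplitudes`, `allocated` and `BYTES_PER_COMPLEX128`, are introduced here for illustration only and are not taken from the cuTensorNet documentation.

```python
# Compare the number of amplitudes in a full state vector with the
# number of elements allocated in the script above.
num_qubits = 16
dim = 2

# Assumption: one amplitude per basis state, i.e. dim ** num_qubits entries.
num_amplitudes = dim ** num_qubits

# Size used in the script: np.empty(pow(16, 2), ...)
allocated = pow(16, 2)

BYTES_PER_COMPLEX128 = 16  # two 64-bit floats per amplitude

print("amplitudes expected :", num_amplitudes)
print("elements allocated  :", allocated)
print("bytes expected      :", num_amplitudes * BYTES_PER_COMPLEX128)
print("bytes allocated     :", allocated * BYTES_PER_COMPLEX128)
```

Under that assumption the two sizes do not match, since `pow(16, 2)` and `2 ** 16` are different numbers. The script also allocates `state_vector` with NumPy, so it lives in host memory, while the gate tensors are CuPy arrays on the device; whether `state_compute()` accepts a host pointer here is a separate point that would need confirming against the API reference.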

If this is helpful, here is the logger output I am getting before the crash:

[2024-02-22 08:59:35][cuTensorNet][362076][Api][cutensornetGetOutputStateDetails] handle=0X55CD4EE7E3C0 tensorNetworkState=0X55CD4FFF5C00 numTensorsOut=0X7FFEFCE2B3FC numModesOut=0X0 extentsOut=0X0 stridesOut=0X0
[2024-02-22 08:59:35][cuTensorNet][362076][Api][cutensornetGetOutputStateDetails] handle=0X55CD4EE7E3C0 tensorNetworkState=0X55CD4FFF5C00 numTensorsOut=0X7FFEFCE2B3FC numModesOut=0X55CD4E1F0070 extentsOut=0X0 stridesOut=0X0
[2024-02-22 08:59:35][cuTensorNet][362076][Api][cutensornetStateCompute] handle=0X55CD4EE7E3C0 tensorNetworkState=0X55CD4FFF5C00 workDesc=0X55CD4D9EF040, extentsOut=0X55CD5004E2C0 stridesOut=0X55CD5004E2E0 stateTensorsOut=0X55CD50A3D510 cudaStream=0X55CD4E2F2AE0
[2024-02-22 08:59:35][cuTensorNet][362076][Api][cutensornetContractSlices] handle=0X55CD4EE7E3C0 plan=0X55CD4EACE980 rawDataIn=0X55CD5026A9A0 rawDataOut=0X2000 accumulateOutput=0 workDesc=0X55CD4D9EF040 sliceGroup=0X0 stream=0X55CD4E2F2AE0
[2024-02-22 08:59:35][cuTensorNet][362076][Trace][cutensornetContractSlices] Provided scratchWorkspace=0X7F6766000000 scratchWorkspaceSize=17875456 cacheWorkspace=0X0 cacheWorkspaceSize=0

Could the problem be `cacheWorkspaceSize=0`?
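To make the numbers in the trace line easier to read, here is a small sketch that pulls the `name=value` fields out of it. The helper `parse_fields` is hypothetical and written only for this log format. It reports what was provided and does not by itself show whether a cache workspace is required.

```python
import re

# Trace line copied from the logger output above.
trace_line = (
    "[2024-02-22 08:59:35][cuTensorNet][362076][Trace][cutensornetContractSlices] "
    "Provided scratchWorkspace=0X7F6766000000 scratchWorkspaceSize=17875456 "
    "cacheWorkspace=0X0 cacheWorkspaceSize=0"
)


def parse_fields(line):
    """Return {name: int} for every name=value pair (hex or decimal) in a log line."""
    fields = {}
    for name, value in re.findall(r"(\w+)=(0X[0-9A-F]+|\d+)", line):
        # Hex values carry a 0X prefix in this log; everything else is decimal.
        fields[name] = int(value, 16) if value.startswith("0X") else int(value)
    return fields


fields = parse_fields(trace_line)
scratch_mib = fields["scratchWorkspaceSize"] / 2**20
print(f"scratch workspace: {scratch_mib:.2f} MiB")
print("cache workspace provided:", fields["cacheWorkspace"] != 0)
```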

Many thanks!
