NVIDIA / cuda-quantum

C++ and Python support for the CUDA Quantum programming model for heterogeneous quantum-classical workflows
https://nvidia.github.io/cuda-quantum/

[rt] Inconsistent behavior of `cudaq::sample` when using different modes (library mode vs non-library mode) #978

Open boschmitt opened 8 months ago

boschmitt commented 8 months ago


Describe the bug

Our compiler, nvq++, is capable of compiling CUDA Quantum code in two modes:

  1. Library mode: in this case the compiler simply falls back to the host compiler.
  2. Non-library mode (for lack of a better name): in this case, kernel code passes through our MLIR optimization pipeline.

The results we get from cudaq::sample are inconsistent when compiling the same program in different modes.

Steps to reproduce the bug

Take the following kernel as an example:

#include <cudaq.h>
#include <iostream>

void foo() __qpu__ {
  cudaq::qubit a;
  cudaq::qubit b;
  x(a);
  mz(b);
  mz(a);
}

void bar() __qpu__ {
  cudaq::qubit a;
  cudaq::qubit b;
  x(a);
  auto ret_b = mz(b);
  auto ret_a = mz(a);
}

int main() {
  std::cout << "Foo:\n";
  auto result = cudaq::sample(foo);
  result.dump();

  std::cout << "Bar:\n";
  result = cudaq::sample(bar);
  result.dump();

  return 0;
}

When compiled with nvq++ foo_bar.cpp -o a, we get the following result:

Foo:
{ 01:1000 }
Bar:
{ 01:1000 }

When compiled with nvq++ --target quantinuum --emulate foo_bar.cpp -o a, we get the following result:

Foo:
{ 01:1000 }
Bar:
{
  ret_a : { 1:1000 }
   ret_b : { 0:1000 }
   __global__ : { 10:1000 }
}

Expected behavior

Getting the same result in all calls:

Foo:
{ 01:1000 }
Bar:
{ 01:1000 }

While non-library mode allows us more control over what we can report back to the user, functionality that is allowed in both modes must have the same observable result.

These results are even more confusing because __global__ seems to report the state of the qubits in the order of their allocation, while the other registers report results in the order of measurement.

Is this a regression? If it is, put the last known working version (or commit) here.

Not a regression


Suggestions

I suggest creating a new API for returning the results with variable names and __global__. This API should not be available when compiling in library mode. We could create a macro such as CUDAQ_LIBRARY_MODE to control what is and is not available in library mode.

bmhowe23 commented 8 months ago

Thanks for capturing all this @boschmitt.

I suggest creating a new API for returning the results with variable names and __global__.

How would that differ from the current API that allows the user to return the results of particular variable names and __global__? (I.e. to_map(registerName), most_probable(registerName), get_marginal(registerName), etc.)
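
For reference, a short sketch of how those existing accessors could be used on the Bar result above (register names assumed to match the measured variable names, with __global__ as the synthesized global register):

auto result = cudaq::sample(bar);
auto counts = result.to_map("ret_a");        // histogram for one named register
auto best = result.most_probable("ret_a");   // most likely bitstring for that register
auto global = result.to_map("__global__");   // the synthesized global register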

Another option would be to a) require library-mode programs to use the existing mz(qubits, "name") API, and b) provide a new API like makeGlobalRegister("name1,name2,name3") to allow the user to synthesize a global register according to an order that they specify. If any one of those names (i.e. name2) were a vector, we would need to specify what order the elements for that one segment would be in the final output string (or let the user specify).
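
Hypothetical usage of such an API, for illustration only (makeGlobalRegister does not exist today; the name and register list follow the proposal above):

auto result = cudaq::sample(bar);
// The user specifies the register order used to synthesize __global__.
result.makeGlobalRegister("ret_a,ret_b");
result.dump();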

FWIW, CircuitSimulator.h has some code comments alluding to b):

      // Form the global register from a combination of the sorted register
      // names. In the future, we may want to let the user customize

In any case, I would support disabling the mz(qubits) API in library mode (the one with no name supplied).

boschmitt commented 8 months ago

How would that differ from the current API that allows the user to return the results of particular variable names and __global__? (I.e. to_map(registerName), most_probable(registerName), get_marginal(registerName), etc.)

It doesn't. I should have made my suggestion clearer. The focus of this issue is to make the behavior consistent. The issue with __global__ was an afterthought. Thus, I'm basically suggesting that cudaq::sample always return the same thing for the examples above:

 Foo:
{ 01:1000 }
Bar:
{ 01:1000 }

For the other format of result, we should create a new function, say, cudaq::fancy_sample, that would be a compilation error in library mode and would return the following when compiling in non-library mode:

Foo:
{
   __global__ : { 10:1000 }
}
Bar:
{
  ret_a : { 1:1000 }
   ret_b : { 0:1000 }
   __global__ : { 10:1000 }
}
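
A rough sketch of how that guard could look (cudaq::fancy_sample is just the placeholder name used above, and CUDAQ_LIBRARY_MODE would be a macro defined by nvq++ in library mode; neither exists today):

#include <cudaq.h>

#ifndef CUDAQ_LIBRARY_MODE
namespace cudaq {
// Only declared when compiling in non-library mode, so library-mode
// programs that call it fail to compile.
template <typename KernelT, typename... Args>
sample_result fancy_sample(KernelT &&kernel, Args &&...args);
} // namespace cudaq
#endif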

Another option would be to a) require library-mode programs to use the existing mz(qubits, "name") API, and b) provide a new API like makeGlobalRegister("name1,name2,name3") to allow the user to synthesize a global register according to an order that they specify. If any one of those names (i.e. name2) were a vector, we would need to specify what order the elements for that one segment would be in the final output string (or let the user specify).

Good suggestion; in library mode, mz could be a macro that takes the form mz(qubit, name). This macro could expand to something that both declares a variable and passes the name along to the execution manager:

void bar() __qpu__ {
  cudaq::qubit a;
  cudaq::qubit b;
  x(a);
  mz(b, ret_b); // declare ret_b
  mz(a, ret_a); // declare ret_a
  if (ret_a && ret_b) // variable can be used afterwards
    ...
}
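
For illustration, one way such a macro could be written, relying on the existing mz(qubit, "name") API mentioned above (spelled here as CUDAQ_MZ_NAMED rather than overriding mz itself, just to keep the sketch simple):

// Declares the variable and forwards its name as the register name.
#define CUDAQ_MZ_NAMED(qubit, var) auto var = mz(qubit, #var)

void bar() __qpu__ {
  cudaq::qubit a, b;
  x(a);
  CUDAQ_MZ_NAMED(b, ret_b); // expands to: auto ret_b = mz(b, "ret_b");
  CUDAQ_MZ_NAMED(a, ret_a); // expands to: auto ret_a = mz(a, "ret_a");
}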

This macro could expand to a generic name plus a line number in case a name is not given. Anyway, we should open another issue to discuss possibilities here, and maybe yet another issue to deal with __global__.

bmhowe23 commented 8 months ago

I should have made my suggestion clearer. The focus of this issue is to make the behavior consistent. The issue with __global__ was an afterthought. Thus, I'm basically suggesting that cudaq::sample always return the same thing for the examples above.

It might be worth spelling out some additional, more complex examples, too. I'd suggest considering something like the following as well (unless it violates some part of our language spec, in which case we should throw an error):

struct kernel {
  void operator()() __qpu__ {
    cudaq::qubit q1;
    cudaq::qubit q2;
    cudaq::qubit q0;
    h(q0);
    if (mz(q0)) {
      h(q1);
      mz(q1);
    }
    x(q2);
    mz(q2);
  }
};
amccaskey commented 8 months ago

I am not in favor of disabling mz(q) in library mode. There are many use cases where the user cares what measurements are applied but does not care what the name of the register is. I would be in favor of some sort of preprocessor that translates all auto r = mz(q) patterns to auto r = mz(q, "r") before compilation occurs. This differs from the macro example above in that the user does not have to specify the register argument in the mz call, but implicitly specifies the register name via the variable name they write (which is what we get in MLIR mode). This would resolve the output discrepancies (and remove the need for the auto_register_IDX variable names we generate).
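
To make the intended effect of that rewrite concrete (this shows only the result of the proposed source translation, not the preprocessor itself; the kernel name is made up for illustration):

void example() __qpu__ {
  cudaq::qubit a;
  x(a);
  // As written by the user in library mode:  auto ret_a = mz(a);
  // After the proposed rewrite, the variable name becomes the register name:
  auto ret_a = mz(a, "ret_a");
}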

amccaskey commented 8 months ago

Here's an example of a simple library-mode source preprocessor to fix the discrepancy described in this issue: https://github.com/NVIDIA/cuda-quantum/compare/main...amccaskey:cuda-quantum:nvqppPreprocessor

bmhowe23 commented 8 months ago

Thanks, @amccaskey. I think the issue is bigger than the one pair of examples provided in this issue. For example, the following code will cause similar issues that cannot be resolved purely by translating auto r = mz(q) patterns into auto r = mz(q,"r").

#include <cudaq.h>
#include <iostream>

void foo() __qpu__ {
  cudaq::qubit a, b;
  cudaq::x(a);
  auto ma1 = mz(a); // 1st measurement of qubit a
  auto ma2 = mz(a); // 2nd measurement of qubit a
  auto mb = mz(b);
}

int main() {
  std::cout << "Foo:\n";
  auto result = cudaq::sample(foo);
  result.dump();

  return 0;
}

When running w/ main plus #997 and #998, library mode returns this:

Foo:
{ 10:1000 }

Whereas compiling with nvq++ --target quantinuum --emulate returns this (again, using main + #997 and #998):

Foo:
{ 
  __global__ : { 110:1000 }         <-- Note the extra bit in there (110 vs 10)
   ma1 : { 1:1000 }
   ma2 : { 1:1000 }
   mb : { 0:1000 }
}

The reason for the difference may already be apparent, but it becomes even more apparent when you think about it from the perspective of the generated QIR and how the runtime treats the backend/simulator results (more below):

@cstr.6D613100 = private constant [4 x i8] c"ma1\00"
@cstr.6D613200 = private constant [4 x i8] c"ma2\00"
@cstr.6D6200 = private constant [3 x i8] c"mb\00"

; <snip>

define void @__nvqpp__mlirgen__function_foo._Z3foov() local_unnamed_addr #1 {
"0":
  tail call void @__quantum__qis__x__body(%Qubit* null)
  tail call void @__quantum__qis__mz__body(%Qubit* null, %Result* writeonly null)
  tail call void @__quantum__qis__mz__body(%Qubit* null, %Result* nonnull writeonly inttoptr (i64 1 to %Result*))
  tail call void @__quantum__qis__mz__body(%Qubit* nonnull inttoptr (i64 1 to %Qubit*), %Result* nonnull writeonly inttoptr (i64 2 to %Result*))
  tail call void @__quantum__rt__result_record_output(%Result* null, i8* nonnull getelementptr inbounds ([4 x i8], [4 x i8]* @cstr.6D613100, i64 0, i64 0))
  tail call void @__quantum__rt__result_record_output(%Result* nonnull inttoptr (i64 1 to %Result*), i8* nonnull getelementptr inbounds ([4 x i8], [4 x i8]* @cstr.6D613200, i64 0, i64 0))
  tail call void @__quantum__rt__result_record_output(%Result* nonnull inttoptr (i64 2 to %Result*), i8* nonnull getelementptr inbounds ([3 x i8], [3 x i8]* @cstr.6D6200, i64 0, i64 0))
  ret void
}

A backend (including our local simulators) will accordingly produce a CountsDictionary with one entry for each %Result:

{ 
   ma1 : { 1:1000 }
   ma2 : { 1:1000 }
   mb : { 0:1000 }
}

What does the __global__ register mean in QIR programs?

There is initially no __global__ register in results generated by QIR programs. The runtime has to construct one that matches what library mode does in cudaq::sample kernels. (This is the root cause of most of our headaches.)

I think we need to decide which of the above 2 bullets is preferred (or entertain other options) before deciding on a path forward.
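
For concreteness, here is a rough sketch (not the actual runtime code) of the kind of synthesis described above: for each shot, the bit recorded in each named register is concatenated in sorted register-name order (per the CircuitSimulator.h comment quoted earlier), which would explain how the single-bit registers ma1, ma2, and mb combine into the 3-bit string 110.

#include <cstddef>
#include <map>
#include <string>
#include <vector>

// perRegisterShots maps a register name to its per-shot bit strings.
// std::map iterates in sorted key order, mirroring the "sorted register
// names" behavior alluded to in CircuitSimulator.h.
std::vector<std::string>
synthesizeGlobal(const std::map<std::string, std::vector<std::string>> &perRegisterShots,
                 std::size_t numShots) {
  std::vector<std::string> global(numShots);
  for (const auto &[name, shots] : perRegisterShots)
    for (std::size_t i = 0; i < numShots; ++i)
      global[i] += shots[i];
  return global;
}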

More rambling

Stepping back a bit, I think our library-mode vs. non-library-mode discrepancies w.r.t. cudaq::sample ultimately stem from cudaq::sample being treated like a circuit in library mode (e.g., evolve the state vector once and generate all the shots from the final state vector), whereas non-library mode deals with the fully expressive possibilities of MLIR and QIR (e.g., reusing qubits and returning multiple %Results). I think we need to try to reconcile those. This is where allowing a fully expressive, user-defined return value from cudaq::sample kernels would go a long way toward resolving the current ambiguity of the __global__ register.

And even more rambling

There are many use cases where the user cares what measurements are applied but do not care what the name of the register is.

I'm concerned that a user won't be able to use the result if they can't find it. If they are lucky enough to find it one time (e.g., perhaps this time it was indices 5-8 of the __global__ bitstring), then I'm concerned about what would happen when the kernel gets updated, especially if they don't have access to the kernel source code. For example, if the __global__ register is ordered based on the order of the mz() calls (as it currently is in library mode), and the kernel developer inserts an ancilla qubit into their library for some reason, the bits that the downstream user cared about may move to indices 6-9 because a new measurement was inserted near the beginning of the kernel, and it would be non-trivial for them to detect that and update their downstream code.
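
For example, a downstream consumer might hard-code bit positions like this (just a sketch; someKernel stands in for an arbitrary kernel they do not control, and 5-8 are the indices mentioned above):

auto result = cudaq::sample(someKernel);
// Hard-coded positions into the __global__ bitstring; these silently break
// if the kernel author adds a measurement earlier in the kernel.
auto bits = result.most_probable(); // most probable __global__ bitstring
auto myBits = bits.substr(5, 4);    // the 4 bits at indices 5-8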

amccaskey commented 7 months ago

If __global__ in your example is producing '110' for the remote REST QPU, then that is an oversight on my part. This is not what was intended for __global__. This classical register is meant to describe the final measured state of all measured qubits (the classical register size should equal the number of qubits being measured). It is not meant to collect measurements and thereby produce a register larger than the number of qubits allocated in the kernel. What you are describing was more the intention of sample_result::sequential_data (though perhaps it needs some updating / refactoring).

bmhowe23 commented 7 months ago

If __global__ in your example is producing '110' for the remote REST QPU, then that is an oversight on my part. This is not what was intended for __global__. This classical register is meant to describe the final measured state of all measured qubits (the classical register size should equal the number of qubits being measured). It is not meant to collect measurements and thereby produce a register larger than the number of qubits allocated in the kernel. What you are describing was more the intention of sample_result::sequential_data (though perhaps it needs some updating / refactoring).

OK, thanks for the clarifications. I think the runtime's synthesis of the __global__ register needs to be updated to form the global register differently than it currently does. Here are a couple of spots that will need updates (certainly not an exhaustive list):

What you are describing was more the intention of sample_result::sequential_data (though perhaps it needs some updating / refactor).

Yeah, we may need to update and clarify that. I had always assumed that the global register was just a histogram of the sequential data, but it sounds like that isn't exactly correct.

bmhowe23 commented 7 months ago

Note that if you don't specify any measurements in your kernel, the current behavior of cudaq::sample is that the compiler will implicitly add those measurements for you at the end of the kernel. If you allow the nvq++ compiler to perform passes that introduce ancilla qubits into your kernel (as is done in the multicontrol-decomposition pass), those results would contain measurements coming from the additional ancilla qubits. How should this be handled? (If desired, we could open a separate issue for that.)
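
For reference, a minimal sketch of the case described above, i.e. a kernel with no explicit measurements (the kernel name is made up for illustration):

void no_measurements() __qpu__ {
  cudaq::qubit a, b;
  h(a);
  x<cudaq::ctrl>(a, b);
  // No mz calls here: cudaq::sample implicitly measures every qubit at the
  // end, so the reported bitstrings cover a and b, and would also cover any
  // ancilla a compiler pass introduces.
}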