Inconsistent matrix indexing inside/outside a kernel

Bug Description

Constructing a matrix from vectors appears to give a different matrix depending on whether the code runs inside a kernel or in Python scope.

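import warp as wp

wp.init()

# Python scope: build a 3x3 matrix from three constant vectors
a = wp.mat33(wp.vec3(1.0), wp.vec3(2.0), wp.vec3(3.0))
print(a[0])

@wp.kernel
def test():
    # Kernel scope: the same constructor call
    A = wp.mat33(wp.vec3(1.0), wp.vec3(2.0), wp.vec3(3.0))
    print(A[0])

wp.launch(test, dim=1)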

The first print gives (1, 1, 1) but the second gives (1, 2, 3). In Python scope the matrix a is built with the vectors as its rows, while inside a kernel the matrix A is built with the same vectors as its columns. It would be better if both followed the same rule. Output:

Warp 1.3.0 initialized:
   CUDA Toolkit 12.5, Driver 12.2
   Devices:
     "cpu"      : "Intel64 Family 6 Model 151 Stepping 2, GenuineIntel"
     "cuda:0"   : "NVIDIA GeForce RTX 3060" (12 GiB, sm_86, mempool enabled)
[1.0, 1.0, 1.0]
Module __main__ c767a68 load on device 'cuda:0' took 0.96 ms  (cached)
1 2 3
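
To make the row/column difference concrete, here is a minimal plain-Python sketch that does not call Warp at all. The helpers from_rows and from_cols are hypothetical names used only for illustration; they are not part of the Warp API. They assemble the same three vectors as rows and as columns, and indexing row 0 of each result reproduces the two prints above.

```python
# Plain-Python illustration (no Warp calls): the same three vectors
# assembled as rows versus as columns of a 3x3 matrix.
vectors = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]


def from_rows(vs):
    # Hypothetical helper: each input vector becomes one row.
    return [list(v) for v in vs]


def from_cols(vs):
    # Hypothetical helper: each input vector becomes one column,
    # which is the transpose of from_rows(vs).
    return [list(row) for row in zip(*vs)]


rows_first = from_rows(vectors)
cols_first = from_cols(vectors)

print(rows_first[0])  # [1.0, 1.0, 1.0], like the Python-scope print
print(cols_first[0])  # [1.0, 2.0, 3.0], like the in-kernel print
```

The two results are transposes of each other, so code that relies on one convention will silently read the wrong elements under the other.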

System Information

No response

NVIDIA / warp

Inconsistent matrix indexing inside/outside a kernel #278