Closed: dan-zheng closed this pull request 6 years ago.
Thanks Dan for the progress.
On Tue, Oct 9, 2018 at 1:38 PM Dan Zheng notifications@github.com wrote:
@dan-zheng commented on this pull request.
In src/main/scala/lantern/dslapi.scala https://github.com/feiwang3311/Lantern/pull/22#discussion_r223796893:
@@ -206,6 +169,21 @@ trait DslExp extends Dsl
+trait TensorOps extends DslOps {
+  object NewGPUArray {
+    // Allocate an array of the specified size on the GPU.
+    def apply[T: Manifest](scalarCount: Rep[Int]): Rep[Array[T]] = gpu_array_new(scalarCount)
+  }
+  def gpu_array_new[T: Manifest](scalarCount: Rep[Int]): Rep[Array[T]] = ???
I see.
The problem is that NewGPUArray is called within TensorDsl, which LanternDriverC mixes in (by way of LanternDriverC <- LanternDriver <- TensorDslExp <- TensorDsl).
We could make a TensorDslCuda trait that extends TensorDsl and mixes in "GPUAllocOps". BackendCublas/BackendCudnn would then be defined in TensorDslCuda. I can do this sometime.
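The proposed layering can be sketched with Python classes standing in for the Scala traits (an analogy only; Python's MRO plays the role of Scala trait linearization, and every method body here is a hypothetical stand-in): TensorDsl stays free of GPU allocation ops, while a CUDA-specific subclass mixes them in, so CUDA backends can live only at that level.

```python
# Python analogy (not Lantern code) for the trait layering discussed above:
# keep GPU allocation out of the base DSL; mix it in only for CUDA.

class DslOps:
    def malloc_array(self, n):
        return [0.0] * n              # host-side allocation

class GPUAllocOps:
    def gpu_array_new(self, n):
        return ("gpu", [0.0] * n)     # stand-in for a device allocation

class TensorDsl(DslOps):
    # The base tensor DSL only ever uses host allocation.
    def new_tensor(self, n):
        return self.malloc_array(n)

class TensorDslCuda(TensorDsl, GPUAllocOps):
    # GPU ops are available only in the CUDA-specific layer,
    # where BackendCublas/BackendCudnn would be defined.
    def new_gpu_tensor(self, n):
        return self.gpu_array_new(n)

cpu = TensorDsl()
gpu = TensorDslCuda()
assert len(cpu.new_tensor(4)) == 4
assert gpu.new_gpu_tensor(4)[0] == "gpu"
```

With this split, drivers that mix in only TensorDsl never see (or need to implement) gpu_array_new.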
This works!
testGPU("matrix-matrix-dot-transfer") {
  val test = new LanternDriverCublas[String, Unit] {
    @virtualize
    def snippet(x: Rep[String]): Rep[Unit] = {
      backend = BackendCPU()
      val c1 = Tensor.ones(4, 4)
      val c2 = Tensor.ones(4, 4)
      val g1 = c1.toGPU()
      val g2 = c2.toGPU()

      backend = BackendCublas()
      val g3 = g1.dot(g2) // calls `cublasSgemm`
      val c3 = g3.toCPU()

      backend = BackendCPU()
      val expected = Tensor.fill(4, 4, 4)
      Tensor.assertEqual(c3, expected)
      c3.print()
    }
  }
  runTest(test)
}
$ nvcc -std=c++11 -g lantern-snippet.cu -o snippet -lcublas
$ ./snippet dummy
4.0000000000 4.0000000000 4.0000000000 4.0000000000
4.0000000000 4.0000000000 4.0000000000 4.0000000000
4.0000000000 4.0000000000 4.0000000000 4.0000000000
4.0000000000 4.0000000000 4.0000000000 4.0000000000
It doesn't quite work for non-square matrices with non-one elements, though, because cublasSgemm
performs the multiplication in a different order than our CPU implementation. I'll start investigating.
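A likely culprit (illustrated here in plain Python, not Lantern code): cuBLAS assumes column-major storage, so feeding it row-major buffers effectively transposes every operand. That goes unnoticed for symmetric cases like all-ones square matrices but breaks for general inputs. The usual remedy is to call the column-major GEMM with the operands swapped, since B^T * A^T = (A * B)^T.

```python
# Why a column-major GEMM (like cublasSgemm) disagrees with a
# row-major CPU implementation, demonstrated with nested lists.

def matmul(a, b):
    # Plain row-major matrix product.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def as_col_major(flat, rows, cols):
    # Reinterpret a flat buffer as a column-major rows x cols matrix.
    return [[flat[j * rows + i] for j in range(cols)] for i in range(rows)]

def flatten(m):
    return [x for row in m for x in row]

A = [[1, 2, 3], [4, 5, 6]]         # 2x3, row-major
B = [[7, 8], [9, 10], [11, 12]]    # 3x2, row-major
C = matmul(A, B)                   # what the CPU backend computes

# Reading a row-major buffer as column-major yields the transpose:
At = as_col_major(flatten(A), 3, 2)   # == A^T
Bt = as_col_major(flatten(B), 2, 3)   # == B^T

# A column-major GEMM fed these buffers computes B^T * A^T = (A * B)^T,
# which is exactly C's row-major buffer read back as column-major:
assert matmul(Bt, At) == as_col_major(flatten(C), 2, 2)
```

So calling the column-major GEMM as "B times A" with swapped dimensions leaves a result buffer that, read row-major, is exactly C.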
Good news! At least it is partially working :)
Awesome!
- Use backend.mallocArray to allocate tensor data.
- TensorNew, TensorApply, TensorUpdate expressions.
- Dsl -> DslOps (the trait does define ops).
- TensorExp -> TensorDsl (the trait doesn't define expressions).
- BackendNative -> BackendCPU.
- toCPU and toGPU transfer ops. TODO: free operation.
- DslDriverXX and LanternDriverXX for simplicity.
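The toCPU/toGPU transfer ops in that list can be sketched as explicit copies between device tags (a hypothetical Python stand-in, not Lantern's API; the missing free operation from the TODO is deliberately left out):

```python
# Hypothetical sketch of explicit host/device transfer ops.

class Tensor:
    def __init__(self, data, device):
        self.data, self.device = list(data), device

    def toGPU(self):
        # Copy host data into a (simulated) device buffer.
        return Tensor(self.data, "gpu")

    def toCPU(self):
        # Copy device data back to the host.
        return Tensor(self.data, "cpu")

    # TODO: free() -- device buffers are never released in this sketch,
    # mirroring the missing `free` operation noted above.

c = Tensor([1.0, 2.0], "cpu")
g = c.toGPU()
assert (g.device, g.toCPU().device) == ("gpu", "cpu")
```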