feiwang3311 / Lantern

BSD 3-Clause "New" or "Revised" License

Implement backend-defined data allocation and transfer ops. #22

Closed: dan-zheng closed this pull request 5 years ago

dan-zheng commented 5 years ago

TODO:

feiwang3311 commented 5 years ago

Thanks, Dan, for the progress.

On Tue, Oct 9, 2018 at 1:38 PM Dan Zheng notifications@github.com wrote:

@dan-zheng commented on this pull request.

In src/main/scala/lantern/dslapi.scala https://github.com/feiwang3311/Lantern/pull/22#discussion_r223796893:

@@ -206,6 +169,21 @@ trait DslExp extends Dsl

+trait TensorOps extends DslOps {
+  object NewGPUArray {
+    // Allocate an array of the specified size on the GPU.
+    def apply[T: Manifest](scalarCount: Rep[Int]): Rep[Array[T]] = gpu_array_new(scalarCount)
+  }
+  def gpu_array_new[T: Manifest](scalarCount: Rep[Int]): Rep[Array[T]] = ???

I see.

The problem is that NewGPUArray is called within TensorDsl, which LanternDriverC mixes in (by way of LanternDriverC <- LanternDriver <- TensorDslExp <- TensorDsl).

We could make a TensorDslCuda trait that extends TensorDsl and mixes in GPUAllocOps; BackendCublas and BackendCudnn would then be defined in TensorDslCuda, so the CPU-only driver never has to provide GPU allocation ops. I can do this sometime; a minimal sketch of the layering follows.
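
A minimal sketch of that layering, using the names from this thread (the Exp-layer wiring and method bodies are elided, so this is an assumption about the shape rather than the final implementation):

trait GPUAllocOps extends DslOps {
  // GPU allocation stays abstract here, so only CUDA-aware stacks provide it.
  def gpu_array_new[T: Manifest](scalarCount: Rep[Int]): Rep[Array[T]]
}

trait TensorDslCuda extends TensorDsl with GPUAllocOps {
  // BackendCublas and BackendCudnn would be defined here, keeping the
  // CPU-only chain (LanternDriverC <- LanternDriver <- TensorDslExp <- TensorDsl)
  // free of GPU allocation ops.
}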


dan-zheng commented 5 years ago

This works!

testGPU("matrix-matrix-dot-transfer") {
  val test = new LanternDriverCublas[String, Unit] {
    @virtualize
    def snippet(x: Rep[String]): Rep[Unit] = {
      backend = BackendCPU()
      val c1 = Tensor.ones(4, 4)
      val c2 = Tensor.ones(4, 4)
      val g1 = c1.toGPU()
      val g2 = c2.toGPU()

      backend = BackendCublas()
      val g3 = g1.dot(g2) // calls `cublasSgemm`
      val c3 = g3.toCPU()

      backend = BackendCPU()
      val expected = Tensor.fill(4, 4, 4)
      Tensor.assertEqual(c3, expected)
      c3.print()
    }
  }
  runTest(test)
}
$ nvcc -std=c++11 -g lantern-snippet.cu -o snippet -lcublas
$ ./snippet dummy
4.0000000000 4.0000000000 4.0000000000 4.0000000000
4.0000000000 4.0000000000 4.0000000000 4.0000000000
4.0000000000 4.0000000000 4.0000000000 4.0000000000
4.0000000000 4.0000000000 4.0000000000 4.0000000000
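
For reference, here is a rough sketch of what a transfer op like toGPU could lower to. It assumes the LMS unchecked escape hatch for emitting raw C; CUDA_CALL and gpuMalloc are hypothetical helpers in the generated preamble, and this is an illustration, not Lantern's actual code:

// Hypothetical: allocate device memory and copy the host array over.
def arrayToGPU(src: Rep[Array[Float]], count: Rep[Int]): Rep[Array[Float]] = {
  val dst = unchecked[Array[Float]]("(float*)gpuMalloc(", count, " * sizeof(float))")
  unchecked[Unit]("CUDA_CALL(cudaMemcpy(", dst, ", ", src, ", ", count,
                  " * sizeof(float), cudaMemcpyHostToDevice))")
  dst
}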

It doesn't quite work for non-square matrices with non-one elements, though: cublasSgemm expects column-major operands, while our CPU implementation stores tensors row-major, so the operands are effectively multiplied in a different order. I'll start to investigate.
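
The usual fix, sketched under the same unchecked assumptions as above (cublasHandle, CUBLAS_CALL, one, and zero are hypothetical names from the generated preamble): a row-major m x k matrix is bit-for-bit a column-major k x m matrix, so passing the operands to column-major cublasSgemm in reverse order computes B^T * A^T = (A * B)^T, which reads back row-major as exactly A * B.

// Hypothetical row-major GEMM wrapper: C (m x n) = A (m x k) * B (k x n).
def sgemmRowMajor(m: Int, k: Int, n: Int,
                  a: Rep[Array[Float]], b: Rep[Array[Float]],
                  c: Rep[Array[Float]]): Rep[Unit] =
  unchecked[Unit](
    "CUBLAS_CALL(cublasSgemm(cublasHandle, CUBLAS_OP_N, CUBLAS_OP_N, ",
    n, ", ", m, ", ", k, ", &one, ",
    b, ", ", n, ", ",  // B passed first, with leading dimension n
    a, ", ", k, ", &zero, ",
    c, ", ", n, "))")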

feiwang3311 commented 5 years ago

Good news! At least it is partially working :)

TiarkRompf commented 5 years ago

Awesome!