bytedeco / javacpp-presets

The missing Java distribution of native C++ libraries

[pytorch] How to train with an Nvidia GPU on Windows using javacpp pytorch #1397

Closed mullerhai closed 1 year ago

mullerhai commented 1 year ago

Hi, I want to train the MNIST example with javacpp pytorch on Windows 11 with an Nvidia RTX 3060 GPU, but after importing the pytorch GPU dependency it still trains only on the CPU and never uses the GPU. I don't know why. Is the Nvidia driver version or CUDA version incompatible with javacpp pytorch, or do I need to import another jar, or change my code? @HGuillemet @saudet


System:

- Windows 11
- JDK 8
- Scala 2.12.17
- CUDA driver: `NVIDIA-SMI 525.104 | Driver Version: 528.79 | CUDA Version: 12.0`

```scala
val sparkVersion = "3.4.0"
val javacppPytorch = "2.0.1-1.5.10-SNAPSHOT"
//libraryDependencies += "com.nvidia" %% "rapids-4-spark" % "23.06.0"
// https://mvnrepository.com/artifact/ai.rapids/cudf
libraryDependencies += "ai.rapids" % "cudf" % "23.06.0"
libraryDependencies += "org.bytedeco" % "cuda" % "12.1-8.9-1.5.9"
// https://mvnrepository.com/artifact/org.bytedeco/numpy
libraryDependencies += "org.bytedeco" % "numpy" % "1.24.3-1.5.9"
libraryDependencies += "org.bytedeco" % "pytorch" % javacppPytorch // previously tried "1.12.1-1.5.8", "1.10.2-1.5.7"
// https://mvnrepository.com/artifact/org.bytedeco/pytorch-platform
libraryDependencies += "org.bytedeco" % "pytorch-platform" % javacppPytorch
libraryDependencies += "org.bytedeco" % "pytorch-platform-gpu" % javacppPytorch
libraryDependencies += "org.bytedeco" % "mkl-platform-redist" % "2023.1-1.5.10-SNAPSHOT" // previously "2023.1-1.5.9"
```

The MNIST example code is essentially the one from the README, and on CPU it runs successfully:

```scala
/* try to use MKL when available */
System.setProperty("org.bytedeco.openblas.load", "mkl")
// Create a new Net.
val net = new SimpleMNIST.Net
...
```

HGuillemet commented 1 year ago

What error do you get? Does it work with 2.0.1-1.5.9?

mullerhai commented 1 year ago

> What error do you get? Does it work with 2.0.1-1.5.9?

I don't get any error; it just runs on the CPU. I don't know how to configure the GPU environment for javacpp pytorch. Do I need to add these dependencies?

```scala
libraryDependencies += "org.bytedeco" % "cuda-platform" % "12.1-8.9-1.5.9"
// https://mvnrepository.com/artifact/org.bytedeco/cuda-platform-redist
libraryDependencies += "org.bytedeco" % "cuda-platform-redist" % "12.1-8.9-1.5.9"
```
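Either way, a quick runtime check (a minimal sketch using the same `cuda_is_available`/`cuda_device_count` calls from `org.bytedeco.pytorch.global.torch` that appear later in this thread) shows whether the GPU backend actually loaded:

```scala
import org.bytedeco.pytorch.global.torch._

// Prints "false ... 0" when only the CPU backend is on the classpath or loadable
println(s"cuda available: ${cuda_is_available()}, device count: ${cuda_device_count()}")
```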

And by the way, do we have implementations of the usual DataLoader / Dataset / DataReader in javacpp pytorch?

saudet commented 1 year ago

Please install CUDA 12.1, not 12.0.
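For context, the Bytedeco CUDA artifact version used above encodes the pairing: `12.1-8.9-1.5.9` means CUDA 12.1 with cuDNN 8.9 built against JavaCPP 1.5.9, so the installed toolkit needs to match. If a system-wide install is undesirable, the redist artifact mentioned earlier in this thread bundles the CUDA libraries via Maven; a sketch in sbt form, reusing the versions already quoted:

```scala
// Sketch: pull the CUDA 12.1 runtime as a Maven artifact instead of
// installing the toolkit system-wide (large download, as noted below)
libraryDependencies += "org.bytedeco" % "cuda-platform-redist" % "12.1-8.9-1.5.9"
```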

mullerhai commented 1 year ago

> Please install CUDA 12.1, not 12.0.

OK, I will install CUDA 12.1. Is there anything else I need to do?

mullerhai commented 1 year ago

> Please install CUDA 12.1, not 12.0.

The javacpp CUDA redist jars download very slowly from the Maven repo, and they are very large.

mullerhai commented 1 year ago

> Please install CUDA 12.1, not 12.0.
>
> OK, I will install CUDA 12.1. Is there anything else I need to do?

After I installed CUDA 12.1, javacpp pytorch recognizes CUDA:

```scala
println(s"cuda gpu is ${cuda_is_available()}, cuda device count ${cuda_device_count()}")
```

The console shows `cuda gpu is true, cuda device count 1`,

but if I change the tensor device from CPU to CUDA, the training process hangs and never runs, even for the MNIST example. I don't know why the Nvidia GPU cannot run it:

```scala
@throws[Exception]
def main(args: Array[String]): Unit = {
  /* try to use MKL when available */
  System.setProperty("org.bytedeco.openblas.load", "mkl")

  // Create a new Net.
  val net = new SimpleMNIST.Net

  import org.bytedeco.pytorch.presets.torch.cout
  // (other module variants tried: SeqNow, DictNow, ListNow, DictNet,
  //  ModuleListNet, ModuleListSequentialNow, SequentialNet)
  val seqNow = new SimpleMNIST.ModuleListYieldNow()
  seqNow.shiftLeft(cout())

  println(s"cuda gpu is ${cuda_is_available()}, cuda device count ${cuda_device_count()}")

  // Create a multi-threaded data loader for the MNIST dataset.
  val data_set = new MNIST("./data").map(new ExampleStack)
  val data_loader = new MNISTRandomDataLoader(data_set, new RandomSampler(data_set.size.get),
    new DataLoaderOptions(/*batch_size=*/ 64))

  import org.bytedeco.pytorch.{Device => TorchDevice}
  val device: TorchDevice = new TorchDevice(DeviceType.CUDA)

  // Instantiate an SGD optimization algorithm to update our Net's parameters.
  val optimizer = new SGD(seqNow.parameters, new SGDOptions(/*lr=*/ 0.01))

  for (epoch <- 1 to 20) {
    var batch_index = 0
    // Iterate the data loader to yield batches from the dataset.
    var it = data_loader.begin
    while (!it.equals(data_loader.end)) {
      val batch = it.access
      // Reset gradients.
      optimizer.zero_grad()
      // Execute the model on the input data.
      val prediction = seqNow.forward(batch.data).to(device, ScalarType.Float)
      val target = batch.target.to(device, ScalarType.Long)
      // Compute a loss value to judge the prediction of our model.
      println(s"prediction ${prediction.shape().mkString(":")}, prediction dtype ${prediction.dtype().toScalarType}, target ${target.dtype().toScalarType}, label ${batch.target().shape().mkString(",")}")
      val loss = nll_loss(prediction, batch.target)
      // Compute gradients of the loss w.r.t. the parameters of our model.
      loss.backward()
      // Update the parameters based on the calculated gradients.
      optimizer.step
      // Output the loss and checkpoint every 100 batches.
      batch_index += 1
      if (batch_index % 100 == 0) {
        System.out.println("Epoch: " + epoch + " | Batch: " + batch_index + " | Loss: " + loss.item_float)
        // Serialize your model periodically as a checkpoint.
        val archive = new OutputArchive
        archive.save_to("net.pt")
      }
      it = it.increment
    }
  }
}
```

mullerhai commented 1 year ago

Hi @HGuillemet @saudet, I wrote a TensorOptions, but I find that only the Tensor (or AbstractTensor) constructors take a TensorOptions parameter; the DataSet / DataLoader / Example / ExampleVector have no TensorOptions parameter. So how do I declare that javacpp pytorch should use the GPU? Please give me a GPU training code example if convenient. Thanks.

```scala
import org.bytedeco.pytorch.{Device => TorchDevice}
val device: TorchDevice = new TorchDevice(DeviceType.CUDA)
val scalarType = ScalarType.Float
val tensorOpt: TensorOptions = new TensorOptions(scalarType).device(new DeviceOptional(device))
```
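For illustration, since the dataset/loader classes take no TensorOptions, a minimal sketch (reusing only the `.to(device, scalarType)` overload and the `batch.data`/`batch.target` fields already used in this thread) is to move each batch's tensors inside the training loop:

```scala
// Sketch of a loop body; `batch`, `device`, and `net` come from the
// surrounding training loop as in the code posted above
val data   = batch.data.to(device, ScalarType.Float)
val target = batch.target.to(device, ScalarType.Long)
val prediction = net.forward(data)      // model parameters must also be on `device`
val loss = nll_loss(prediction, target) // now both tensors live on cuda:0
```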
HGuillemet commented 1 year ago

> but if I change the tensor device from CPU to CUDA, the training process hangs and never runs, even for the MNIST example. I don't know why the Nvidia GPU cannot run it.

Maybe the process is not hanging but is compiling PTX code. See the note at the end of this post.

About DataLoader: I don't use them and I didn't write the mapping code. I'll have to take a closer look.

mullerhai commented 1 year ago

> > but if I change the tensor device from CPU to CUDA, the training process hangs and never runs, even for the MNIST example. I don't know why the Nvidia GPU cannot run it.
>
> Maybe the process is not hanging but is compiling PTX code. See the note at the end of this post.
>
> About DataLoader: I don't use them and I didn't write the mapping code. I'll have to take a closer look.

Maybe. Now I will rerun it. I get one warning message:

```
[W D:\a\javacpp-presets\javacpp-presets\pytorch\cppbuild\windows-x86_64-gpu\pytorch\torch\csrc\jit\codegen\cuda\interface.cpp:47] Warning: Loading nvfuser library failed with: error in LoadLibrary for nvfuser_codegen.dll. WinError 126: The specified module could not be found. (function LoadingNvfuserLibrary)
```

mullerhai commented 1 year ago

> > but if I change the tensor device from CPU to CUDA, the training process hangs and never runs, even for the MNIST example. I don't know why the Nvidia GPU cannot run it.
>
> Maybe the process is not hanging but is compiling PTX code. See the note at the end of this post.
>
> About DataLoader: I don't use them and I didn't write the mapping code. I'll have to take a closer look.

Hi, after waiting 10 minutes I got a console error. I am sure CUDA is usable, but how do I assign the GPU device to a layer Module? The linear layer's weight matrix is presumably still on the CPU while the input tensor has been assigned to the GPU, hence the error:


```
[W D:\a\javacpp-presets\javacpp-presets\pytorch\cppbuild\windows-x86_64-gpu\pytorch\torch\csrc\jit\codegen\cuda\interface.cpp:47] Warning: Loading nvfuser library failed with: error in LoadLibrary for nvfuser_codegen.dll. WinError 126: The specified module could not be found.
 (function LoadingNvfuserLibrary)
cuda gpu is true, cuda device count 1
Exception in thread "main" java.lang.RuntimeException: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)
Exception raised from common_device_check_failure at D:\a\javacpp-presets\javacpp-presets\pytorch\cppbuild\windows-x86_64-gpu\pytorch\aten\src\ATen\core\adaption.cpp:10 (most recent call first):
00007FFCF1F2DC62 <unknown symbol address> c10.dll!<unknown symbol> [<unknown file> @ <unknown line number>]
00007FFCF1F2D8AA <unknown symbol address> c10.dll!<unknown symbol> [<unknown file> @ <unknown line number>]
00007FFBC9A66246 <unknown symbol address> torch_cpu.dll!<unknown symbol> [<unknown file> @ <unknown line number>]
00007FFBAB7B60FB <unknown symbol address> torch_cuda.dll!<unknown symbol> [<unknown file> @ <unknown line number>]
00007FFBAB6C527F <unknown symbol address> torch_cuda.dll!<unknown symbol> [<unknown file> @ <unknown line number>]
00007FFBCA21DC62 <unknown symbol address> torch_cpu.dll!<unknown symbol> [<unknown file> @ <unknown line number>]
00007FFBCB735ED5 <unknown symbol address> torch_cpu.dll!<unknown symbol> [<unknown file> @ <unknown line number>]
00007FFBCB753553 <unknown symbol address> torch_cpu.dll!<unknown symbol> [<unknown file> @ <unknown line number>]
00007FFBCA193D68 <unknown symbol address> torch_cpu.dll!<unknown symbol> [<unknown file> @ <unknown line number>]
00007FFBCCA31532 <unknown symbol address> torch_cpu.dll!<unknown symbol> [<unknown file> @ <unknown line number>]
00007FFC78287440 <unknown symbol address> jnitorch.dll!<unknown symbol> [<unknown file> @ <unknown line number>]
000002850816F177 <unknown symbol address> !<unknown symbol> [<unknown file> @ <unknown line number>]
```
mullerhai commented 1 year ago

Hi, I got it running successfully, training on the GPU:

```scala
class Net() extends Module { // Construct and register three Linear submodules.
  var fc1 = new LinearImpl(784, 64)
  fc1.to(device, true)
  register_module("fc1", fc1)
  var fc2 = new LinearImpl(64, 32)
  fc2.to(device, true)
  register_module("fc2", fc2)
  var fc3 = new LinearImpl(32, 10)
  fc3.to(device, true)
  register_module("fc3", fc3)
  // ...
}
```
HGuillemet commented 1 year ago

Is this issue solved, and can it be closed?

mullerhai commented 1 year ago

> Is this issue solved, and can it be closed?

Yeah, let me close it.