dotnet / machinelearning-modelbuilder

Simple UI tool to build custom machine learning models.

Training network on GPU. Error: "The JSON-RPC connection with the remote party was lost before the request could complete." #1712

Open Exterminated opened 2 years ago

Exterminated commented 2 years ago

System Information (please complete the following information):

Describe the bug

I am taking the first steps in a home project and decided to try training the network on my local video card, but I ran into the error "The JSON-RPC connection with the remote party was lost before the request could complete." After searching Google, I found that Microsoft acknowledged this error and fixed it in 16.5.4, but my environment is newer than that. I tried repairing the installation via the Visual Studio Installer, but to no avail. Has anyone encountered this problem? Can someone suggest how to deal with it?

To Reproduce: Steps to reproduce the behavior:

  1. Go to 'Add..' -> Machine Learning
  2. Click 'Image classification'
  3. On the Environment tab, select Local (GPU)
  4. Click "Check compatibility"
  5. Click "Next step", select a dataset (I'm using the default flower classification dataset), and click "Next step"
  6. On the Train step, click "Train" and wait a few seconds
  7. See the error

Expected behavior: Training proceeds without error and I can move on to the "Evaluate" step.

Screenshots: (three screenshots attached, not reproduced here)

Additional context: attached files 06428ff7-b6fc-4b0a-88b4-4156f15cf916.txt, c652506b-4dd5-4f80-8abe-efbc08b38f84.txt, 9754cd7b-5563-44ff-abfd-8f684fbf7c68.txt

beccamc commented 2 years ago

@LittleLittleCloud We have another "The JSON-RPC connection with the remote party was lost before the request could complete" error. This one is on Dev16. Do you have any ideas?

LittleLittleCloud commented 2 years ago

The error "The JSON-RPC connection with the remote party was lost before the request could complete" simply indicates that the training process was terminated unexpectedly. From the log provided, it seems the training process was terminated after RawByteImageLoading and before image classification, which is probably where training broke.

The training process can be terminated for various reasons: system OOM, a GPU driver issue, or something else. Based on the log provided, there is little way for us to determine the root cause.

It would be of great help if you could rule out a few possible causes on the framework side by running the project (you don't need to wait for training to complete; as long as it starts training, the experiment can be deemed a success). It is basically a call into the ML.NET image classification API, which is the same one Model Builder uses in the image classification scenario.
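For reference, here is a minimal sketch of what that API call looks like, assuming the sample project follows the usual Microsoft.ML.Vision image classification pattern; the folder path, input class, and column names below are illustrative rather than the exact sample code:

```csharp
// Requires the Microsoft.ML, Microsoft.ML.Vision, Microsoft.ML.ImageAnalytics,
// and SciSharp.TensorFlow.Redist (or a GPU redist) NuGet packages.
using System;
using System.IO;
using System.Linq;
using Microsoft.ML;

public class ImageData
{
    public string ImagePath { get; set; }
    public string Label { get; set; }
}

public static class Program
{
    public static void Main()
    {
        // Illustrative dataset layout: one sub-folder per class, e.g. flowers\daisy\*.jpg
        var imageFolder = @"C:\data\flowers";
        var mlContext = new MLContext();

        // Build the training set from the folder structure (label = sub-folder name).
        var images = Directory.GetFiles(imageFolder, "*", SearchOption.AllDirectories)
            .Select(path => new ImageData
            {
                ImagePath = path,
                Label = Directory.GetParent(path).Name
            });
        var data = mlContext.Data.LoadFromEnumerable(images);

        // Same shape of pipeline Model Builder uses: raw image bytes -> ImageClassification trainer.
        var pipeline = mlContext.Transforms.Conversion.MapValueToKey("LabelKey", "Label")
            .Append(mlContext.Transforms.LoadRawImageBytes("Image", imageFolder, "ImagePath"))
            .Append(mlContext.MulticlassClassification.Trainers.ImageClassification(
                labelColumnName: "LabelKey", featureColumnName: "Image"));

        // If this call starts printing training progress, the ML.NET side is working.
        var model = pipeline.Fit(data);
        Console.WriteLine("Training ran without the JSON-RPC error.");
    }
}
```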

Exterminated commented 2 years ago

Thank you for responding to my problem! I ran the project from the last comment and it looks fine. Results are in the attached file imageClassificationSample-24-08-21.txt

LittleLittleCloud commented 2 years ago

@Exterminated

Thanks for the response. So it seems the ML.NET framework itself doesn't have a problem. Let's try to rule out the AutoML service as well by running the two experiments below:

First experiment

Try CPU training in Model Builder by selecting CPU in the Environment step.

Second experiment

Try image classification in the mlnet CLI.
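For example, something like the following, where the dataset path is a placeholder for your local image folder:

mlnet image-classification --dataset <path-to-your-image-folder>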

If both experiments run without problems, we can assume that the error is on the ServiceHub side.

Exterminated commented 2 years ago

Yep, both experiments were successful. What should I check or reinstall on the ServiceHub side? mlnet-image-test-25-08-2021.txt

LittleLittleCloud commented 2 years ago

We found another issue, #1543, which has the same error as this one.

LittleLittleCloud commented 2 years ago

@Exterminated

The result looks interesting... The first experiment verifies that ServiceHub works and the second verifies that the training code works. Since both work, theoretically GPU training should also work without any problem, so we need to dig deeper into this issue. It would be useful if you could provide us with the ServiceHub logs, obtained the following way.

How to get all-level ServiceHub logs

ServiceHub logs are located in the %temp%\servicehub folder, and you can get all-level logs through the following steps:

Could you also check your GPU memory usage while launching GPU training, especially whether there is a spike or increase in GPU memory usage before training fails? We suspect the error might be caused by a GPU OOM error while initializing cuDNN. If so, you could try the method mentioned in this reply.
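If GPU OOM during cuDNN initialization is indeed the culprit, a common TensorFlow-side mitigation is the TF_FORCE_GPU_ALLOW_GROWTH setting, which tells TensorFlow to allocate GPU memory on demand instead of reserving nearly all of it up front. A minimal sketch for a standalone ML.NET console project, assuming the variable must be set before the trainer initializes TensorFlow:

```csharp
using System;

// Sketch: enable on-demand GPU memory growth for TensorFlow.
// This must run before the ML.NET ImageClassification trainer creates the TensorFlow session.
Environment.SetEnvironmentVariable("TF_FORCE_GPU_ALLOW_GROWTH", "true");

// ... create MLContext, build the image classification pipeline, and call Fit() afterwards ...
```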

Exterminated commented 2 years ago

It seems my GPU isn't involved in the training at all :( There is no load at all when I start training models. I have a GeForce GTX 750 Ti; maybe this card isn't suitable for training? I collected two ServiceHub logs, one with TF_FORCE_GPU_ALLOW_GROWTH = false and one with TF_FORCE_GPU_ALLOW_GROWTH = true (screenshot attached). Logs.zip

LittleLittleCloud commented 2 years ago

No, I don't think so, because GPU training was successful in the project I gave you, so the GTX 750 Ti should be fine. Other than that, the log you shared also looks... normal. So the problem might still be on the AutoMLService side. The trouble is we don't know what the error is; it might be a missing TF binary, some initialization problem, or something else, and unfortunately ServiceHub doesn't return any error information about it.

If you can do us a favor (again) and try launching GPU training through the mlnet CLI directly, we might be able to bypass ServiceHub and get the error info from AutoMLService directly. Since you already have the mlnet CLI installed from the previous step, launching a GPU training run is fairly easy with a small trick: replacing the tf-cpu binary with the tf-gpu binary. I'll post the detailed steps below.

To verify that the GPU is involved in the training, you should see something like this in the output:

2021-08-27 10:57:37.058524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4746 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce GTX 1660, pci bus id: 0000:65:00.0, compute capability: 7.5)
Exterminated commented 2 years ago

@LittleLittleCloud Do I need to start training via cmd?

mlnet image-classification --dataset /path/to/your/image/folder

I replaced the DLLs and then started training via cmd. I don't see any mention of my GPU in the logs. mlnet-29-08-2021.txt

LittleLittleCloud commented 2 years ago

@Exterminated What you did looks correct, but from the log it seems that the mlnet CLI is still using the CPU binary to train the model. That is really strange if the CPU binary was replaced.

I noticed that there is a mistake in the instructions I provided above:

%UserProfile%\.dotnet\tools\.store\mlnet\16.7.2\mlnet\16.7.3-dev\tools\netcoreapp3.1\any\runtimes\win-x64\native

where 16.7.3-dev should be 16.7.2. The correct destination path should be:

%UserProfile%\.dotnet\tools\.store\mlnet\16.7.2\mlnet\16.7.2\tools\netcoreapp3.1\any\runtimes\win-x64\native

So maybe that's why the CLI is still using the CPU to run the experiment: the original TF binary is still in place?

LittleLittleCloud commented 2 years ago

Hi @Exterminated

What's the largest size of your images? The error could be caused by OOM on the GPU, and that is most likely due to some very large pictures.
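A quick, hedged way to check for unusually large images in the dataset, assuming it is a plain folder tree of image files (the path below is a placeholder):

```csharp
using System;
using System.IO;
using System.Linq;

// Print the five largest files in the dataset folder to spot unusually big images.
var datasetDir = @"C:\data\flowers"; // placeholder path
var largestFiles = Directory.EnumerateFiles(datasetDir, "*", SearchOption.AllDirectories)
    .Select(path => new FileInfo(path))
    .OrderByDescending(info => info.Length)
    .Take(5);

foreach (var file in largestFiles)
    Console.WriteLine($"{file.FullName}: {file.Length / (1024.0 * 1024.0):F1} MB");
```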

beccamc commented 2 years ago

@Exterminated Is this issue resolved?

Exterminated commented 2 years ago

Hi @beccamc! Unfortunately no =( I tried to retest this a few months ago and got the same errors with small pictures.

beccamc commented 2 years ago

Have you updated to the latest version, 16.13.1?

Exterminated commented 2 years ago

@beccamc Yes (see attached screenshots).

beccamc commented 2 years ago

@LittleLittleCloud Are you able to help here again? It's still an issue.

LittleLittleCloud commented 2 years ago

Yeah, sure. We no longer use the image classification API from AutoML.Net since v16.13.1, so it might provide more logs and more information to trace. @Exterminated, would you be able to share the Model Builder log with us again?

Regards!

beccamc commented 1 year ago

I know this is an old issue. We've had a few releases. @Exterminated Any chance you can try on 16.14.0 and share logs?