Exterminated opened this issue 2 years ago
@LittleLittleCloud We have another "The JSON-RPC connection with the remote party was lost before the request could complete" error. This one is on Dev16. Do you have any ideas?
The "The JSON-RPC connection with the remote party was lost before the request could complete" error simply indicates that the training process was terminated unexpectedly. From the log provided, it seems the process was terminated after `RawByteImageLoading` and before `image classification`, which is probably where training broke.
There can be various reasons why the training process was terminated: a system OOM, a GPU driver issue, or something else. Based on the log provided, there's no easy way for us to determine the root cause.
It would be a great help if you could rule out a few possible causes on the framework side by running the project (you don't need to wait for training to complete; as long as it starts training, the experiment can be deemed a success). It's basically a call into the ML.NET image classification API, which is the same one used by Model Builder in the image classification scenario.
Thank you for responding to my problem! I ran the project from the last comment. It looks fine. Results are in imageClassificationSample-24-08-21.txt
@Exterminated
Thanks for the response. So it seems that the mlnet framework itself doesn't have a problem. Let's try to rule out the AutoML service as well by running the two experiments below:
1. Try CPU training in Model Builder by selecting CPU in the Environment step.
2. Try image classification in the mlnet CLI:

```
dotnet tool install -g mlnet --add-source https://mlnetcli.blob.core.windows.net/mlnetcli/index.json
mlnet image-classification --dataset /path/to/your/image/folder
```
If both experiments run without problems, we can assume that the error might be on the ServiceHub side.
Yep, both experiments were successful. What should I check/reinstall on the ServiceHub side? mlnet-image-test-25-08-2021.txt
We found another issue, #1543, which has the same error as this one.
@Exterminated
The result looks interesting... because the first experiment verifies that ServiceHub works and the second verifies that the training code works. Since both are working, theoretically GPU training should also work without any problem, so we may need to dig deeper into this issue. It would be useful if you could provide us with the ServiceHub logs, in the following way:
ServiceHub logs are located in the `%temp%\servicehub` folder, and you can get all-level logs through the following steps.
Could you also check your GPU memory usage while launching GPU training, in particular whether there's a spike or increase in GPU memory usage before training fails? We suspect the error might be caused by a GPU OOM while initializing cuDNN. If so, you could try the method mentioned in this reply.
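For reference, the mitigation usually suggested for GPU OOM during cuDNN initialization is TensorFlow's `TF_FORCE_GPU_ALLOW_GROWTH` environment variable (the same flag tried later in this thread). A minimal POSIX shell sketch; on Windows cmd you would use `set` instead of `export`, and the flag must be set in the same session that launches training:

```shell
# Enable TensorFlow's incremental GPU memory allocation for this session only,
# so cuDNN initialization does not try to reserve all GPU memory up front.
export TF_FORCE_GPU_ALLOW_GROWTH=true

# Confirm the flag is visible before launching training from this session.
echo "TF_FORCE_GPU_ALLOW_GROWTH=$TF_FORCE_GPU_ALLOW_GROWTH"
```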
It seems my GPU isn't involved in the training at all :( There is no load on it when I start training models.
I have a GeForce GTX 750 Ti; maybe this card isn't suitable for training?
I collected two ServiceHub logs, one with TF_FORCE_GPU_ALLOW_GROWTH = false and one with TF_FORCE_GPU_ALLOW_GROWTH = true:
Logs.zip
No, I don't think so, because GPU training was successful in the project I gave you, so the GTX 750 Ti should be fine. Other than that, the logs you shared also look... normal. So the problem might still be on the AutoMLService side. The trouble is we don't know what the actual error is: it might be a TF binary not found, some initialization problem, or something else; unfortunately, ServiceHub doesn't return any error information about it.
If you can do us a favor (again) and try launching GPU training through the mlnet CLI directly, we might be able to bypass ServiceHub and get error info from AutoMLService directly. Since you already have the mlnet CLI installed from the previous step, launching GPU training is fairly easy with a small trick: replacing the tf-cpu binary with the tf-gpu binary. I'll post detailed steps below.
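The swap itself is just "back up the CPU binary, copy the GPU binary over it." The sketch below illustrates that pattern in a throwaway sandbox directory; all paths and file contents are stand-ins, not the real mlnet tool-store path (which is discussed and corrected later in this thread):

```shell
# Sandbox illustration of the binary-swap trick. Everything here is a stand-in:
# "native" plays the role of the mlnet tool's win-x64\native folder and
# "tf-gpu" plays the role of the GPU TensorFlow package.
sandbox=$(mktemp -d)
native="$sandbox/native"
gpu_pkg="$sandbox/tf-gpu"
mkdir -p "$native" "$gpu_pkg"
printf 'cpu' > "$native/tensorflow.dll"    # stand-in CPU binary
printf 'gpu' > "$gpu_pkg/tensorflow.dll"   # stand-in GPU binary

# 1. Back up the CPU binary so the swap is reversible.
cp "$native/tensorflow.dll" "$native/tensorflow.dll.cpu.bak"
# 2. Overwrite it with the GPU build.
cp "$gpu_pkg/tensorflow.dll" "$native/tensorflow.dll"

cat "$native/tensorflow.dll"   # prints: gpu
```

Keeping the `.cpu.bak` copy means the CLI can be restored to CPU training by copying it back.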
To verify that the GPU is involved in the training, you should see something like this in the output:

```
2021-08-27 10:57:37.058524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4746 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce GTX 1660, pci bus id: 0000:65:00.0, compute capability: 7.5)
```
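If the CLI output is captured to a file, a quick scan for that device line saves reading the whole log by eye. A small sketch; the grep pattern is an assumption derived from the sample line above and may vary across TensorFlow versions:

```shell
# Scan output for TensorFlow's GPU device-creation line. The sample line below
# is taken from the log excerpt above; the pattern is an assumption and may
# differ for other TensorFlow versions.
log_line='2021-08-27 10:57:37.058524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4746 MB memory)'
if printf '%s\n' "$log_line" | grep -q 'Created TensorFlow device (.*device:GPU'; then
  echo "GPU registered by TensorFlow"
else
  echo "no GPU device line found; training likely ran on CPU"
fi
```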
@LittleLittleCloud Do I need to start the training via cmd?
```
mlnet image-classification --dataset /path/to/your/image/folder
```
I replaced the DLLs and then started training via cmd. I don't see any mention of my GPU in the logs. mlnet-29-08-2021.txt
@Exterminated What you did looks correct, but from the log it seems that the mlnet CLI is still using the CPU binary to train the model. That's really strange if the CPU binary was replaced.
I noticed that there's a mistake in the instructions I provided above:

```
%UserProfile%\.dotnet\tools\.store\mlnet\16.7.2\mlnet\16.7.3-dev\tools\netcoreapp3.1\any\runtimes\win-x64\native
```

where `16.7.3-dev` should be `16.7.2`. The correct destination path should be:

```
%UserProfile%\.dotnet\tools\.store\mlnet\16.7.2\mlnet\16.7.2\tools\netcoreapp3.1\any\runtimes\win-x64\native
```

So maybe that's why the CLI is still using the CPU to run the experiment: the original TF binary is still in place?
Hi @Exterminated
What's the largest size of your images? The error could be caused by an OOM on the GPU, and that's most likely due to some very large pictures.
@Exterminated Is this issue resolved?
Hi @beccamc! Unfortunately no =( I tried retesting this a few months ago and got the same errors with small pictures.
Have you updated to the latest version, 16.13.1?
@beccamc Yes
@LittleLittleCloud Are you able to help here again? It's still an issue.
Yeah, sure. We no longer use the image classification API from AutoML.Net since v16.13.1, so it might provide more logs and more information to trace. @Exterminated, would you be able to share the Model Builder log with us again?
Regards!
**System Information (please complete the following information):**

**Describe the bug**
I am trying to take my first steps in a home project, and I decided to try training the network using my local video card, but I ran into the "The JSON-RPC connection .." problem. After searching Google for information, I found that Microsoft recognized this error and fixed it in 16.5.4, but my environment is newer. I tried repairing the installation via the Visual Studio Installer, but to no avail. Has anyone encountered this problem? Can someone suggest how to deal with it?
**To Reproduce**
Steps to reproduce the behavior:

**Expected behavior**
Training proceeds without error and I can move on to the "Evaluate" step.
**Screenshots**
![image](https://user-images.githubusercontent.com/4963117/130353295-568f6258-b34c-4e64-89b7-a54ade38777d.png)
**Additional context**
06428ff7-b6fc-4b0a-88b4-4156f15cf916.txt
c652506b-4dd5-4f80-8abe-efbc08b38f84.txt
9754cd7b-5563-44ff-abfd-8f684fbf7c68.txt