Error: Failed to validate user configuration and data

johnseed commented 1 year ago

System Information (please complete the following information):

Model Builder Version (available in Manage Extensions dialog): 17.14.4.2313404
Visual Studio Version: Microsoft Visual Studio Enterprise 2022 (64-bit) - Current Version 17.5.3

Describe the bug

On which step of the process did you run into an issue: Object detection on Azure, Training step
Clear description of the problem: Initially, Model Builder showed no prompt. In Azure ML Studio, the job failed with the error message "Error: Failed to validate user configuration and data. 1. Service failed to retrieve the data. Ensure data correctness and availability." Eventually, Model Builder reported either "A network issue has been detected." or "Didn't find a child run."

To Reproduce

Steps to reproduce the behavior:

1. Go to 'Object Detection (Azure)'
2. Set up the environment
3. Select the JSON file
4. Click "Train"
5. Go to Azure ML Studio and see that the job failed

Expected behavior

The job in Azure ML Studio should have succeeded, and Model Builder should then consume the model.

Additional context

The older version of Model Builder seemed fine, so I rolled back to 16.13.9.2235601, and the issue disappeared.

LittleLittleCloud commented 1 year ago

@zewditu Can you take a look at this?

LittleLittleCloud commented 1 year ago

Hitting this in the current main as well. Marking it as a release blocker.

LittleLittleCloud commented 1 year ago

I verified Azure training on the previous release, and it does not seem to work there either. Perhaps there's a breaking change on the Azure AutoML service side.

zewditu commented 1 year ago

@LittleLittleCloud, @johnseed I got a successful run today. Are you able to try it again?

johnseed commented 1 year ago

> @LittleLittleCloud, @johnseed I got a successful run today. Are you able to try it again?

It's still not working, the same error persists.

zewditu commented 1 year ago

@johnseed it will be resolved in our next release.

johnseed commented 1 year ago

I can confirm that the old version 16.13.9.2235601 does not have this issue, but the old version 16 is unable to select V100 or T4 Compute. Due to the large amount of data, the training times out. These accumulated issues have completely hindered my progress, causing not only financial loss from failed training but also impacting my project, which relies on this. I am currently under tremendous pressure.

LittleLittleCloud commented 1 year ago

@johnseed Really sorry to hear that. We are going to release early next week which will include the fix for this issue.

LittleLittleCloud commented 1 year ago

Also @zewditu, is it the case that @johnseed can work around this issue by creating a training compute in another region?

zewditu commented 1 year ago

@johnseed Your region is probably not westus2? You can use resources created in the westus2 region to unblock yourself until our next release.

johnseed commented 1 year ago

> @johnseed Your region is probably not westus2? You can use resources created in the westus2 region to unblock yourself until our next release.

I don't know what it has to do with westus2, as all my compute resources are in eastus. Regardless, I managed to use the T4 GPU, and I hope it works.

LittleLittleCloud commented 1 year ago

@johnseed We just released a newer version with the fix for this issue. Please let us know if it fixes the problem.

johnseed commented 1 year ago

> @johnseed We just released a newer version with the fix for this issue. Please let us know if it fixes the problem.

Thank you for the update! I'll try the new version. I'll let you know if it resolves the issue.

johnseed commented 1 year ago

> @johnseed We just released a newer version with the fix for this issue. Please let us know if it fixes the problem.

Can't say it's perfect, but it's working now, thank you very much!

zewditu commented 1 year ago

@johnseed What makes you say "Can't say it's perfect, but it's working now"? Let us know.

jpcintegral commented 1 year ago

Hi, I am having the same problem today

johnseed commented 1 year ago

> @johnseed What makes you say "Can't say it's perfect, but it's working now"? Let us know.

Upload reliability: In poor network conditions, even if the upload fails or is not completely successful, the model training process will still continue.
Compute Cluster Size: The available Compute Cluster Sizes are limited. My project has over 30,000 images, and I would like to use either V100 x 8 or A100 x 8 for compute. Although ML Studio can create a V100 x 8 Compute, it cannot be selected in Model Builder.
Undocumented gap: There is a lack of documentation, and some features are not well explained. For example, it is not clear how to use MLFlow when the model results shown in ML Studio are from MLFlow. Additionally, the process of downloading and using ONNX models is very complex. On the other hand, MLFlow is easier to use. Currently, I download MLFlow models directly, deploy them using Docker, and wrap them as API calls. However, there is no documentation on how to use MLFlow models, and I had to figure it out on my own in a day.
Resource selection bug: There are some minor bugs in the resource selection interface, causing many exception dialogs to pop up. However, I did not record them due to the time elapsed.
Start over every time: The training cost is high, and we have new images every week to add to the model training. However, we have to start training from scratch each time instead of continuing from previous results.
Use GPU: There is a lack of documentation for using GPUs with trained models, including code configuration and the installation of CUDA and CuDNN. I had to figure it out on my own.
It doesn't feel like using mature software; it feels more like reverse engineering, as many features require self-exploration and problem-solving.
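
On the upload reliability point, one way to guard against training on a partial upload is to compare the local dataset with what the remote store reports before starting a job. The following is only a minimal sketch in Python, not something Model Builder does or exposes: `find_upload_problems`, `ready_to_train`, and the `remote_manifest` mapping (relative path to SHA-256 digest) are hypothetical names, and how such a manifest would be fetched from Azure storage is deliberately left open.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_upload_problems(local_dir: Path, remote_manifest: dict) -> dict:
    """Compare local files with a manifest of what the remote store holds.

    remote_manifest maps a relative POSIX path to a SHA-256 hex digest.
    Obtaining that manifest is an assumption of this sketch (hypothetical).
    """
    missing, corrupted = [], []
    for path in sorted(p for p in local_dir.rglob("*") if p.is_file()):
        key = path.relative_to(local_dir).as_posix()
        if key not in remote_manifest:
            missing.append(key)        # never arrived remotely
        elif remote_manifest[key] != sha256_of(path):
            corrupted.append(key)      # arrived, but content differs
    return {"missing": missing, "corrupted": corrupted}


def ready_to_train(problems: dict) -> bool:
    """Only allow training when every local file was uploaded intact."""
    return not problems["missing"] and not problems["corrupted"]
```

With a check like this, a failed or partial upload would stop the run before any compute is billed, rather than surfacing later as a failed or low-quality training job.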

LittleLittleCloud commented 1 year ago

@jpcintegral Can you share the Model Builder version you are using? And does the Azure training fail with the same error message in the Azure portal as well?

dotnet / machinelearning-modelbuilder

Error: Failed to validate user configuration and data #2568