dotnet / machinelearning-modelbuilder

Simple UI tool to build custom machine learning models.
Creative Commons Attribution 4.0 International
265 stars 56 forks

Error: Failed to validate user configuration and data #2568

Open johnseed opened 1 year ago

johnseed commented 1 year ago

System Information (please complete the following information):

Describe the bug

To Reproduce

Steps to reproduce the behavior:

  1. Go to 'Object Detection (Azure)'
  2. Set up the environment
  3. Select the JSON file
  4. Click "Train"
  5. Go to Azure ML Studio and see that the job failed

Expected behavior

The job in Azure ML Studio should have succeeded, and Model Builder should then consume the model.

Screenshots

image

Additional context

The older version of Model Builder seemed fine, so I rolled back to 16.13.9.2235601, and the issue disappeared.

LittleLittleCloud commented 1 year ago

@zewditu Can you take a look at this?

LittleLittleCloud commented 1 year ago

Hitting this in the current main as well. Marking it as a release blocker.

LittleLittleCloud commented 1 year ago

I verified Azure training on the previous release, and it does not seem to work there either. Perhaps there's a breaking change on the Azure AutoML service side.

zewditu commented 1 year ago

@LittleLittleCloud, @johnseed I got a successful run today. Are you able to try it again?

johnseed commented 1 year ago

> @LittleLittleCloud , @johnseed I get successful run today, are you able to try it again?

It's still not working; the same error persists.

image

zewditu commented 1 year ago

@johnseed it will be resolved in our next release.

johnseed commented 1 year ago

I can confirm that the old version 16.13.9.2235601 does not have this issue, but the old version 16 is unable to select V100 or T4 Compute. Due to the large amount of data, the training times out. These accumulated issues have completely hindered my progress, causing not only financial loss from failed training but also impacting my project, which relies on this. I am currently under tremendous pressure.

image

LittleLittleCloud commented 1 year ago

@johnseed Really sorry to hear that. We are going to ship a release early next week that will include the fix for this issue.

LittleLittleCloud commented 1 year ago

Also @zewditu, is it the case that @johnseed can work around this issue by creating a training compute in another region?

zewditu commented 1 year ago

@johnseed Your region is probably not westus2? You can use resources created in the westus2 region to unblock yourself until our next release.

johnseed commented 1 year ago

> @johnseed probably your region is not westus2? you can use resources created in westus2 region to unblock you till our next release

I don't know what it has to do with westus2, as all my compute resources are in eastus. Regardless, I managed to use the T4 GPU, and I hope it works.

image

LittleLittleCloud commented 1 year ago

@johnseed We just released a newer version with the fix for this issue. Please let us know if it fixes the problem.

johnseed commented 1 year ago

> @johnseed We just release a newer version with the fix for this issue, please let us know if it fixes this problem

Thank you for the update! I'll try the new version. I'll let you know if it resolves the issue.

johnseed commented 1 year ago

> @johnseed We just release a newer version with the fix for this issue, please let us know if it fixes this problem

image

Can't say it's perfect, but it's working now, thank you very much!

zewditu commented 1 year ago

@johnseed What makes you say "Can't say it's perfect, but it's working now"? Let us know.

jpcintegral commented 1 year ago

Hi, I am having the same problem today.

johnseed commented 1 year ago

> @johnseed what makes you "Can't say it's perfect, but it's working now" ? let us know

LittleLittleCloud commented 1 year ago

@jpcintegral Can you share the Model Builder version you are using? And does the Azure training fail with the same error information on the Azure portal as well?