dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Model builder training appears to leak data somehow into the training set #7164

Closed: pjsgsy closed this issue 4 weeks ago

pjsgsy commented 1 month ago

Windows 11, ML.NET 3.0.1, .NET 4.8

When training on a large CSV, my model would consistently get high results that I could not replicate in testing outside of Model Builder. I was letting Model Builder handle the training/validation split, and I tried all of the options: folds, 70/30, 80/20, etc. It always ended up at >90% micro accuracy if left to train, but never got even close when run in real time. After many days, today I split the SAME data file into 2 different files and told Model Builder that the validation data is in the separate file, and hey presto, it can't train to more than 45%. This is better (or worse!). The 2 files are an 80/20 split; I just did it myself. Give Model Builder the whole file and tell it to do the 80/20 split, and it will train to >93% again. Something in there seems to be broken! I have so little visibility into what is going on that I don't have much more to offer in terms of what. It would appear that the validation data is somehow leaked into the training set.
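For anyone trying to reproduce this, here is a minimal sketch of the kind of manual 80/20 split described above. It also counts validation rows whose exact contents appear in the training rows, which is one possible source of leakage worth ruling out. This is not the script used in the report and it says nothing about how Model Builder splits internally. The function names, the file names in the usage note, and the optional shuffle are all assumptions made for illustration.

```python
import csv
import random


def split_rows(rows, train_fraction=0.8, seed=None):
    """Split rows into (train, validation) so that no row index lands in both."""
    rows = list(rows)
    if seed is not None:
        # Optional shuffle. Whether the original manual split was shuffled
        # is not stated in the report, so this is off by default.
        random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def count_shared_rows(train, validation):
    """Count validation rows whose exact contents also occur in the training rows."""
    seen = {tuple(row) for row in train}
    return sum(1 for row in validation if tuple(row) in seen)


def split_csv(source_path, train_path, validation_path,
              train_fraction=0.8, seed=None):
    """Write separate train/validation CSV files, repeating the header in both."""
    with open(source_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    train, validation = split_rows(rows, train_fraction, seed)
    for path, part in ((train_path, train), (validation_path, validation)):
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(part)
    # The third value is the number of validation rows duplicated in train.
    return len(train), len(validation), count_shared_rows(train, validation)
```

Calling `split_csv("data.csv", "train.csv", "validation.csv")` (hypothetical file names) returns the two row counts plus the number of duplicated rows. A non-zero duplicate count would mean that even a clean index-based split leaves identical examples on both sides, which could inflate validation accuracy.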

Separate validation file: [image]

Combined file, letting Model Builder do the split: it will train to >0.93 with the same data and metrics.

Model Builder version is 17.18.2.2415501

pjsgsy commented 4 weeks ago

Posting in the Model Builder issues instead.