Open luisquintanilla opened 1 year ago
Having a separate validation dataset requires an update to our mbconfig file. I proposed this because the validation dataset and train dataset should have the same schema. However, if we are able to set the train dataset from a file and the validation dataset from SQL, or vice versa, that would need a separate structure. @beccamc, @LittleLittleCloud, @JakeRadMSFT thoughts?
Good point! I think it's a reasonable limitation that the validation dataset and train dataset have to come from the same data source (file or SQL).
I feel like allowing different data sources for the train && validation datasets is still necessary. One reason is that allowing different data sources also covers the case where the validation and train datasets come from one source, but not vice versa. Moreover, there may be situations where the train dataset comes from the real world (like a SQL Server in a production environment) and might be updated, while the validation dataset is local and doesn't change. In that case, allowing different data sources for train && validation is a requirement.
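To make the "one source vs. different sources" argument concrete, here is a minimal sketch in Python. It is not the actual mbconfig format: every class and field name below (`FileSource`, `SqlSource`, `DataConfig`, `columns`, and so on) is hypothetical and only illustrates a structure where each dataset carries its own source and schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class FileSource:
    # Hypothetical: a dataset loaded from a local text file.
    path: str
    columns: List[str] = field(default_factory=list)


@dataclass
class SqlSource:
    # Hypothetical: a dataset loaded from a SQL query.
    connection_string: str
    query: str
    columns: List[str] = field(default_factory=list)


DataSource = Union[FileSource, SqlSource]


@dataclass
class DataConfig:
    # Each dataset has its own source, so file + SQL combinations are allowed.
    train: DataSource
    # None means "no separate validation dataset" (e.g. split from train).
    validation: Optional[DataSource] = None


def same_source_type(config: DataConfig) -> bool:
    """True when validation is absent or uses the same kind of source as train."""
    return config.validation is None or type(config.train) is type(config.validation)


mixed = DataConfig(
    train=SqlSource("Server=prod;Database=sales", "SELECT * FROM Orders", ["Price", "Label"]),
    validation=FileSource("validation.csv", ["Price", "Label"]),
)
single = DataConfig(train=FileSource("train.csv", ["Price", "Label"]))
```

A structure like this handles the same-source case as a special case of the different-source one, which is the asymmetry described above.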
About the change to mbconfig, I would agree with @zewditu on having a separate schema section in the data source. Here's a proposal based on our discussion in the engineering sync meeting about 3 months ago.
Is the UI for selecting the validation dataset inside of the Advanced Options dialog @zewditu?
I saw Rohan's shares in several places. Some of them seem to be outside of the advanced data option and some are inside it. I am not sure which one is the final design we agreed on: https://www.figma.com/file/qFoQ4iC27jpNZb7CCuo7Hj/Model-Builder---RM?node-id=4474%3A443415&t=WfE2xH0U0i8TEUD3-0 https://www.figma.com/proto/qFoQ4iC27jpNZb7CCuo7Hj/Model-Builder---RM?page-id=4474%3A443415&node-id=4553%3A451101&scaling=min-zoom&starting-point-node-id=4553%3A447283&show-proto-sidebar=1
I feel like placing two File/SQL selector UIs on the main data page would be overwhelming (I can make a mockup screenshot if anyone wants to see). I see a few ways of moving forward...
We could also follow option 2 but not add the SQL support yet.
@beccamc I am wondering the reverse for #1: why would it only allow text files for validation? What do you mean by "only allows text files"? This screenshot shows we are able to select file/SQL.
Yes, I agree it is overwhelming.
If putting two dataset selectors on one page is too overwhelming for UI, I feel like the second option is more promising. And like @beccamc says, we can add support for SQL in advanced data option later.
I also like having only required elements in the main UI. @luisquintanilla do you have thoughts on this discussion?
One drawback of putting the Split && Validation Option in the Advanced Data Option is that it will be odd if the user wants to go back and retrain, especially if they want to change the validation method, like adjusting the split ratio or changing from cross-validation to a separate validation dataset.
After discussing privately with @JakeRadMSFT, we still decided to put the Split && Validation Option wizard in the Advanced Data Option, because it will be easier to preview the validation dataset and show warning messages if there are mistakes like mismatching columns. We feel that this is more useful than being able to adjust the validation option on the training page rather than the data page.
Any other thoughts on where to put Data Split & validation option? @beccamc @luisquintanilla @zewditu
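As a rough illustration of the "mismatching columns" warning mentioned above, here is a minimal Python sketch. It is not Model Builder's implementation; the function names and the warning structure are assumptions made up for this example, and it only compares column names read from CSV headers.

```python
import csv
import io
from typing import Dict, List


def read_header(csv_text: str) -> List[str]:
    """Return the column names from the first row of CSV text."""
    return next(csv.reader(io.StringIO(csv_text)))


def compare_columns(train_columns: List[str], validation_columns: List[str]) -> Dict[str, List[str]]:
    """Report columns that appear in one dataset but not the other."""
    train_set, validation_set = set(train_columns), set(validation_columns)
    return {
        "missing_in_validation": sorted(train_set - validation_set),
        "extra_in_validation": sorted(validation_set - train_set),
    }


# Hypothetical sample data for illustration only.
train_csv = "Price,Rooms,Label\n100,3,1\n"
validation_csv = "Price,Label,Notes\n120,0,ok\n"

report = compare_columns(read_header(train_csv), read_header(validation_csv))
has_mismatch = any(report.values())
```

A UI could show a warning next to the validation dataset preview whenever `has_mismatch` is true, listing the columns from `report`.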
Hey all,
Thanks for the discussion. There are a few questions on this thread that I'll try to summarize:
cc @beccamc @LittleLittleCloud @JakeRadMSFT @zewditu
Also, I'd ask for feedback from some of our community members
@andrasfuchs @torronen @jwood803 feedback is greatly appreciated on the thread above🙂
Thank you for asking us, I appreciate it! I had no permission to open the specification and the Figma files above.
My project has limited scope and I haven't used Model Builder for a few months, so probably my opinion isn't that useful, but I have a few remarks:
Sorry to be a "scope creeper", but for my cases a sampling key would be very important. I think its importance might often be overlooked.
I would prefer this: Same source with sampling key OR two sources (can be of same type, but at least different files or database queries) BUT NOT same source with random selection
2 & 3. No opinions / preference. Simple is good, but on the other hand, I slightly worry that there will be lots of ML-enabled apps which have never been validated in any way.
Sometimes I've had the problem that the validation and training files do not have the exact same columns. I am not sure if anything can be done about it in Model Builder, just FYI.
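On the sampling key point raised above: the idea is that all rows sharing a key value land in the same split, so related rows never leak between train and validation. Below is a minimal Python sketch of one way to do that using a hash of the key. This is a swapped-in illustration, not how Model Builder or ML.NET implements it; the function name, the `validation_fraction` parameter, and the sample rows are assumptions.

```python
import hashlib
from typing import Dict, List, Tuple

Row = Dict[str, object]


def split_by_sampling_key(
    rows: List[Row], key: str, validation_fraction: float = 0.2
) -> Tuple[List[Row], List[Row]]:
    """Split rows so that every row with the same key value goes to the same set."""
    train, validation = [], []
    for row in rows:
        # Hash the key value to a stable number in [0, 1).
        digest = hashlib.sha256(str(row[key]).encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        if bucket < validation_fraction:
            validation.append(row)
        else:
            train.append(row)
    return train, validation


# Hypothetical data: several measurements per patient.
rows = [{"PatientId": i % 10, "Value": i} for i in range(100)]
train_rows, validation_rows = split_by_sampling_key(rows, "PatientId", 0.3)
```

Because the split depends only on the key value, the resulting fraction is approximate, especially with few distinct keys, but rows for one key are never divided across the two sets.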
As part of advanced data options, allow users to:
The specification is here