dotnet / machinelearning-modelbuilder

Simple UI tool to build custom machine learning models.
Creative Commons Attribution 4.0 International

Advanced Data Options - Data Split & Validation Dataset #2396

Open luisquintanilla opened 1 year ago

luisquintanilla commented 1 year ago

As part of advanced data options, allow users to:

The specification is here

zewditu commented 1 year ago

Having a separate validation dataset requires an update to our mbconfig file. I proposed this because the validation dataset and train dataset should have the same schema. However, if we are able to set the train dataset from a file and the validation dataset from SQL, or vice versa, they would need separate structures. @beccamc @LittleLittleCloud @JakeRadMSFT thoughts?

beccamc commented 1 year ago

Good point! I think it's a reasonable limitation that the validation dataset and train dataset have to come from the same data source (file or SQL).

LittleLittleCloud commented 1 year ago

I feel like allowing different data sources for train && validation datasets is still necessary. One reason is that allowing different data sources also covers the case where the validation and train datasets come from one source, but not vice versa. Moreover, there may be situations where the train dataset comes from the real world (like a SQL Server in a production environment) and might be updated, while the validation dataset is local and doesn't change. In that case, allowing different data sources for train && validation is a requirement.
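To illustrate the argument above, here is a minimal sketch of a data model in which the train and validation sources are chosen independently, so that "same source kind" is just a special case. All names here (`FileSource`, `SqlSource`, `DataConfig`, `same_source_kind`) are hypothetical and used only for illustration. This is not the mbconfig format and not the proposal discussed in this thread.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class FileSource:
    """Hypothetical file-based source (for example a CSV file)."""
    path: str


@dataclass(frozen=True)
class SqlSource:
    """Hypothetical SQL-based source."""
    connection_string: str
    query: str


DataSource = Union[FileSource, SqlSource]


@dataclass(frozen=True)
class DataConfig:
    """Hypothetical config: one shared schema, two independent sources."""
    columns: Tuple[str, ...]                  # schema declared once for both datasets
    train: DataSource
    validation: Optional[DataSource] = None   # None: no separate validation dataset


def same_source_kind(cfg: DataConfig) -> bool:
    """True when a validation source exists and is the same kind as the train source."""
    return cfg.validation is not None and type(cfg.train) is type(cfg.validation)
```

With this shape, a v1 that only supports matching source kinds could reject configs where `same_source_kind` is false, and lift that check later without changing the structure.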

About the change to mbconfig, I would agree with @zewditu to have a separate schema section in the data source. Here's a proposal based on our discussion in the engineering sync meeting about 3 months ago.

beccamc commented 1 year ago

Is the UI for selecting the validation dataset inside of the Advanced Options dialog @zewditu?

zewditu commented 1 year ago

I have seen several of Rohan's shared designs. In some of them the selector is outside of the advanced data options, and in others it is inside. I am not sure which one is the final design we agreed on: https://www.figma.com/file/qFoQ4iC27jpNZb7CCuo7Hj/Model-Builder---RM?node-id=4474%3A443415&t=WfE2xH0U0i8TEUD3-0 https://www.figma.com/proto/qFoQ4iC27jpNZb7CCuo7Hj/Model-Builder---RM?page-id=4474%3A443415&node-id=4553%3A451101&scaling=min-zoom&starting-point-node-id=4553%3A447283&show-proto-sidebar=1

beccamc commented 1 year ago

I feel like placing two File/SQL selector UIs on the main data page would be overwhelming (I can make a mockup screenshot if anyone wants to see). I see a few ways of moving forward...

  1. Put validation on the data page. Only allow text files for the validation set. Then the validation set can be selected from the main UI fairly easily.
  2. Put the validation selection inside of Advanced Data Options. It would be reasonable to put the File/SQL selector UI inside of this dialog to allow either option. This uses XiaoYun's recommended .mbconfig.

We could also follow option 2 but not add the SQL support yet.

zewditu commented 1 year ago

@beccamc I see #1 the other way around: why would it only allow text files for validation? What do you mean by "only allows text files"? This screenshot shows we are able to select file/SQL: (image)

It is overwhelming, yes, I agree.

For #2, the UI says "upload", so we do not have an option.

image

LittleLittleCloud commented 1 year ago

If putting two dataset selectors on one page is too overwhelming for UI, I feel like the second option is more promising. And like @beccamc says, we can add support for SQL in advanced data option later.

beccamc commented 1 year ago

I also like having only required elements in the main UI. @luisquintanilla do you have thoughts on this discussion?

LittleLittleCloud commented 1 year ago

One drawback of putting the Split && Validation Option in Advanced Data Option is that it will be odd if the user wants to go back and retrain, especially if they want to change the validation method, like adjusting the split ratio or changing from cross-validation to a separate validation dataset.

After discussing privately with @JakeRadMSFT, we still decided to put the Split && Validation Option wizard in Advanced Data Option because it will be easier to preview the validation dataset and show warning messages if there are mistakes like mismatching columns. We feel that this is more useful than being able to adjust the validation option on the training page instead of the data page.

Any other thoughts on where to put Data Split & validation option? @beccamc @luisquintanilla @zewditu

luisquintanilla commented 1 year ago

Hey all,

Thanks for the discussion. So there's a few questions on this thread I'll try and summarize:

  1. Should the data sources be the same or different? For v1 I think we can start off by requiring the data source to be the same. Though I agree with XiaoYun that it's likely the case that training and validation sets will come from different sources, so that should be the end goal.
  2. Where should the validation data selector be? I don't have a strong preference though I can see Option 1 being overwhelming especially since it's not a requirement. For Option 2 since it leverages the mbconfig proposal and also enables multiple data sources which is the end goal, I'm more inclined to go that route.
  3. Where should the data split & validation options be? I think in the Data page makes sense.

cc @beccamc @LittleLittleCloud @JakeRadMSFT @zewditu

Also, I'd ask for feedback from some of our community members

@andrasfuchs @torronen @jwood803 feedback is greatly appreciated on the thread above🙂

andrasfuchs commented 1 year ago

Thank you for asking us, I appreciate it! I had no permission to open the specification and the Figma files above.

My project has limited scope and I haven't used Model Builder for a few months, so probably my opinion isn't that useful, but I have a few remarks:

  1. Should the data sources be the same or different? My data sources always had the same CSV format, but I can totally see that there will be cases where you need to get the train and validation data from different sources.
  2. Where should the validation data selector be? Looking at the two versions I think it would be better to have the selector in advanced data options. On the other hand I think it would be good to have at least a basic, read-only description about the validation method that will be used on the Data page (based on the current settings in advanced data options).
  3. Where should the data split & validation options be? They would be ok on the advanced data options if we had a summary and a link to edit on the Data page as I described in (2).

torronen commented 1 year ago

Sorry to be a "scope creeper" but for my cases a sampling key would be very important. I think its importance might often be overlooked.

  1. I would prefer this: Same source with sampling key OR two sources (can be of same type, but at least different files or database queries) BUT NOT same source with random selection

  2. & 3. No opinions / preference. Simple is good, but on the other hand, I just slightly worry if there will be lots of ML-enabled apps which have never been validated in any way.
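For readers unfamiliar with the "same source with sampling key" option in point 1, here is a minimal sketch of the idea: rows are assigned to train or validation by hashing the key column, so all rows sharing a key land on the same side. The function name `split_by_sampling_key` and the row format (a list of dicts) are assumptions made for illustration, not how Model Builder implements it.

```python
import hashlib
from typing import Dict, List, Tuple

Row = Dict[str, str]


def split_by_sampling_key(
    rows: List[Row], key: str, validation_fraction: float = 0.2
) -> Tuple[List[Row], List[Row]]:
    """Split rows so that every row with the same key value goes to the same side."""
    train: List[Row] = []
    validation: List[Row] = []
    for row in rows:
        # Hash the key value to a stable number in [0, 1).
        digest = hashlib.sha256(row[key].encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        if bucket < validation_fraction:
            validation.append(row)
        else:
            train.append(row)
    return train, validation
```

Because the assignment depends only on the key value, the split is deterministic and no key appears in both sets. The fraction applies to keys rather than rows, so the resulting row ratio is only approximate.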

Sometimes I've had the problem that the validation and training files do not have the exact same columns. I am not sure if anything can be done about it in Model Builder, just FYI.
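A mismatching-columns warning of the kind mentioned earlier in the thread could be produced by a check along these lines. This is only a sketch under assumptions: the function name `column_mismatch_warnings` and the message wording are hypothetical, and it compares column names only, not types.

```python
from typing import List, Sequence


def column_mismatch_warnings(
    train_columns: Sequence[str], validation_columns: Sequence[str]
) -> List[str]:
    """Return human-readable warnings about column differences (empty if none)."""
    warnings: List[str] = []
    train_set = set(train_columns)
    validation_set = set(validation_columns)

    # Columns the model was trained on but the validation data lacks.
    missing = [c for c in train_columns if c not in validation_set]
    # Columns present only in the validation data.
    extra = [c for c in validation_columns if c not in train_set]

    if missing:
        warnings.append("Validation dataset is missing columns: " + ", ".join(missing))
    if extra:
        warnings.append("Validation dataset has unexpected columns: " + ", ".join(extra))
    if not missing and not extra and list(train_columns) != list(validation_columns):
        warnings.append("Columns match but appear in a different order")
    return warnings
```

For example, `column_mismatch_warnings(["a", "b", "c"], ["a", "b"])` returns a single warning naming `c` as missing, which a UI could show next to the validation dataset preview.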