Open torronen opened 3 years ago
@torronen Thanks for your advice. The dataset sharing plan we decided to use is through Azure Dataset, but we are still working on it, and it will take a while before it's available. Azure Dataset will help resolve the security/authentication concern and dataset sharing between local and cloud training. Sharing through a URI is also a feasible solution; the only downside is that users take on the risk of downloading from a possibly unknown source.
@beccamc @briacht This sounds like a workaround for sharing datasets until Azure Dataset is ready? It also sounds like a perfect solution for scenarios like online tutorials and sharing inside teams.
@beccamc Can you take a look at it if you have time?
I've also considered this as a workaround until Datasets are ready. We need to prioritize this. I think Antti brings up a great point: this would be useful for samples/tutorials even after datasets are implemented.
Is your feature request related to a problem? Please describe. A multi-location or remote team using Git may have datasets stored in different locations, which prevents using the .mbconfig file for retraining without editing the JSON. The dataset may also be updated and then need to be sent to all team members. Ideally, there would be a way to request the latest dataset when starting training.
Describe the solution you'd like One way would be to allow putting a URL in the file path (file source?) of the JSON file.
Model Builder would then download the dataset from that location upon start. It would allow the data preparation team to simply update one location with the latest file, and all locations would always use the latest version. Since a copy of the original dataset is made from the origin anyway, I suppose it could just as well be downloaded from an online location.
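To make the idea concrete, here is a minimal sketch in Python (not Model Builder's actual implementation) of how a training tool could treat the dataset location: if it looks like an http(s) URL, download it to a local folder first; otherwise use it as an ordinary file path. The names `is_remote`, `resolve_dataset`, and `cache_dir` are hypothetical and only for illustration.

```python
import os
import urllib.request
from urllib.parse import urlparse


def is_remote(location: str) -> bool:
    """Return True if the dataset location looks like an http(s) URL."""
    # A Windows path such as C:\data\train.csv parses with scheme "c",
    # so it is correctly treated as local.
    return urlparse(location).scheme in ("http", "https")


def resolve_dataset(location: str, cache_dir: str,
                    fetch=urllib.request.urlretrieve) -> str:
    """Return a local path for the dataset, downloading it first if needed.

    `fetch` is injectable so the download step can be replaced or tested
    without network access (an assumption of this sketch).
    """
    if not is_remote(location):
        return location  # ordinary local path: use as-is

    os.makedirs(cache_dir, exist_ok=True)
    file_name = os.path.basename(urlparse(location).path) or "dataset"
    local_path = os.path.join(cache_dir, file_name)

    # Always re-download so every team member trains on the latest file.
    fetch(location, local_path)
    return local_path
```

A caller would pass whatever string is stored as the file path in the .mbconfig JSON, for example `resolve_dataset(path_from_config, "datasets_cache")`, and then train from the returned local path.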
A bonus would be a simplification of ML.NET tutorials because users could just download the .mbconfig file and just run it.
Additional context This might also allow the creation of "training agents" by MS or the community. An agent could just get an .mbconfig file and train the model from it.
It should be considered what to do with the file after the training has finished. It might be a good idea to delete it, or to ask the user whether it should be deleted, so that disk space is not consumed by something the user did not explicitly download.
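One possible way to handle the cleanup, sketched in Python under the assumption that the tool knows which local file it downloaded: wrap training in a context manager that removes the file afterwards unless a callback says to keep it. `downloaded_dataset` and `keep` are hypothetical names; `keep` could be wired to a "Delete the downloaded dataset?" prompt.

```python
import os
from contextlib import contextmanager


@contextmanager
def downloaded_dataset(local_path: str, keep=lambda path: False):
    """Yield the dataset path for training, then clean it up.

    `keep` is a callback that receives the path and returns True if the
    file should stay on disk (for example, after asking the user).
    By default the file is deleted.
    """
    try:
        yield local_path
    finally:
        # Runs whether training succeeded or raised an exception.
        if os.path.exists(local_path) and not keep(local_path):
            os.remove(local_path)
```

Usage would look like `with downloaded_dataset(path) as p: train(p)`, where `train` stands in for whatever starts the training run.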
If .mbconfig files were shared freely, it could be a security concern, as an .mbconfig could request to download malicious files. Still, I think the benefits outweigh the risks, at least for data science teams, but the risks may be greater for teams or developers with less need for dataset sharing. Potentially, showing a dialog such as "This file requests to download .....Make sure it is trusted location ... OK?" could be enough.
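The confirmation step could look roughly like the following Python sketch. It assumes a hypothetical per-user set of trusted hosts (`trusted_hosts`) that skips the prompt, and it asks for confirmation for anything else; the function name and the prompt wording are illustrative only.

```python
from urllib.parse import urlparse


def confirm_download(url: str, trusted_hosts: set, ask=input) -> bool:
    """Return True if the dataset at `url` may be downloaded.

    Hosts in `trusted_hosts` are allowed without a prompt. For any other
    host the user is asked, and only an explicit "y"/"yes" is accepted.
    `ask` is injectable so a GUI dialog could replace the console prompt.
    """
    host = urlparse(url).hostname or ""
    if host in trusted_hosts:
        return True

    answer = ask(
        f"This file requests to download {url}. "
        "Make sure it is a trusted location. Continue? [y/N] "
    )
    return answer.strip().lower() in ("y", "yes")
```

Defaulting to "no" on an empty or unrecognized answer keeps the risk low for users who open an .mbconfig from an unknown source, while a team can add its own storage host to the trusted set to avoid repeated prompts.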