dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml

Pretrained DNN Image Featurization #1232

Closed vaeksare closed 5 years ago

vaeksare commented 5 years ago

Support for a DNN Image Featurizer Transform is to be added to ML.NET. This will allow users to use one of four pretrained DNN models (ResNet18, ResNet50, ResNet101, and AlexNet), trained on ImageNet, to featurize an input image.

This transform will use the ONNX Transform as its backbone, handling the input preprocessing and applying the pretrained DNN model.
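
As a rough illustration of the data flow this transform implements (all names below are placeholders for the sake of the sketch, not the proposed API), the featurizer wraps model-specific preprocessing around a scoring call into the pretrained ONNX model:

```csharp
// Illustrative sketch only: shows the featurization flow, not the actual ML.NET API.
using System;

public static class DnnImageFeaturizerSketch
{
    // 1) Resize the image to the input size the pretrained network expects,
    // 2) extract (and normalize) the pixel values,
    // 3) run the pretrained ONNX model and use the output of its last feature
    //    layer as the image's feature vector.
    public static float[] Featurize(byte[] imageBytes, Func<float[], float[]> onnxModel)
    {
        // 224x224 is a typical ImageNet input size; the exact size depends on the model.
        float[] pixels = PreprocessToPixels(imageBytes, width: 224, height: 224);
        return onnxModel(pixels); // e.g. a 512-dimensional vector for ResNet18
    }

    private static float[] PreprocessToPixels(byte[] imageBytes, int width, int height)
    {
        // Placeholder for resizing and pixel extraction; the real transform would chain
        // the existing image resizing and pixel extraction transforms before the ONNX step.
        return new float[width * height * 3];
    }
}
```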

vaeksare commented 5 years ago

@markusweimer

wschin commented 5 years ago

Do we want to have those pre-trained models as separate NuGets? I'd expect their sizes are not small.

mdabros commented 5 years ago

@wschin From a user's point of view, I would prefer to have each pre-trained model in a separate NuGet package, both to minimize the size of the packages and to have a separate "assembly" version for each model, so you can easily track whether a pre-trained model has been updated. Changes/updates to the pre-trained models can have a significant impact on the ML products built on top of them.

justinormont commented 5 years ago

I would recommend placing the pre-trained models on the CDN and downloading them on first use. This is the method used elsewhere in ML.NET.

We should consider how/where to store the original models, the ONNX conversion script, and instructions, so that a user can re-create the ONNX files. Perhaps we can start another repo (e.g. https://github.com/dotnet/machinelearning-resources/) to store these larger files in a more accessible form.

markusweimer commented 5 years ago

@justinormont, I'd like to understand the CDN idea a bit better. When would the models be downloaded from the CDN? I can see a couple of distinct times during the creation of a pipeline:

I don't see a huge win for a CDN over the NuGet approach. Either way, the model should become part of the app after compilation and training. Downloading resources in an app from a CDN is problematic, as the app developer loses control over that part of their app.

Zruty0 commented 5 years ago

@vaeksare, are you currently working on this? If yes, please assign it to yourself.

vaeksare commented 5 years ago

At this point, the NuGet approach seems like the best way, with an individual NuGet created for each model (and the baseline transform code residing in the OnnxTransform NuGet, since that is the minimum required to run it, and the actual transform code is small enough that it doesn't warrant its own separate distribution). @Zruty0, you mentioned extension methods earlier; do we believe those are the best way to add functionality for each additional model?

Zruty0 commented 5 years ago

Yes :)

vaeksare commented 5 years ago

Actually, I have a slight concern regarding the extension method approach. Currently, if we stick to how other similar transforms work (and how this transform worked in the past), the model to use would be passed into the constructor of the transform as an argument. If we use that approach again, I don't believe the implementation can be done using extension methods. Rather, the user would need to download the correct NuGet containing that transform in order for the model argument that is passed in to work.

Zruty0 commented 5 years ago

As far as I understand, currently the 'model to use' is passed as an enum to the constructor of the transform.

Since we cannot have 'extension enums' in the language, but we do have extension methods, the suggestion was to change the creation interface. For example, before it was:

new DnnImageFeaturizer(model: DnnImageModel.ResNet) // DnnImageModel is an enum

and now it could be

new DnnImageFeaturizer(m => m.ResNet()) // here, ResNet() is an extension to some 'model selector' subclass
// or
new DnnImageFeaturizer().ResNet() // here, ResNet() is an extension to DnnImageFeaturizer itself
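
To make the second option concrete, a minimal self-contained sketch of the extension-method pattern is below. The type names (DnnImageModelSelector, ResNet18DnnImageModel) and the string-returning factory are hypothetical simplifications for illustration, not the proposed API:

```csharp
// Hypothetical sketch of the extension-method pattern (simplified, not the real API).
// The core package ships the featurizer and an empty "model selector"; each model NuGet
// adds one extension method to the selector, so no shared enum is needed.
using System;

public sealed class DnnImageModelSelector { }   // lives in the core transform package

public sealed class DnnImageFeaturizer
{
    public DnnImageFeaturizer(Func<DnnImageModelSelector, string> modelFactory)
    {
        // In this toy version the factory just returns a model path; in the real
        // transform it would build the preprocessing + ONNX scoring chain instead.
        ModelPath = modelFactory(new DnnImageModelSelector());
    }

    public string ModelPath { get; }
}

// Shipped by a separate ResNet18 package: extends the selector with one model.
public static class ResNet18DnnImageModel
{
    public static string ResNet18(this DnnImageModelSelector selector)
        => "ResNet18Onnx/ResNet18.onnx";
}

public static class Program
{
    public static void Main()
        => Console.WriteLine(new DnnImageFeaturizer(m => m.ResNet18()).ModelPath);
}
```

Adding ResNet50 later would then just mean shipping another NuGet with its own ResNet50() extension method, with no change to the core featurizer.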

vaeksare commented 5 years ago

There is a slight issue with the NuGet approach. Normally, we store all of our NuGet sources on GitHub, but these models are too large to be posted to the ML.NET GitHub repo (and large enough that they can't be posted to GitHub at all without being compressed). They could be uploaded manually to NuGet, but that does not seem like an ideal solution. Any thoughts? Also see my PR #1447 for the current proposed implementation.

markusweimer commented 5 years ago

Can we host the models on Azure and pull them in during the build process when preparing the nugets?

vaeksare commented 5 years ago

Had another discussion with @eerhardt regarding different possible implementations of this.

Hosting the models on Azure and pulling them in during the build process is possible (in fact, this is what we do for TensorFlow), but it would make our CI builds slower because the models would have to be downloaded on every CI run. An alternative would be to move them into a new repo altogether (so they are not part of our CI), but then we would have to remember to update them whenever changes that affect the transform are made to ML.NET.

Alternatively, I have also investigated the CDN approach further. With CDN, the models would be downloaded (also from Azure) the first time a pipeline is trained using a specific model. While they are not automatically part of the app afterwards, the app developer can still choose to ship them with the app to avoid the download on the client side. So this actually gives developers more control: they can either bundle the models with the app (making the app bigger) or let the download happen client side.

The big thing we lose with the CDN approach is versioning. However, these specific models are rather old and have been thoroughly tested and used for some time now, so I do not believe they are likely to change anytime soon. New models can be added, but the existing ones are unlikely to change. And we can always append version numbers to the filenames to guarantee proper versioning for users.

So after digging into this more, I believe the CDN approach is the better fit, and it follows what we do for other similar transforms in ML.NET, but I would like to hear whether others agree with my assessment.
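
For reference, here is a minimal sketch of the "download on first use" behavior the CDN approach implies. The URL, cache folder, and method names are made up for illustration; the real code would go through ML.NET's existing resource-download utilities:

```csharp
// Illustrative sketch of download-on-first-use; not the actual ML.NET resource code.
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

public static class ModelResourceSketch
{
    private static readonly HttpClient Http = new HttpClient();

    // Returns a local path to the model, downloading it from the CDN only if it
    // is not already present locally (either shipped with the app or cached earlier).
    public static async Task<string> EnsureModelAsync(string fileName, string cdnBaseUrl, string cacheDir)
    {
        string localPath = Path.Combine(cacheDir, fileName);
        if (File.Exists(localPath))
            return localPath; // bundled with the app or previously downloaded: no network call

        Directory.CreateDirectory(cacheDir);
        byte[] bytes = await Http.GetByteArrayAsync(cdnBaseUrl + fileName);
        File.WriteAllBytes(localPath, bytes);
        return localPath;
    }
}
```

An app developer who wants no runtime downloads would simply place the model file in that local folder (or ship it in the project output) ahead of time.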

markusweimer commented 5 years ago

> With CDN, the models would be downloaded (also from Azure) the first time a pipeline is trained using a specific model.

An alternative is that the NuGet we produce does that at install time, and also registers the models for inclusion in the project output folder. Conversely, uninstalling those NuGets would undo it. There used to be a mechanism in NuGet to run scripts at install/uninstall time. Does that still exist, and is it supported?

justinormont commented 5 years ago

One gain of the CDN approach is that all flavors of ML.NET { TLC, GUIs, NimbusML, AutoML, etc. } will automatically download the model files as needed.

This eases our overall effort in creating distributions. We won't need an independent strategy to distribute these flavors of ML.NET as tiny distributables. We simply exclude the models from each flavor's installation and, as needed, the models appear automatically when the user instantiates the featurizer that calls for them.

The main hiccup I see is that we need to better document the 'correct' locations for the user to place the files for production. The code documents this, but we should advertise it better: https://github.com/dotnet/machinelearning/blob/5e08fa1ea7bfb54f28ed0815cb6413e0068e6dd1/src/Microsoft.ML.Core/Utilities/PathUtils.cs#L36-L43

@markusweimer: Your idea of a shell NuGet sounds interesting. I would propose we do the CDN approach for the gains listed above, plus (what I think you are saying) a NuGet option which runs a script that calls the equivalent of dotnet MML.dll ShowData data=dummy.tsv xf=DNNImageFeaturizer{ dnnmodel=resnet101 }, which will cause the required resnet101 model to be downloaded and register the DNN model for inclusion in the project output folder.

mdabros commented 5 years ago

@vaeksare It seems you guys are going for the CDN approach. Will this result in no versioning at all for the models? I know that these models are "old and stable", but as I see it, this is no guarantee that they won't be updated at some point. Is there some alternative to the assembly version that can be used to identify the specific model with the CDN approach?

justinormont commented 5 years ago

@mdabros: Perhaps we can take a simple route to versioning like adding a version number to the CDN files, or adding file hash checking to EnsureResource(): https://github.com/dotnet/machinelearning/blob/c45089f614bc9665dff5e4b5c17c4e1c66854cb0/src/Microsoft.ML.Core/Utilities/ResourceManagerUtils.cs#L99

vaeksare commented 5 years ago

@mdabros I agree with Justin that primitive versioning can be achieved simply by adding a version number to the CDN file names. In essence, this is a large part of what NuGet does (though NuGet provides some other versioning guarantees that could be broken in this case). But given that these models are not expected to change much, I think adding a version number to them, and trusting that in the rare case they are updated the person doing the updating also bumps the version number, is sufficient. If mistakes are a big enough concern, we could also add the file hash checking that Justin suggests.
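
A sketch of what that hash check could look like (just an illustration of the idea, not the actual EnsureResource() change): the file name carries a version suffix, and the downloaded bytes are verified against a published SHA-256 before use.

```csharp
// Illustrative sketch: verify a downloaded, versioned model file against a known hash.
using System;
using System.IO;
using System.Security.Cryptography;

public static class ModelIntegritySketch
{
    // e.g. filePath = "ResNet18_v1.onnx", expectedSha256 = hash published alongside the file
    public static void VerifyModel(string filePath, string expectedSha256)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(filePath))
        {
            string actual = BitConverter.ToString(sha.ComputeHash(stream)).Replace("-", "");
            if (!actual.Equals(expectedSha256, StringComparison.OrdinalIgnoreCase))
                throw new InvalidDataException($"Model file '{filePath}' failed its hash check.");
        }
    }
}
```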

vaeksare commented 5 years ago

@markusweimer: Is the main concern with the CDN approach the lack of an automatic download at install time? If we make it clear in the documentation where they can include the model in the project output, isn't that preferable to forcing them to include it through the NuGet approach? As long as we are clear about how they can include it, this actually gives them more choice: they can make it part of their app or not. Or do you think the downsides of losing that automation outweigh the benefits of using a CDN?

markusweimer commented 5 years ago

My concern is that we introduce code that may make (seemingly) random HTTP calls from deployed apps. This concern stems from the observation that dev environments are usually far more lenient in terms of networking than deployment environments. As a result, many devs might write a perfectly working pipeline that passes unit tests and all, but fails in production with a networking error. To me, this seems like accidental complexity we should avoid.

I am not against giving devs the option to have the models separately downloaded from their apps. But the default behavior should be that all resources needed by the app are part of the app itself, not downloaded.

mdabros commented 5 years ago

@justinormont @vaeksare Thanks for taking the time to respond to my model-versioning concern. The most important thing, as I see it, is to make the developer of an ML product aware of changes to a pre-trained model provided by ML.NET, so the developer can make an active choice about whether to update to the latest version, knowing that this might require retraining the ML models built on top of it. The versioning methods you suggest should make this possible, so thanks for the clarification, even though it won't be quite as "automatic" as with the NuGet solution :-)