Over the years we've toyed with the idea of replacing the scheme with MEF but never quite got beyond the "yeah, might be nice some day" vagueness. @veikkoeeva mentioned the same idea in #136, actually. MEF is built around basically the same idea: multiple implementations of types associated with load names. It even has a so-called DirectoryCatalog for cases where we want to keep the current system, and allows other, more explicit schemes for cases where we don't.
So what do you think, MEF, or other?
My top concern is to make it possible to use ML.Core without heavy-weight dependency injection frameworks.
I do not have an opinion on which dependency injection framework is the right one. I believe that people typically like to have the option to bring their own.
I see this is your first issue. Welcome @jkotas! Sorry, I thought I understood the request, but I see I don't. The usage of some sort of dependency injection framework is a pretty central concept to how the entire system is able to work at all. Even in just the code that's in this repo so far, I see some hundreds of components advertising themselves as loadable components and roughly one hundred calls to the component catalog in one place or another to instantiate the corresponding components; and we haven't even published the command line or GUI interfaces yet. I mean, let's just start with model loading. What's your plan?
I do not have a good feel for the overall ML.Core state yet, so I do not have a good detailed answer to your question.
I think the problem is similar to ASP.NET Core: the lower layers (building blocks) of ASP.NET Core do not have a hard dependency on dependency injection frameworks. The dependency injection frameworks come into play only for the higher layers of ASP.NET Core, like MVC. If people want to use ASP.NET without paying the price for the dependency injection frameworks, they can still do that just fine.
@KrzysztofCwalina What do you think about the ML.Core being joined at the hip with a dependency injection framework (its own custom one currently)?
Ah OK. Well, again, welcome, and I hope you'll spend some time getting to know the structure of the code a bit more. Let any of us know if you have questions. While I'm sure many libraries (ASP.NET among them) don't benefit from dependency injection, this library does.
Yes, in general I worry that ML.NET has too many potentially independent technologies joined at the hip. I talked to @glebuk about it and we agreed to work on simplifying ML.Core.
ASP.NET and ML.NET are a bit different: the entire ML.NET section of the framework that deals with training and inference relies on having hundreds of independent components (trainers, transforms, loaders, loss functions and so forth). We want to have runtime component binding within the framework.
We should be able to drop a DLL and be able to use it, without having to recompile the common infra. The component catalog allows us to find and load needed components; it also allows us to customize a given ML deployment so that only the needed DLLs are placed.
Moreover, it is a component that allows runtime binding for higher-level APIs, such as GUIs, command line tools and language bindings.
Parts of the framework that do not deal with any components (IDV and related) do not need this; however, any time you create a pipeline or try to use any components in any way, you need to load and find them. This is at a pretty low level of the overall architecture. How should we achieve those goals if we don't have a DI/TLC Component Catalog?
The abstract concept of a component catalog is OK. It is https://en.wikipedia.org/wiki/Service_locator_pattern with all its advantages and disadvantages.
The part that is not good is the messy hardcoded algorithm that populates the component catalog. I think that the best way to fix it would be to add explicit methods to register sets of components that you would call at the start of the application. For example, the standard learners package could have a method like:
public static class StandardLearnersExtensions
{
    public static ComponentCatalog AddStandardLearners(this ComponentCatalog componentCatalog);
}
And one would register them by calling this method at the start of the application:
componentCatalog.AddStandardLearners()
If you would like to keep the model where all services that happen to be present in the app are registered, it should be opt-in, by calling an extension method like componentCatalog.AddComponentsByConvention() at the start of the application. And the implementation of AddComponentsByConvention should get the list of the assemblies that the application is composed from by calling proper APIs, not by enumerating files in a directory. The natural way to add .dlls to a .NET Core app is by adding their package reference to the app's .csproj file, not by manually dropping the .dll into the app directory.
Enumerating files in a directory and applying an ad-hoc filter to them is not compatible with .NET Core and other .NET app models. I would not include this option in ML.NET. If you really need to keep it around for existing users, I would move it into an obsoleted compat library that folks do not get by default.
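A minimal sketch of what such an opt-in could look like, assuming the explicit ComponentCatalog.Register(Assembly) API proposed later in this thread (the extension class below is illustrative, not an actual ML.NET API):

using System.Reflection;

public static class ConventionRegistrationExtensions
{
    // Opt-in, convention-based registration: walk the assemblies the app is composed from,
    // starting at the entry assembly, instead of enumerating files in a directory.
    public static ComponentCatalog AddComponentsByConvention(this ComponentCatalog componentCatalog)
    {
        Assembly entry = Assembly.GetEntryAssembly();
        if (entry == null)
            return componentCatalog; // some hosts have no entry assembly

        componentCatalog.Register(entry);

        // Direct references only; a fuller implementation would walk the dependency graph
        // (e.g. via the app's .deps.json) rather than stopping at the first level.
        foreach (AssemblyName reference in entry.GetReferencedAssemblies())
            componentCatalog.Register(Assembly.Load(reference));

        return componentCatalog;
    }
}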
If it helps, here is an example of how ASP.NET Core implements a similar concept: https://github.com/aspnet/Mvc-Private/blob/34e4fbf92df2d181b441bbbde598b68d4f33d8b4/src/Microsoft.AspNetCore.Mvc.Formatters.Json/DependencyInjection/MvcJsonMvcCoreBuilderExtensions.cs#L15 These extension methods return their argument to make it easy to chain them with other similar methods. Typical ASP.NET apps have code like this on the initialization path:
services
    .AddMvcCore()
    .AddJsonFormatters()
    .AddSignalR();
@jkotas @KrzysztofCwalina I've given some feedback on how the ComponentCatalog is causing pain for F# scripting here: https://github.com/dotnet/machinelearning/pull/600#issuecomment-409231443. It is also the root cause of https://github.com/dotnet/machinelearning/issues/401 which required the awkward workaround of forcing artificial loads of some referenced assemblies:
let _load =
    [ typeof<Microsoft.ML.Runtime.Transforms.TextAnalytics>
      typeof<Microsoft.ML.Runtime.FastTree.FastTree> ]
It is important that ML.NET work cleanly in conjunction with REPL scripting contexts (F# or C#, and future possible REPL-hosted scripting in workbooks etc.). The use of DI and service-locator patterns generally makes this more painful, at least in comparison to "normal" .NET libraries.
@Ivanidzo4ka and @eerhardt have replied with current thinking at https://github.com/dotnet/machinelearning/pull/600#issuecomment-409356035
We should be able to drop a DLL and be able to use it, without having to recompile the common infra. The component catalog allows us to find and load needed components; it also allows us to customize a given ML deployment so that only the needed DLLs are placed.
Moreover, it is a component that allows runtime binding for higher-level APIs, such as GUIs, command line tools and language bindings.
Parts of the framework that do not deal with any components (IDV and related) do not need this; however, any time you create a pipeline or try to use any components in any way, you need to load and find them. This is at a pretty low level of the overall architecture.
While these might have been reasonable problems to solve for TLC, my gut feeling is that these goals are not in scope for ML.NET, except perhaps for the specific "zip" scenario mentioned in https://github.com/dotnet/machinelearning/pull/600#issuecomment-409356035. No .NET Core application model allows a "drop-in DLL replacement" deployment, and if we did allow it, we would allow that kind of deployment update for a much broader range of applications.
ML.NET is a machine learning framework for .NET, not a dynamic loader or dependency resolver and/or DI framework.
I'd have different feedback about IDV: that is solving a schematization problem which definitely needs to be solved for a machine learning framework and which, to my knowledge, is not solved elsewhere in the .NET Core ecosystem. Ideally the solution would be extracted and used more widely.
Nothing brings more joy than throwing a few principals under the wheels of a running train. @Zruty0 @TomFinley I'm sure they have something to say :)
Nothing brings more joy than throwing a few principals under the wheels of a running train. @Zruty0 @TomFinley I'm sure they have something to say :)
Well, you're still only at version 0.3. The train just left the first station :)
Enumerating files in a directory and applying an ad-hoc filter to them is not compatible with .NET Core and other .NET app models. I would not include this option in ML.NET. If you really need to keep it around for existing users, I would move it into an obsoleted compat library that folks do not get by default.
I agree, this is much cleaner than what we have now.
The 'DI everywhere' approach is a remnant of our command line background and, as such, it should be consigned to a separate assembly loaded as a separate package (like Microsoft.ML.CommandLine or something).
The DLL scanner, I believe, should not be in the public domain at all.
We still need to address the need to load a model from a serialized file (the comment on #600 mentioned above).
One option is to check assembly.GetCustomAttributes(typeof(LoadableClassAttributeBase)) for all loaded assemblies that reference Microsoft.ML. Still not ideal, but it is better than scanning all types of all assemblies.

The main usages of the ComponentCatalog today:
Note: It is also being used in the PipelineInference code to do things like suggest sweepers and "recipes".
For the first three, you could imagine that the ComponentCatalog can sit on top of the "core" ML.NET library. However, usage #4 is a very core concept in ML.NET, thus the catalog cannot be separated from the core library.
One option to solve #4 is to instead write the fully-qualified .NET Type name into the model file. However, the drawback here is that ML.NET would then be locked into that .NET Type forever for that component. The namespace and class name of these types could not be changed without breaking existing models. Using a string "Name" allows ML.NET to change the Type in the future, as necessary. Additionally, solving #4 doesn't solve #1-3, so there still needs to be some sort of ComponentCatalog. We might as well embrace it as a core concept in ML.NET since it solves all usages.
Proposal #1: The core ML.NET library should stop enumerating/loading all assemblies in the directory where the ML.NET assembly exists. This is a policy that should be decided at a higher layer than the core library. ML.NET will continue to scan loaded assemblies for components to register in the catalog.
However, this can't just be done without fixing model files. If a model file contains a component that hasn't been registered in the ComponentCatalog, it will fail to load. We can solve this by writing the Assembly.FullName of the component into the model file. Then, when loading the model, we attempt to load the assembly (using the full assembly name, not Assembly.LoadFrom(path)), if it hasn't already been registered in the ComponentCatalog.
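A minimal sketch of that save/load handshake, assuming a Register(Assembly)-style API on the catalog; the helper class and method names are hypothetical, not the actual ML.NET model-format code:

using System.IO;
using System.Reflection;

internal static class ModelAssemblyTracking // hypothetical helper
{
    public static void WriteComponentAssembly(BinaryWriter writer, object component)
    {
        // Record the defining assembly's identity (name, version, public key token), not a file path.
        writer.Write(component.GetType().Assembly.FullName);
    }

    public static void EnsureComponentAssemblyRegistered(BinaryReader reader, ComponentCatalog catalog)
    {
        string fullName = reader.ReadString();

        // Assembly.Load resolves through the app's normal binding (TPA/.deps.json on .NET Core,
        // or a custom loader); Assembly.LoadFrom(path) is deliberately avoided.
        Assembly assembly = Assembly.Load(new AssemblyName(fullName));
        catalog.Register(assembly); // assumed to be a no-op if already registered
    }
}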
Proposal #2: Alternatively, we could move to a completely explicit registration process by removing both the loading of assemblies in the current directory and the scanning of loaded assemblies. Code external to the ComponentCatalog would be responsible for registering the components up front, before attempting to do anything that requires the ComponentCatalog (like loading a model). This could be done with a public API like ComponentCatalog.Register(Assembly), which would scan the assembly for components and register them. However, this would change the code users need to write, since they would now be responsible for ensuring the components are registered.
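To make the registration side concrete, here is a sketch of what a Register(Assembly) call could do internally, assuming components advertise themselves through assembly-level LoadableClass attributes (the RegisterComponent helper is an assumption, not the actual implementation):

using System.Reflection;

public static class ComponentRegistration // illustrative wrapper, not actual ML.NET code
{
    // Read only the assembly-level attributes of the one assembly we were handed:
    // no directory enumeration, no scanning of other loaded assemblies.
    public static void Register(ComponentCatalog catalog, Assembly assembly)
    {
        foreach (var attribute in assembly.GetCustomAttributes(typeof(LoadableClassAttributeBase), inherit: false))
        {
            // Hypothetical: the attribute carries the load name, signature and factory for one component.
            catalog.RegisterComponent((LoadableClassAttributeBase)attribute);
        }
    }
}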
We could have some sort of "automatic registration" process using something like static initializers/constructors. When a component Type gets initialized by .NET, it would register its assembly with the ComponentCatalog. This would not be without issues, though, because there will still be types/assemblies that haven't been loaded yet when attempting to load a model. So we could borrow from Proposal #1 above and write the Assembly.FullName into the model, to ensure it is loaded before creating the component.
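A sketch of what that automatic registration could look like for a single component; the component type is hypothetical and the Register(Assembly) call is the same assumed API as above:

public sealed class MyCustomTrainer // hypothetical component
{
    // Runs the first time the runtime touches this type, e.g. when user code news it up directly.
    // It does NOT run for types the app never references, which is exactly why model loading
    // still needs the Assembly.FullName fallback borrowed from Proposal #1.
    static MyCustomTrainer()
    {
        ComponentCatalog.Register(typeof(MyCustomTrainer).Assembly);
    }
}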
In the short term, my plan is to implement Proposal #1 above to get rid of the worst issues we have today. Proposal #2 can be built on top of Proposal #1 in the future, if we decide that scanning the loaded assemblies is no longer acceptable.
Thoughts? Feedback?
The set of loaded assemblies returned by AppDomain.GetAssemblies() is 100% deterministic. This API only returns assemblies that have been loaded so far.
This set is not fully deterministic; e.g., it can vary based on code optimizations: more aggressive inlining will cause more assemblies to be loaded. I have seen cases where a harmless JIT inliner change broke an app because it caused a different set of assemblies to be loaded and AppDomain.GetAssemblies() did not return what the app expected.
I think that explicit registration, scanning only what is explicitly registered, should be the baseline default model.
We can go with the explicit registration model, if that is acceptable to @TomFinley, @Zruty0, and @GalOshri. It will affect the public API and examples.
One option for the explicit registration model is to embrace the Environment class that ML.NET has, and move the ComponentCatalog from being a static class to being an instance member of Environment.
Then we can have extension methods (similar to ASP.NET above) off of Environment to build up the registrations. So, for example, a customer who is using some of the "standard" transforms, but is also using LightGBM (a learner that does not come standard), would write the following:
var env = new ConsoleEnvironment() // existing
    .AddStandardComponents()       // new necessary line
    .AddLightGBMComponents();      // new necessary line

// existing code that interacts with standard components and LightGBM - possibly loading a model file.
Underneath the covers, these extension methods would work like the following:
public static TEnvironment AddStandardComponents<TEnvironment>(this TEnvironment env)
    where TEnvironment : IHostEnvironment // or HostEnvironmentBase, whichever we decide to expose ComponentCatalog on
{
    env.ComponentCatalog.Register(typeof(TermTransform).Assembly);            // ML.Data
    env.ComponentCatalog.Register(typeof(CategoricalHashTransform).Assembly); // ML.Transforms
    env.ComponentCatalog.Register(typeof(SdcaMultiClassTrainer).Assembly);    // ML.StandardLearners
    return env;
}
And then LightGBM would define a similar extension method:
public static TEnvironment AddLightGBMComponents<TEnvironment>(this TEnvironment env)
    where TEnvironment : IHostEnvironment // or HostEnvironmentBase, whichever we decide to expose ComponentCatalog on
{
    env.ComponentCatalog.Register(typeof(LightGbmBinaryPredictor).Assembly);
    return env;
}
After chatting with @TomFinley and @Zruty0, we've come up with a proposal that we think will address everyone's concerns. It is basically a hybrid of the above 2 proposals.
1. Move the ComponentCatalog from being a static class to being an instance member on Environment. This has been a planned refactoring for ML.NET for a while, but hasn't been funded until now.
2. Remove all automatic assembly loading/scanning from the ComponentCatalog itself. It will have public APIs to register components, but will not do any registering itself, neither by loading assemblies from disk nor by scanning loaded assemblies.
3. During model saving, we will write the Assembly.FullName into the model file. We will then register that assembly with the env.ComponentCatalog when loading the model.
Under normal circumstances, API users won't have to explicitly register components with the ComponentCatalog. Using the API to train models won't require looking up components from a catalog - you just create .NET objects like normal. Loading a trained model from disk will register the components inside of it by loading the Assembly and scanning it for LoadableClass assembly attributes.
During model saving, we will write the Assembly.FullName into the model file. We will then register that assembly with the env.ComponentCatalog when loading the model
This means that loading a model can trigger execution of arbitrary code that happens to be lying on disk. That is a potential security issue if the model can come from an untrusted place. Is it a problem?
I think, in general, loading a model from an untrusted place is going to be a problem. I wouldn't recommend it.
As for the issue of loading the assembly using the assembly name, the assembly is going to have to be loadable in the app for it to be loaded (e.g. in .NET Core the TPA/.deps.json, or a custom loader). We won't load the assembly using a file path.
So you'd have to get the malicious assembly on the disk, get it loadable by the app, and then send a malicious model to it.
Under normal circumstances, API users won't have to explicitly register components with the ComponentCatalog.
What are the abnormal circumstances where an API user will have to register components?
What are the abnormal circumstances where an API user will have to register components?
Whenever they are using strings to reference components and their arguments and try to instantiate them. For example, using the ComponentCreation APIs:
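As a purely hypothetical illustration of why string-based creation needs the catalog populated up front (the CreateInstance call and the component name below are placeholders, not the actual ComponentCreation surface):

var env = new ConsoleEnvironment()
    .AddStandardComponents(); // without this, the string below cannot be resolved

// The component and its arguments are plain strings, so the only way to turn them into
// a .NET object is to look the name up in the ComponentCatalog at runtime.
var trainer = ComponentCreation.CreateInstance<ITrainer>(  // hypothetical factory call
    env, "FastTreeBinaryClassification", "numLeaves=20");  // load name + settings string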
What are the abnormal circumstances where an API user will have to register components?
Or when you want your app that references ML.NET to be IL Linker friendly.
Or when you want your app that references ML.NET to be IL Linker friendly.
A note to interested lurkers (like me): it looks to me like the use cases are constrained devices such as mobile phones and Raspberry Pis, but perhaps cloud deployments also. One large use case for linking could be WASM components, now that Blazor is even deployed into production.
https://github.com/dotnet/machinelearning/blob/master/src/Microsoft.ML.Core/ComponentModel/ComponentCatalog.cs
Does ML.NET really need its own dependency injection framework? Would it be worth looking at decoupling the dependency injection from the ML.Core, and ideally using one of the existing dependency injection frameworks instead of inventing yet another one?