dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

ComponentCatalog design issues #208

Closed jkotas closed 6 years ago

jkotas commented 6 years ago

https://github.com/dotnet/machinelearning/blob/master/src/Microsoft.ML.Core/ComponentModel/ComponentCatalog.cs

Does ML.NET really need its own dependency injection framework? Would it be worth looking at decoupling the dependency injection from the ML.Core, and ideally using one of the existing dependency injection frameworks instead of inventing yet another one?

TomFinley commented 6 years ago

So what do you think, MEF, or other?

Over the years we've toyed with the idea of replacing the scheme with MEF but never quite got beyond the "yeah, might be nice some day" vagueness. @veikkoeeva mentioned the same idea in #136, actually. MEF is built around essentially the same idea: multiple implementations of types associated with load names. It even has a so-called DirectoryCatalog for cases where we want to keep the current system, and allows other, more explicit schemes for cases where you don't.

jkotas commented 6 years ago

So what do you think, MEF, or other?

My top concern is to make it possible to use ML.Core without heavy-weight dependency injection frameworks.

I do not have an opinion on which dependency injection framework is the right one. I believe that people typically like to have the option to bring their own.

TomFinley commented 6 years ago

I see this is your first issue. Welcome, @jkotas! Sorry, I thought I understood the request, but I see I don't. The usage of some sort of dependency injection framework is a pretty central concept to how the entire system is able to work at all. Even in just the code that's in this repo so far, I see some hundreds of components advertising themselves as loadable components and roughly one hundred calls to the component catalog in one place or another to instantiate the corresponding components; and we haven't even published the command-line or GUI interfaces yet. I mean, let's just start with model loading. What's your plan?

jkotas commented 6 years ago

I do not have a good feel for the overall ML.Core state yet, so I do not have a good detailed answer to your question.

I think the problem is similar to ASP.NET Core: the lower layers (building blocks) of ASP.NET Core do not have a hard dependency on dependency injection frameworks. The dependency injection frameworks come into play only for the higher levels of ASP.NET Core, like MVC. If people want to use ASP.NET without paying the price of a dependency injection framework, they can still do that just fine.

@KrzysztofCwalina What do you think about ML.Core being joined at the hip with a dependency injection framework (its own custom one, currently)?

TomFinley commented 6 years ago

Ah, OK. Well, again, welcome, and I hope you'll spend some time getting to know the structure of the code a bit more. Let any of us know if you have questions. While I'm sure many libraries (ASP.NET among them) don't benefit from dependency injection, this library does.

KrzysztofCwalina commented 6 years ago

Yes, in general I worry that ML.NET has too many potentially independent technologies joined at the hip. I talked to @glebuk about it and we agreed to work on simplifying ML.Core.

glebuk commented 6 years ago

ASP.NET and ML.NET are a bit different - the entire section of the framework that deals with training and inference relies on having hundreds of independent components (trainers, transforms, loaders, loss functions and so forth). We want to have runtime component binding within the framework.

We should be able to drop in a DLL and be able to use it, without having to recompile the common infra. The component catalog allows us to find and load the needed components; it also allows us to customize a given ML deployment where only the needed DLLs are placed.

Moreover, it is a component that allows runtime binding for higher-level APIs, such as GUIs, command-line tools, and language bindings.

Parts of the framework that do not deal with any components (IDV and related) do not need this; however, any time you create a pipeline or try to use any components in any way, you would need to find and load them. This is at a pretty low level of the overall architecture. How should we achieve those goals if we don't have a DI/TLC Component Catalog?

jkotas commented 6 years ago

The abstract concept of a component catalog is OK. It is the service locator pattern (https://en.wikipedia.org/wiki/Service_locator_pattern), with all its advantages and disadvantages.

The part that is not good is the messy hardcoded algorithm that populates the component catalog. I think the best way to fix it would be to add explicit methods to register a set of components, which you would call at the start of the application. For example, the standard learners package could have a method like:

public static class StandardLearnersExtensions
{
    public static ComponentCatalog AddStandardLearners(this ComponentCatalog componentCatalog);
}

And one would register them by calling this method at the start of the application:

   componentCatalog.AddStandardLearners()

If you would like to keep the model where all services that happen to be present in the app are registered, it should be opt-in, by calling an extension method like componentCatalog.AddComponentsByConvention() at the start of the application. The implementation of AddComponentsByConvention should get the list of assemblies that the application is composed from by calling the proper APIs, not by enumerating files in a directory. The natural way to add .dlls to a .NET Core app is by adding their package reference to the app's .csproj file, not by manually dropping the .dll into the app directory.
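To make that opt-in shape concrete, here is a rough sketch (not a definitive implementation) of what such a convention-based helper might look like, assuming the instance-style catalog sketched above, a hypothetical Register(Assembly) method on it, and Microsoft.Extensions.DependencyModel as the "proper API" for enumerating the assemblies the application is composed from:

using System.Linq;
using System.Reflection;
using Microsoft.Extensions.DependencyModel;

public static class ConventionRegistrationExtensions
{
    // Hypothetical opt-in helper: registers every assembly the application is
    // composed from (per its deps.json) that references Microsoft.ML.Core,
    // instead of enumerating files in the application directory.
    public static ComponentCatalog AddComponentsByConvention(this ComponentCatalog componentCatalog)
    {
        foreach (AssemblyName name in DependencyContext.Default.GetDefaultAssemblyNames())
        {
            Assembly assembly = Assembly.Load(name);
            bool referencesMlCore = assembly.GetReferencedAssemblies()
                .Any(reference => reference.Name == "Microsoft.ML.Core");

            if (referencesMlCore)
                componentCatalog.Register(assembly); // assumed registration API, not in the current code
        }

        return componentCatalog;
    }
}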

Enumerating files in a directory and applying an ad-hoc filter to them is not compatible with .NET Core and other .NET app models. I would not include this option in ML.NET. If you really need to keep it around for existing users, I would move it into an obsoleted compat library that folks do not get by default.

jkotas commented 6 years ago

If it helps, here is an example of how ASP.NET Core implements a similar concept: https://github.com/aspnet/Mvc-Private/blob/34e4fbf92df2d181b441bbbde598b68d4f33d8b4/src/Microsoft.AspNetCore.Mvc.Formatters.Json/DependencyInjection/MvcJsonMvcCoreBuilderExtensions.cs#L15 These extension methods return their argument to make it easy to chain them with other similar methods. Typical ASP.NET apps have code like this on the initialization path:

services
    .AddMvcCore()
    .AddJsonFormatters()
    .AddSignalR();

dsyme commented 6 years ago

@jkotas @KrzysztofCwalina I've given some feedback on how the ComponentCatalog is causing pain for F# scripting here: https://github.com/dotnet/machinelearning/pull/600#issuecomment-409231443. It is also the root cause of https://github.com/dotnet/machinelearning/issues/401 which required the awkward workaround of forcing artificial loads of some referenced assemblies:

let _load = 
    [ typeof<Microsoft.ML.Runtime.Transforms.TextAnalytics>
      typeof<Microsoft.ML.Runtime.FastTree.FastTree> ]

It is important that ML.NET work cleanly in conjunction with REPL scripting contexts (F# or C#, and possible future REPL-hosted scripting in workbooks etc.). The use of DI and service-locator patterns generally makes this more painful, at least in comparison to "normal" .NET libraries.

@Ivanidzo4ka and @eerhardt have replied with current thinking at https://github.com/dotnet/machinelearning/pull/600#issuecomment-409356035

We should be able to drop in a DLL and be able to use it, without having to recompile the common infra. The component catalog allows us to find and load the needed components; it also allows us to customize a given ML deployment where only the needed DLLs are placed.

Moreover, it is a component that allows runtime binding for higher-level APIs, such as GUIs, command-line tools, and language bindings.

Parts of the framework that do not deal with any components (IDV and related) do not need this; however, any time you create a pipeline or try to use any components in any way, you would need to find and load them. This is at a pretty low level of the overall architecture.

While these might have been reasonable problems to solve for TLC, my gut feeling is that these goals are not in scope for ML.NET, except perhaps for the specific "zip" scenario mentioned in https://github.com/dotnet/machinelearning/pull/600#issuecomment-409356035. No .NET Core application model allows "drop-in DLL replacement" deployment, and if we did, we would be allowing that kind of deployment update for a much broader range of applications.

ML.NET is a machine learning framework for .NET, not a dynamic loader or dependency resolver and/or DI framework.

I'd have different feedback about IDV: that is solving a schematization problem which definitely needs to be solved for a machine learning framework and which, to my knowledge, is not solved elsewhere in the .NET Core ecosystem. Ideally the solution would be extracted and used more widely.

Ivanidzo4ka commented 6 years ago

Nothing brings more joy than throwing a few principals under the wheels of a running train. @Zruty0 @TomFinley I'm sure they have something to say :)

dsyme commented 6 years ago

Nothing brings more joy than throwing a few principals under the wheels of a running train. @Zruty0 @TomFinley I'm sure they have something to say :)

Well, you're still only at version 0.3. The train just left the first station :)

Zruty0 commented 6 years ago

Enumerating files in a directory and applying an ad-hoc filter to them is not compatible with .NET Core and other .NET app models. I would not include this option in ML.NET. If you really need to keep it around for existing users, I would move it into an obsoleted compat library that folks do not get by default.

I agree, this is much cleaner than what we have now.

The 'DI everywhere' approach is a remnant of our command-line background, and, as such, it should be consigned to a separate assembly loaded as a separate package (like Microsoft.ML.CommandLine or something).

The DLL scanner, I believe, should not be part of the public surface area at all.

We still need to address the need to load a model from a serialized file (the comment on #600 mentioned above).

eerhardt commented 6 years ago

ComponentCatalog Design Issues

Problems (from Jan Kotas)

  1. Enumerates all types in all loaded assemblies. This pattern is known to have poor performance characteristics that lead to long startup time.
  2. Enumerates assemblies in the application directory. This is not compatible with the .NET Core app model. The app assemblies are not guaranteed to be in the application directory in .NET Core (e.g. they can be in one of the shared frameworks or in the assembly cache), or they may not exist at all (single-file .exes planned for .NET Core, or .NET Native used for UWP apps).
  3. A long list of hardcoded names to skip while enumerating.
  4. Does ML.NET really need its own dependency injection framework? Would it be worth looking at decoupling the dependency injection from the ML.Core, and ideally using one of the existing dependency injection frameworks instead of inventing yet another one?

Discussion

Responses to the above problems:

  1. It doesn't enumerate all types in all loaded assemblies. Instead, it enumerates assembly.GetCustomAttributes(typeof(LoadableClassAttributeBase)) for all loaded assemblies that reference Microsoft.ML. Still not ideal, but it is better than scanning all types of all assemblies.
  2. I fully agree this needs to change. We can't have this kind of policy this low in the library. If higher level tools (command-line, UI, etc) want to make policies about where/how to load assemblies, they can. But the core ML.NET library cannot have this policy.
  3. Same as above.
  4. This may be splitting hairs, but technically I don't view the ComponentCatalog as a dependency injection framework. It doesn't actually inject dependencies into a graph of objects. Instead, it is simply a catalog of components that can be instantiated. Granted, this is part of what DI frameworks provide (a catalog of components), but DI frameworks do much more, e.g. lifetime policies and injecting objects into graphs of objects. ML.NET has some specific requirements that make it hard to use the existing DI frameworks. One example is that there is code that doesn't want just any old instance of a "foo" component. It wants a new "foo" component with these specific arguments passed to it. And then a few lines later, it wants another "foo" component with different arguments passed to it. And then later it wants a "bar" component with a completely different set of types of arguments. I haven't seen this kind of support in existing DI frameworks in the BCL. Typically, existing DI frameworks expect the constructor parameters to be fulfilled by other components in the catalog, not to be given imperatively by the caller who is requesting the component. I feel it is reasonable for ML.NET to have its own "Component Catalog" system for creating components.
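To illustrate the distinction, here is a deliberately simplified, hypothetical sketch (not the actual ComponentCatalog API) of a catalog keyed by string load names where the caller supplies construction arguments imperatively:

using System;
using System.Collections.Generic;

// Toy illustration only: components are looked up by a string load name and
// created with arguments the caller supplies at the call site, rather than
// being resolved from other registered services the way a typical DI container works.
public sealed class ToyComponentCatalog
{
    private readonly Dictionary<string, Func<object[], object>> _factories =
        new Dictionary<string, Func<object[], object>>();

    public void Register(string loadName, Func<object[], object> factory)
        => _factories[loadName] = factory;

    public T Create<T>(string loadName, params object[] args)
        => (T)_factories[loadName](args);
}

// Usage: the same "foo" component is created twice with different arguments,
// and a "bar" component with a completely different argument type.
// FooComponent and BarComponent are made-up names for illustration.
//
// var catalog = new ToyComponentCatalog();
// catalog.Register("foo", args => new FooComponent((int)args[0]));
// catalog.Register("bar", args => new BarComponent((string)args[0]));
// var fooSmall = catalog.Create<FooComponent>("foo", 10);
// var fooLarge = catalog.Create<FooComponent>("foo", 1000);
// var bar = catalog.Create<BarComponent>("bar", "data.txt");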

Usages of the ComponentCatalog

The main usages of the ComponentCatalog today:

  1. In GUI types of scenarios, where the GUI presents a list of available components to the user.
  2. In the MAML command-line, where the command line uses a string "Name" to reference the component it intends on creating/using.
  3. In the "Entry Points" subsystem, which allows for non-.NET languages to interop with ML.NET (using a JSON document, which uses a string "Name" to reference the component).
  4. In the saving of "model" files, it writes a string "Name" to reference the component to load when the model is loaded.

Note: It is also being used in the PipelineInference code to do things like suggest sweepers and "recipes".

For the first three, you could imagine that the ComponentCatalog can sit on top of the "core" ML.NET library. However, usage #4 is a very core concept in ML.NET, thus the catalog cannot be separated from the core library.

One option to solve #4 is to instead write the fully-qualified .NET Type name into the model file. However, the drawback here is that ML.NET would then be locked into that .NET Type forever for that component. The namespace and class name of these types could not be changed without breaking existing models. Using a string "Name" allows ML.NET to change the Type in the future, as necessary. Additionally, solving #4 doesn't solve #1-3, so there still needs to be some sort of ComponentCatalog. We might as well embrace it as a core concept in ML.NET since it solves all usages.

Proposal

  1. The core ML.NET library should stop enumerating/loading all assemblies in the directory where the ML.NET assembly exists. This is a policy that should be decided at a higher layer than the core library. ML.NET will continue to scan loaded assemblies for components to register in the catalog. However, this can't just be done without fixing model files. If a model file contains a component that hasn't been registered in the ComponentCatalog, it will fail to load. We can solve this by writing the Assembly.FullName of the component into the model file. Then when loading the model, we attempt to load the assembly (using the full assembly name, not Assembly.LoadFrom(path)), if it hasn't already been registered in the ComponentCatalog.

  2. Alternatively, we could move to a completely explicit registration process by removing both loading assemblies in the current directory and scanning the loaded assemblies. Code external to the ComponentCatalog would be responsible for registering the components up front before attempting to do anything that required the ComponentCatalog (like loading a model). This could be done with a public API like ComponentCatalog.Register(Assembly), which would scan the assembly for components and register them. However, this would change the code users need to write, since they would now be responsible for ensuring the components are registered. We could have some sort of "automatic registration" process using something like static initializers/constructors. When a component Type gets initialized by .NET, it would register its assembly with the ComponentCatalog. This would not be without issues though, because there will still be types/assemblies that haven't been loaded yet when attempting to load a model. So we could borrow from proposal #1 above, and write the Assembly.FullName into the model, to ensure it is loaded before creating the component.

In the short-term my plan is to implement Proposal #1 above to get rid of the worst issues we have today. Proposal #2 can be built on top of Proposal #1 in the future, if we decide that scanning the loaded assemblies is no longer acceptable.

Thoughts? Feedback?

jkotas commented 6 years ago

loaded assemblies

The set of loaded assemblies returned by AppDomain.GetAssemblies() is not 100% deterministic. This API only returns assemblies that have been loaded so far.

This set can vary based on code optimizations: more aggressive inlining will cause more assemblies to be loaded. I have seen cases where a harmless JIT inliner change broke an app because it caused a different set of assemblies to be loaded and AppDomain.GetAssemblies() did not return what the app expected.

I think that explicit registration, scanning only what has been explicitly registered, should be the baseline default model.

eerhardt commented 6 years ago

We can go with the explicit registration model, if that is acceptable to @TomFinley, @Zruty0, and @GalOshri. It will affect the public API and examples.

One option for the explicit registration model is to embrace the Environment class that ML.NET has, and move the ComponentCatalog from being a static class to being an instance member of Environment.

Then we can have extension methods (similar to ASP.NET above) off of Environment to build up the registrations. So, for example, a customer who is using some of the "standard" transforms but is also using LightGBM (a learner that does not come standard) would write the following:

var env = new ConsoleEnvironment()  // existing
    .AddStandardComponents()        // new necessary line
    .AddLightGBMComponents();       // new necessary line

// existing code that interacts with standard components and LightGBM - possibly loading a model file.

Underneath the covers, these extension methods would work like the following:

public static TEnvironment AddStandardComponents<TEnvironment>(this TEnvironment env)
    where TEnvironment : IHostEnvironment  // or HostEnvironmentBase, whichever we decide to expose ComponentCatalog on
{
    env.ComponentCatalog.Register(typeof(TermTransform).Assembly);  // ML.Data
    env.ComponentCatalog.Register(typeof(CategoricalHashTransform).Assembly);  // ML.Transforms
    env.ComponentCatalog.Register(typeof(SdcaMultiClassTrainer).Assembly);  // ML.StandardLearners
    return env;
}

And then LightGBM would define a similar extension method:

public static TEnvironment AddLightGBMComponents<TEnvironment>(this TEnvironment env)
    where TEnvironment : IHostEnvironment  // or HostEnvironmentBase, whichever we decide to expose ComponentCatalog on
{
    env.ComponentCatalog.Register(typeof(LightGbmBinaryPredictor).Assembly);
    return env;
}

eerhardt commented 6 years ago

New Proposal

After chatting with @TomFinley and @Zruty0, we've come up with a proposal that we think will address everyone's concerns. It is basically a hybrid of the above 2 proposals.

  1. We will move ComponentCatalog from being a static class to being an instance member on Environment. This has been a planned refactoring for ML.NET for a while, but hasn't been funded until now.
  2. We will completely remove any implicit scanning for components in ComponentCatalog itself. It will have public APIs to register components, but will not do any registering itself - neither by loading assemblies from disk, nor by scanning loaded assemblies.
  3. Other subsystems (like the GUI, command-line, Entry Points, and model loading) will be responsible for registering the components they require in the manner they require.
  4. During model saving, we will write the Assembly.FullName into the model file. We will then register that assembly with the env.ComponentCatalog when loading the model.
    • For any model that was saved with a previous version of ML.NET and loaded using the API, the caller will need to explicitly register the components before loading the model. (Or they will need to save the model again with a current version of ML.NET that writes the assembly names.)

Under normal circumstances, API users won't have to explicitly register components with the ComponentCatalog. Using the API to train models won't require looking up components from a catalog - you just create .NET objects like normal. Loading a trained model from disk will register the components inside of it by loading the Assembly and scanning it for LoadableClass assembly attributes.
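For illustration, the model-loading side of step 4 might look roughly like the following, assuming the instance ComponentCatalog and the Register(Assembly) API sketched earlier in this thread (the helper name is illustrative, not a final API):

using System.Reflection;

public static class ModelLoadingHelpers
{
    // Ensure the assembly recorded in the model file is loaded and its
    // LoadableClass components are registered before the model's string
    // "Name" references are resolved.
    public static void EnsureComponentAssemblyLoaded(IHostEnvironment env, string assemblyFullName)
    {
        // Load by full assembly name, resolved through the app's normal
        // binding rules (TPA/.deps.json), never via Assembly.LoadFrom(path).
        Assembly assembly = Assembly.Load(new AssemblyName(assemblyFullName));

        env.ComponentCatalog.Register(assembly); // assumed API from the proposal above
    }
}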

jkotas commented 6 years ago

During model saving, we will write the Assembly.FullName into the model file. We will then register that assembly with the env.ComponentCatalog when loading the model

This means that loading a model can trigger execution of arbitrary code that happens to be lying on disk. It is a potential security issue if the model can come from an untrusted place. Is it a problem?

eerhardt commented 6 years ago

I think, in general, loading a model from an untrusted place is going to be a problem. I wouldn't recommend it.

As for the issue of loading the assembly using the assembly name, the assembly is going to have to be loadable in the app for it to be loaded (e.g. in .NET Core, via the TPA/.deps.json or a custom loader). We won't load the assembly using a file path.

So you'd have to get the malicious assembly onto the disk, get it loadable by the app, and then send a malicious model to it.

GalOshri commented 6 years ago

Under normal circumstances, API users won't have to explicitly register components with the ComponentCatalog.

What are the abnormal circumstances where an API user will have to register components?

eerhardt commented 6 years ago

What are the abnormal circumstances where an API user will have to register components?

Whenever they are using strings to reference components and their arguments and try to instantiate them. For example, using the ComponentCreation APIs:

https://github.com/dotnet/machinelearning/blob/a8cd341f8c03985aaea216e8c911c466ad020b1e/src/Microsoft.ML.Api/ComponentCreation.cs#L280-L289

jkotas commented 6 years ago

What are the abnormal circumstances where an API user will have to register components?

Or when you want your app that references ML.NET to be IL Linker friendly.

veikkoeeva commented 6 years ago

Or when you want your app that references ML.NET to be IL Linker friendly.

A note to interested lurkers (like me): it looks to me like the use cases are constrained devices such as mobile phones and Raspberry Pis, but perhaps cloud deployments as well. One large use case for linking could be WASM components, now that Blazor is even being deployed to production.