dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
9.05k stars 1.89k forks source link

What namespaces to keep, what to drop, and what to hide #2326

Closed sfilipi closed 5 years ago

sfilipi commented 5 years ago

In the table below there are all the namespaces that are in ML.Net. They also all display in the docs site. Some of them, like Microsoft.ML.Ensemble needs to continue to exist in the code, but it is not ready to be exposed to the users, and can be hidden in the docs site.

Some others, like Microsoft.ML.Internal.Internallearn can potentially be merged into other namespaces. Let's annotate in the list below the namespaces that need to be hidden in the docs site, and the ones that need to be gone altogether:

Namespace Drop it from code Hide it from docs Ships In Nuget
Microsoft.ML Microsoft.ML
Microsoft.ML.Calibrator Microsoft.ML
Microsoft.ML.Core.Data drop
Microsoft.ML.Data Microsoft.ML
Microsoft.ML.Data.IO drop?
Microsoft.ML.Data.IO.Zlib drop
Microsoft.ML.Ensemble Microsoft.ML.Trainers.Ensemble hide None
Microsoft.ML.Ensemble.EntryPoints Microsoft.ML
Microsoft.ML.Ensemble.OutputCombiners Microsoft.ML.Trainers.Ensemble hide None
Microsoft.ML.Ensemble.Selector Microsoft.ML.Trainers.Ensemble hide None
Microsoft.ML.Ensemble.Selector.DiversityMeasure Microsoft.ML.Trainers.Ensemble hide None
Microsoft.ML.EntryPoints Microsoft.ML
Microsoft.ML.EntryPoints.JsonUtils drop?
Microsoft.ML.FactorizationMachine Microsoft.ML.Trainers.FactorizationMachine Microsoft.ML
Microsoft.ML.ImageAnalytics Microsoft.ML.ImageAnalytics
Microsoft.ML.ImageAnalytics.EntryPoints Microsoft.ML.ImageAnalytics
Microsoft.ML.Internal.Calibration Microsoft.Ml.Calibrator hide Microsoft.ML
Microsoft.ML.Internal.CpuMath hide Microsoft.ML
Microsoft.ML.Internal.Internallearn drop?
Microsoft.ML.Internal.Internallearn.ResultProcessor drop? None
Microsoft.ML.Learners Microsoft.ML.Trainers Microsoft.ML
Microsoft.ML.LightGBM Microsoft.ML.Trainers.LightGBM Microsoft.ML.LightGBM
Microsoft.ML.Model Microsoft.ML.Onnx , Microsoft.ML
Microsoft.ML.Model.Onnx Microsoft.ML.Onnx
Microsoft.ML.Numeric drop?
Microsoft.ML.SamplesUtils Microsoft.ML
Microsoft.ML.StaticPipe Microsoft.ML.StaticPipe
Microsoft.ML.StaticPipe.Runtime drop?
Microsoft.ML.Sweeper hide Microsoft.ML
Microsoft.ML.Sweeper.Algorithms hide Microsoft.ML
Microsoft.ML.TimeSeries Microsoft.ML.TimeSeries
Microsoft.ML.TimeSeriesProcessing drop?
Microsoft.ML.Tools hide?
Microsoft.ML.Trainers Microsoft.ML
Microsoft.ML.Trainers.FastTree Microsoft.ML
Microsoft.ML.Trainers.FastTree.Internal Microsoft.ML.Trainers.FastTree
Microsoft.ML.Trainers.HalLearners Microsoft.ML.Trainers.HalLearners
Microsoft.ML.Trainers.KMeans Microsoft.ML
Microsoft.ML.Trainers.Online Microsoft.ML
Microsoft.ML.Trainers.PCA Microsoft.ML
Microsoft.ML.Trainers.Recommender
Microsoft.ML.Trainers.SymSgd Microsoft.ML.Trainers.HalLearners
Microsoft.ML.Training drop
Microsoft.ML.Transforms
Microsoft.ML.Transforms.Categorical
Microsoft.ML.Transforms.Conversions
Microsoft.ML.Transforms.FeatureSelection
Microsoft.ML.Transforms.Normalizers
Microsoft.ML.Transforms.Projections
Microsoft.ML.Transforms.TensorFlow
Microsoft.ML.Transforms.Text
Microsoft.ML.UniversalModelFormat.Onnx Microsoft.ML.Model.Onnx
TomFinley commented 5 years ago

My surface reactions.

Regarding the namespaces that are reserved mostly for what we think of as classical "training" aalgorithms, we tended to put them in Microsoft.ML.Trainers, from what I see. (E.g.: Microsoft.ML.Trainers.FastTree).

Microsoft.ML.UniversalModelFormat.Onnx should become Microsoft.ML.Model.Onnx I think.

Since Microsoft.ML.StaticPipe stuff now resides in a separate nuget, it might be appropriate to move it to the Microsoft.ML namespace. However this is of lower priority since that associated nuget will not be stable for v1 I expect. (Might still be nice, since that should be easy to do I think.)

I am not sure that I see the point in the seemingly elaborate hierarchy of Transforms namespaces (e.g., categorical, conversions, feature-selection, normalizers), but I do not feel strongly about it. The negative thing I see about this is that the inevitable result is that the "odd-ball" transformers that are in fact less commonly used wind up being uncategorizable and so put in the regular Transforms namespace. But I do not feel strongly about this.

My expectation is that anything with EntryPoints in the name except for the thing for running/executing entry-points would stay. E.g., Microsoft.ML.EntryPoints would be public, but anything else would be internal. So that renders what I am about to say less urgent, but I would still do it: I do not, for example, see the point in there being a Microsoft.ML.Ensemble.EntryPoints namespace. (We do not elsewhere have the static classes where entry-point definitions reside, reside in a separate namespace, any more than we have the command-line invocation logic live in a separate namespace, etc. etc.)

Microsoft.ML.Trainers.FastTree.Internal should just become the regular namespace. If something really is internal, the classes and types should just be internal. Nothing to do with the namespace.

I do not understand why we have "Microsoft.ML.TimeSeries" and "Microsoft.ML.TimeSeriesProcessing". Anyway, it seems like these belong in one of either the "Transforms" or "Trainers" style namespace, probably Transforms. (So Microsoft.ML.Transforms.TimeSeries.)

Microsoft.ML.SamplesUtils contains a single class. I feel like these things, if we want to keep them, should be extensions on DataOperationsCatalog. I also don't understand what return value being a string means. This whole API is just problematic, not just namespace choice.

To be consistent with trainers, all the stuff in Microsoft.ML.Ensemble should be moved to Microsoft.ML.Trainers.Ensemble. This includes all those strange subnamespaces that seem to exist purely for the sake of communicating, "hey, look at me, the types in me have a different type hierarchy than other types elsewhere, isn't that super cool?" Which is a fairly silly use for namespaces. 😛

Microsoft.ML.Tools, from what I see this should not be (and as far as I can tell doesn't) contain public classes that ship as part of any nuget.

Microsoft.ML.Numeric seems to contain purely stuff used for LBFGS training. I guess i don't see the harm in having it have its own namespace maybe, but it should all be rendered internal, since the guts of our optimizer is hardly part of our public surface.

Microsoft.ML.Learners, should be moved to some appropriate choice of "trainers." Probably the same with so-called "InternalLearn".

Microsoft.ML.Internal.Internallearn.ResultProcessor. Goofy. Nothing in result processor lives in any public surface shipping as part of a nuget, so might as well give it the namespace Microsoft.ML.ResultProcessor.

Microsoft.ML.Calibrator, whatever happened to the idea of making calibration another data transform step?

TomFinley commented 5 years ago

Note that while namespace changes are fine, I feel like the first order of business really must be to choose what we put in Microsoft.ML root namespace itself. There are lots of things that are definitely in this category and already in there (MLContext and extension methods), but lots of things also that are not yet there that probably should be (e.g., ITrainerEstimator, IEstimator, ITransformer, and friends).

sfilipi commented 5 years ago

We have both Microsoft.ML.Trainers and using Microsoft.ML.Learners namespaces. I think in the blogs/announcements we refer to them mostly as Learners. Drop the Microsoft.ML.Trainers in favor of Microsoft.ML.Learners?

sfilipi commented 5 years ago
      We have both Microsoft.ML.Trainers and using Microsoft.ML.Learners namespaces. I think in the blogs/announcements we refer to them mostly as Learners.  Drop the Microsoft.ML.Trainers in favor of Microsoft.ML.Learners?

Converging on Microsoft.ML.Trainers instead, since that seems to be the namespace the majority of trainers live in.

cc @CESARDELATORRE @JRAlexander @luisquintanilla

JRAlexander commented 5 years ago

I'd prefer Microsoft.ML.Algorithms.

JRAlexander commented 5 years ago

It would match https://dotnet.microsoft.com/learn/machinelearning-ai/what-is-mldotnet. "Choose Algorithm"

TomFinley commented 5 years ago

I'd prefer Microsoft.ML.Algorithms.

I would not. We have trainers, transforms, predictors... if some random note written by someone happened to use such a generic term as "algorithm," and it is apparently having the ability to affect how people think about such things, that is what must be changed.

TomFinley commented 5 years ago

Regarding Microsoft.ML.Data.IO, I think everything here should be hidden, it can probably be moved into Microsoft.ML.Data if we care. I personally might keep it where it is though, since it's very, very specific to binary loaders/savers.

JRAlexander commented 5 years ago

Ok, then what about Microsoft.ML.LearningAlgorithms? Again it would match the terms we are already using on the marketing site... and are industry standard...

Choose Algorithm Choose the learning algorithm that will provide the highest accuracy for your scenario. ML.NET offers the following types of learners:

Linear (e.g. SymSGD, SDCA) Boosted Trees (e.g. FastTree, LightGBM) K-Means SVM Averaged Perceptron

eerhardt commented 5 years ago

FYI -

  1. We still have public types in ML.Trainers and ML.Training namespaces. These should be consolidated. -- #2713 has been logged to track this.
  2. We still have Microsoft.ML.Internal.Calibration and Microsoft.ML.Internal.Internallearn, which should be hidden/moved/renamed. -- #2714 has been logged to track this.
  3. we also have 1 type in:
namespace Microsoft.ML.Data.Evaluators.Metrics {
    public sealed class AnomalyDetectionMetrics {
        public double Auc { get; }
        public double DrAtK { get; }
    }
}
  1. we also have 2 public types in:
    namespace Microsoft.ML.Model {
    public interface ICanSaveModel {
        void Save(ModelSaveContext ctx);
    }
    public sealed class ModelSaveContext : IDisposable {
        public void Dispose();
    }
    }

    This isn't enough types (IMO) for it's own namespace.