dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

ML.NET command line tool #1203

Closed Zruty0 closed 5 years ago

Zruty0 commented 5 years ago

I think we should consider making the ML.NET command-line tool an actual first-class citizen.

A bit of insider knowledge: we already have a command-line tool, which is a port of maml.exe. You can launch it as follows (this shows the 'help' command, also available as the ? command).

dotnet .\bin\AnyCPU.Debug\Microsoft.ML.Console\netcoreapp2.1\MML.dll ?

Generally, the syntax is dotnet MML.dll <command> <arguments>, and the current list of commands includes things like train, CV, showdata, etc. (the full list is available via dotnet MML.dll ? kind=command).

This command-line tool is actually very powerful (although the language is clunky).

We could easily expand the command-line tool to handle common programming sub-tasks, like:

justinormont commented 5 years ago

I wholeheartedly agree with this proposal.

What would it take to make the command line a first-class citizen?

Work topics I see:

Additional background information: This is the language used in many of our unit tests and benchmarks.

Zruty0 commented 5 years ago

The biggest problems with the command-line tool are that:

As far as I'm concerned, in order to make it a first-class citizen, we need to shrink the functionality, not expand :)

For instance, our kool kustom subcomponents should become JSONs (and ideally reside in a file, not in the command line).

justinormont commented 5 years ago

What kind of kool kustom sub-components would you convert to JSONs?

I'd cut down the lesser-used functionality. This could range from larger full concepts/components down to small syntax options, like the various ways to specify input/output columns.

Because the tool was designed on Windows, we hit oddities on OSX/Linux like brace expansion. MAML has a braces+commas syntax. Bash's brace expansion rewrites xf=Concat{col=Features:A,B,C} to xf=Concatcol=Features:A xf=ConcatB xf=ConcatC before the command reaches our syntax parser, which then errors. This can be avoided by putting double quotes around the whole command, or by putting the command in a .rsp file.
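To make the brace-expansion problem concrete, here is a minimal sketch that uses echo as a stand-in for the real tool, so it runs without ML.NET installed. It assumes bash is on the PATH; the variable names are only for illustration. Each echo is run through bash explicitly, since some other shells do not perform brace expansion at all.

```shell
# Stand-in demo: echo prints the arguments exactly as the tool would receive them.

# Unquoted: bash expands the braces+commas before the command ever runs.
unquoted=$(bash -c 'echo xf=Concat{col=Features:A,B,C}')

# Double-quoted: the argument is passed through literally.
quoted=$(bash -c 'echo "xf=Concat{col=Features:A,B,C}"')

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

The unquoted form arrives as three separate, mangled arguments, while the quoted form arrives as the single argument the MAML parser expects.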

Thoughts on the command-line from a data science perspective

Likes:

Dislikes:

eerhardt commented 5 years ago

I totally agree a command line tool is super useful/valuable.

Do you consider this a "must have" in v1? Or is it something that could be productized in a version after the initial release?

artidoro commented 5 years ago

Adding @GalOshri to the conversation.

jwood803 commented 5 years ago

I would love to help with this. 😄

@Zruty0 @justinormont @eerhardt Is this already in the repository or is it something that will be released later?

eerhardt commented 5 years ago

@jwood803 - That would be great.

The command line tool is already in this repository. https://github.com/dotnet/machinelearning/tree/master/src/Microsoft.ML.Maml

One place that could use some help/attention is to create a .NET Global Tool out of the console application.

Zruty0 commented 5 years ago

@jwood803 it's already in the repository.

I think doing what @eerhardt suggested would be a great first step. After that, I would really love to have these things happen:

1) Update codegen command to correctly handle model files trained by ML.NET (as opposed to those trained via train command), and generate the scoring code that is in line with the latest API.
2) Take a stab at documenting our command line language. @justinormont can give some really bizarre 20-line command lines to demonstrate the power of it :)

justinormont commented 5 years ago

Here's one example for classification of GitHub issues:

dotnet MML.dll TrainTest test=corefx_issues_TEST.tsv eval=MultiClassClassifierEvaluator{topkacc=3 nccf=40 opcs=+} data=corefx_issues_TRAIN.tsv loader=TextLoader{quote=- sparse=- col=ID:TX:0 col=Label:TX:1 col=Title:TX:2 col=Description:TX:3 header=+} xf=Term{col=Label} xf=CopyColumns{col=Name:ID} xf=TextTransform{col=FeaturesTextTitle:Title tokens=+ wordExtractor=NGramExtractorTransform{ngram=2} charExtractor=NGramExtractorTransform{ngram=4 all=-}} xf=TextTransform{col=FeaturesTextDescription:Description tokens=+ wordExtractor={}} xf=Concat{col=FeaturesText:FeaturesTextTitle,FeaturesTextDescription} xf=TrainScore{tr=LightGBMMulticlass feat=FeaturesText} xf=Concat{col=FeaturesTrainScoreLGBMOnNGrams:Score} xf=WordEmbeddingsTransform{col=FeaturesWordEmbOnDesc:FeaturesTextDescription_TransformedText model=GloVe50D} xf=TrainScore{tr=OVA{p=AveragedPerceptron{iter=10}} feat=FeaturesWordEmbOnDesc} xf=Concat{col=FeaturesTrainScoreAPOnWordEmbOnDesc:Score} xf=Concat{col=Features:FeaturesTrainScoreLGBMOnNGrams,FeaturesTrainScoreAPOnWordEmbOnDesc,FeaturesWordEmbOnDesc,FeaturesText} tr=OVA {p=AveragedPerceptron{iter=10}} out={c:\output\tutorial3\05-LearnerSweep-TrainScoreOnNGrams,_TrainScoreOnWordEmbGloVe50DOnDesc,_FeatsFromNGramsAndWordEmb\0.model.zip}

The DAG of components looks like:

[Image: DAG of components]

@Zruty0 -- apologies, this one isn't even bizarre

TomFinley commented 5 years ago

Somehow I missed this until reviewing @jwood803's PR #1620.

> I think we should consider making ML.NET command-line tool an actual first-class citizen.

@Zruty0, the phrase first class citizen implies there is no intrinsic difference in power and fundamental privilege between two groups of things -- in the case of this programmatic model, this seems to suggest no difference between what you can do and the level of support you can expect. Assuredly there are differences in this case -- .NET/CLR is fundamentally more expressive than the command line could ever be. Indeed, we even had another "citizen" called entry-points that is assuredly superior to the command line, and the misguided insistence that the .NET API should be no more powerful than that was what led to the API that everyone "enjoyed" in ML.NET v0.1. 😏

Sorry if it seems pedantic, but I feel we must not use the phrase first class citizen since that carries with it explicit expectations. At best it is a third class citizen... the .NET API (first) must fundamentally be superior to the entry-points (second) which must fundamentally be superior to the command line (third), which must be superior to the GUI (fourth).

I would probably adopt the following attitude towards it. Let's not make it terribly inconvenient to use, but at the same time, it should not be a first class citizen.

codemzs commented 5 years ago

We already have a CLI for AutoML.