voltcode closed this issue 5 years ago
This is something we are exploring. Would you be able to share what kind of integration you are looking for or the scenario you have in mind?
It is fairly common that when functionality built around data processing, including machine learning, becomes successful, users ask for customizability that lets them experiment without rebuilding the product.
R and Python, as scripting languages with a plethora of high-quality models and libraries, are perfect for this: scripts don't require recompiling, they are popular among non-IT people (in contrast to C#), etc.
So using ML.NET, I want to define pipelines that have some steps defined in R or Python, providing a default model implementation or some data transformations. Next, I want to give users the ability to open the source of this model, tweak it, and upload it back; the script may have dependencies on other R or Python libraries.
In order to do this efficiently, the integration should avoid copying the data if possible (Apache Arrow, maybe?), run either in or out of process, etc.
Let me know if you need further justification for R and Python integration.
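To make the scenario above concrete, here is a minimal sketch of a host loading a user-editable Python transform step. The `transform(rows)` contract, the script contents, and the loading mechanism are all hypothetical illustrations of the extensibility idea, not anything ML.NET provides; only the standard library is used.

```python
import importlib.util
import os
import tempfile

# The user-editable script: a default data transformation (min-max
# normalization of column "x") that a user could tweak and re-upload.
USER_SCRIPT = """
def transform(rows):
    values = [r["x"] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [dict(r, x=(r["x"] - lo) / span) for r in rows]
"""

def load_user_step(path):
    """Load a user-supplied script file and return its transform callable."""
    spec = importlib.util.spec_from_file_location("user_step", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.transform

# Host side: persist the script, load it, and run it as one pipeline step.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(USER_SCRIPT)
    script_path = f.name
try:
    step = load_user_step(script_path)
    normalized = step([{"x": 0.0}, {"x": 5.0}, {"x": 10.0}])
finally:
    os.remove(script_path)

print([r["x"] for r in normalized])  # → [0.0, 0.5, 1.0]
```

Because the step lives in a plain script file, swapping in a tweaked version requires no recompilation of the host, which is exactly the appeal of R/Python steps described above.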
Why wouldn't you use C# scripting for that? Is it because you want to use existing R and Python libraries in those pipelines?
@svick yes. The ecosystem of R and Python libraries is huge and, in my opinion, beyond catching up with. Even in Azure, R and Python building blocks that provide extensibility are present (Power BI, Azure ML, etc.). SQL Server itself can now be extended with R.
C# is not popular at all outside IT; nowadays a trader or an analyst may know a bit of R or Python but will not find C# familiar. Using C# for extensibility would go against all trends in data science.
Now imagine that we have a transform that would run some arbitrary R/Python code. How would that functionality work during inference with model deployment? Would you expect the model to require R/Python and all required prerequisites? If you already have to maintain the R/Python deployment, what would be the advantage of using C# in this scenario vs. pure R/Python?
In short: yes. First of all, code using ML.NET would be part of a larger service or services written in .NET, so there would already be some deployment involved. Dependencies can be packaged using Docker or similar, or just installed once via an MSI, etc. I understand a service could be written in pure R/Python, but taking this line of thinking a bit too far, we could render .NET useless ;) A challenge in this interoperability is reducing the amount of memory copying; however, smart integration with Apache Arrow could bring exactly what's needed (and open up other integrations in the future!)
In my practice, I've encountered mostly cases where tuning and customization were more important on the learning/model-building side, not on the deployment side. Someone would build a model, deploy it, experiment for a couple of days, tweak the learning again, etc. Furthermore, the customization code could play a supporting role that does not need to take part in inference, for example data normalization or feature extraction for training, whereas production features would already be properly extracted. Sometimes, for production use, the Python/R script would be replaced with an optimized version of the empirically discovered combination.
I've seen demos of how well R can work inside the new SQL Server, so my guess is that a seamless and efficient integration is achievable, especially given the R expertise Microsoft gained through the Revolution Analytics acquisition ;) It's much easier to make deployment smoother than to win from scratch the mindshare that R and Python have already achieved.
There's now a base implementation of Apache Arrow in .NET: https://github.com/apache/arrow/tree/master/csharp
Maybe it can help steer the discussion about interop: standardized, high-performance column storage that also works with GPUs if necessary could be a great start.
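To illustrate the zero-copy idea behind Arrow-style interop, here is a stdlib-only sketch: a producer writes a column of float64 values into a named shared-memory segment, and a consumer attaches to the same segment by name and reads the values in place, with no serialization step in between. A real integration would of course use Apache Arrow's columnar format rather than this raw buffer.

```python
import struct
from multiprocessing import shared_memory

values = [1.5, 2.5, 3.5, 4.5]  # a "column" of float64 values

# Producer: place the column into a named shared-memory segment.
shm = shared_memory.SharedMemory(create=True, size=8 * len(values))
struct.pack_into(f"{len(values)}d", shm.buf, 0, *values)

# Consumer: attach to the same segment by name; unpack_from reads the
# shared buffer directly, so no intermediate copy of the segment is made.
view = shared_memory.SharedMemory(name=shm.name)
column = struct.unpack_from(f"{len(values)}d", view.buf, 0)
print(column)  # → (1.5, 2.5, 3.5, 4.5)

view.close()
shm.close()
shm.unlink()
```

In a cross-process R/Python/.NET pipeline, the consumer would typically be a different process entirely; only the segment name crosses the boundary, not the data.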
I wanted to let people in this thread know about our release of NimbusML! This project provides experimental Python bindings for ML.NET. It has a Pythonic API for all of ML.NET's functionality and integrates with scikit-learn. You'll find more information on the GitHub page.
@artidoro how are you doing interop between Python and .NET?
@artidoro great stuff, although I think .net developers would love the opposite integration even more! Do you plan to develop a set of bindings to use python scikit, etc. from .net?
I am going to include @montebhoover in the conversation for more details on NimbusML.
@denfromufa NimbusML allows you to build/train/score ML.NET models from Python. Under the hood it is the same code. What I think is interesting is that you can save a model trained in Python using NimbusML and load it back into ML.NET in C#. I think some people are interested in exploration/training in Python and deployment in C#.
The other interesting aspect, which we will soon describe in more detail, is that ML.NET generally has better performance than scikit-learn, both in terms of prediction accuracy for models with default hyperparameter settings and in terms of training time.
@voltcode I agree with your suggestions! I think there is some experimental work on this topic. I will definitely update the thread when I have a better idea on the direction it is taking.
@denfromufa In particular, we are using ML.NET's Entry Points API to call ML.NET components from Python. The Entry Points API allows a user working in a non-.NET language to describe a call to an ML.NET estimator or transformer in JSON format and pass the JSON to ML.NET for execution. So in NimbusML we embed the ML.NET binaries in a Python package, expose a Python API that constructs these estimator/transformer JSONs, and call the ML.NET binaries via extension modules to execute the constructed JSONs.
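The JSON-graph idea described above can be sketched with nothing but the standard library. The component names and argument keys below are illustrative rather than the exact ML.NET entry-point schema; the point is that a non-.NET caller only needs to build JSON and hand it across the language boundary for execution.

```python
import json

def concat_columns_node(output_col, input_cols):
    """Build one hypothetical 'concatenate columns' transform node."""
    return {
        "Name": "Transforms.ColumnConcatenator",
        "Inputs": {"Column": [{"Name": output_col, "Source": input_cols}]},
    }

# A two-node graph: concatenate feature columns, then train a classifier.
graph = {
    "nodes": [
        concat_columns_node("Features", ["Age", "Fare"]),
        {
            "Name": "Trainers.LogisticRegressionBinaryClassifier",
            "Inputs": {"LabelColumn": "Survived", "FeatureColumn": "Features"},
        },
    ]
}

# The graph must round-trip through JSON cleanly, since serialized text is
# what gets handed to the embedded ML.NET binaries for execution.
payload = json.dumps(graph, indent=2)
assert json.loads(payload) == graph
```

Since the interchange format is plain JSON, the same approach works from any language with a JSON library, which is why R bindings could reuse it.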
@voltcode So this same Entry Points process could be used to create R bindings for ML.NET components if someone were interested in that. As far as data transfer goes, in most cases we simply pass pointers to in-memory data or to file streams. However, when a NimbusML user wants to take a returned ML.NET IDataView and manipulate it as a Pandas DataFrame, we do copy it in memory.
I'd be curious if either of you have thoughts on efficiency gains there.
What are the top use cases you're thinking of for calling python from .NET? Do you think users would be most interested in combining components from both sklearn and ML.NET, or in using auxiliary libraries like matplotlib to do data visualization?
Shameless plug here: pythonnet enables interop with .NET in both directions and is already used by some projects at Microsoft. I believe there are equivalents for R too. Regarding data efficiency and avoiding copies for pandas DataFrames and NumPy arrays, the Apache Arrow direction is definitely worth considering.
Also, how is it that NimbusML has only 93 stars but over 3000 watchers!!!
Actually, here is an ML.NET integration with Python based on pythonnet by @sdpython
Will this library offer R and Python integration? Where is it on the roadmap? What kind of data transfer library/format will it use: Apache Arrow? Something else?
It is important for solution architects to understand how ML.NET is going to fit into the big data picture; this is necessary these days, given that Java, R, and Python are dominant in this space.