Plan for updating FsLab

fslaborg / zzarchive-FsLab

A collection of packages for data science with F#

http://fslab.org

Plan for updating FsLab #137

Closed dsyme closed 3 years ago

dsyme commented 6 years ago

Now that @zyzhu has started updating Deedle to netstandard 2.0 we should look at updating the whole FsLab collection

@zyzhu - do you use FsLab as an entire collection or just the individual pieces?

dsyme commented 6 years ago

Here are some stream-of-consciousness thoughts on steps to update FsLab:

[ ] Update FsLab packages to latest
[ ] Make FsLab support netstandard 2.0 (so you can reference FsLab in a netstandard 2.0 project)
[ ] Make sure the package is usable with .NET SDK projects
[ ] Revamp load scripts to account for package locations
[ ] Update template and documentation to use dotnet new

Also:

[ ] Check experience and tutorials on Mac, VSCode for Mac, VS for Mac
[ ] Look again at FSharp.Charting - do we really want to use Gtk etc.
[ ] Revive the twitter account http://twitter.com/fslaborg
[ ] add Python interop to the mix (no type provider, just basic interop and samples)
[ ] consider whether machine learning should be added to the mix (obviously it's needed, just does FsLab want to take a preferred view on the different options)

Links to various parts of this as they get done:

[x] Math.NET Numerics is being made .NET Standard 2.0 (packages seem available, e.g. 4.5.1)
[x] FSharp.Data supports NET Standard 2.0
[x] Deedle supports .NET Standard 2.0
[x] RProvider dependencies updated
[x] update links in FsLab template
[x] add CI support for the package and template
[x] XPlot and .NET Standard 2.0
[ ] Release updated RProvider packages
[ ] Release updated Deedle packages
[ ] Release updated FSharp.Data packages
[ ] Release updated FsLab package and template
[ ] RProvider supports .NET Standard 2.0 (note: not started, is this possible?)
[ ] FSharp.Charting and .NET Standard 2.0 (note: not started, is this possible?)

zyzhu commented 6 years ago

@dsyme I only use Deedle and XPlot. Data is consumed via Dapper. I prototype all my research via FSI. I do hope dotnet core support for FSI will come soon, as I refresh the issue every day :)

I used to rely heavily on the SqlClient type provider, but it is not supported on dotnet core. I do use the latest SignalR so that Dapper can mitigate the problem.

I feel we should discuss the bigger picture of F# in data science, rather than limiting ourselves to updating the current libraries to netstandard. @tpetricek

Among the FsLab packages, I found FSharp.Formatting to be a weak link. I experimented with it before, but I found it not very productive to use, as it requires some boilerplate and templates in html/razor. The documentation and samples were not in good shape either.

For a data scientist or a quant in finance, I would generalize the workflow as data retrieval -> computation -> visualization. A Jupyter notebook is the de facto place to start, as almost zero boilerplate is required. The notebook can easily be rerun to replicate research step by step. On top of that, a dynamically typed language is very easy to get started with for people with a stats/math background. That's partly why Python has built up an amazing ecosystem.

Though I strongly prefer statically typed languages, I cannot deny the productivity of the Python ecosystem. That productivity comes at a cost in speed, scalability, type robustness and integration. In practice, though, most research teams are willing to accept these long-term costs in exchange for a boost in short-term productivity at the idea-generation stage.

In order to grow a community using F# in data science, avoiding boilerplate is the first prerequisite for boosting productivity. The productivity gained from the FSharp.Data type providers cannot make up for the cost of preparing boilerplate to plot results. I found IfSharp quite useful, as I can run all my F# scripts and visualize results using XPlot easily. The scripting and debugging experience is not as great as in Visual Studio and FSI, but I just need to copy and paste my script into IfSharp and use it as a notebook to record and visualize results. Most of the time it works right away.

We should improve the experience of FsLab on IfSharp as a priority by providing more documentation. It took me a while to dig through various issues to get a Deedle frame printed. Having to dig through these issues will discourage newcomers to F#. More sample Azure notebooks on various topics to educate the community would be very useful. @cgravill

Another promising path is a polyglot Jupyter notebook such as BeakerX by Two Sigma. It has just released version 1.0 with built-in two-way autotranslation. There was an old Beaker F# kernel. Maybe it could be ported so that F# can coexist with all the other languages and ecosystems in the same notebook. Then more libraries and visualizations would be at hand. @aolney https://github.com/twosigma/beakerx/issues/5039

Including an ML library such as Accord.Net/TensorFlowSharp is a good idea. But I am not an expert on it. Maybe @mathias-brandewinder has some good suggestions.

I would also include another optimization library, Google.OrTools. They plan to release its F# library targeting netstandard in the next version soon. It would serve another group of users, those with linear optimization problems. I've compiled its F# library for use in production and found its F# examples very elegant. https://github.com/google/or-tools/issues/722

TonyHenrique commented 6 years ago

I don't know if I missed something, but I feel that F# needs to support complex XSD -> XML generation. It is used heavily here in Brazil by the government for sales and medical data, and it would be good to have an easy way to get type safety when generating XML from our data using the XSD schemas provided by the government.

See https://github.com/fsprojects/FSharp.Data.Xsd/issues/26#issue-336825128

dsyme commented 6 years ago

It's a great discussion, please continue, all the comments are enlightening.

Note the list of work items above is not meant to be comprehensive and is a bit stream-of-consciousness.

My take is that FsLab should be a collection of packages which "work together and you don't regret". That is, the packages should

[ ] be useful for data science (but not necessarily a complete set of packages for every eventuality - you might need to add more)
[ ] work cross-platform (including .NET Core)
[ ] work with F# Interactive (including on .NET Core when it is done)
[ ] work in iFSharp Jupyter notebooks
[ ] be well-scoped, i.e. do what they say on the tin, and not more or less
[ ] have relatively few bugs
[ ] have an active maintainer
[ ] be well-documented
[ ] be accepting contributions
[ ] not interfere with the use of alternative packages
[ ] together they should not be "too large"
[ ] be usable as independent components if necessary ("not a big fur-ball that is all or nothing")
[ ] be usable in both data scripting and compiled code

Basically you want to be able to "add an FsLab reference" and do some data-science workbook programming, whether that be in Visual Studio, VS Code, Jupyter notebooks or whatever.

Equally you should be able to back out of using FsLab and just use individual packages with the same effect.

Machine learning packages for .NET are a little tricky for FsLab. The more complete ones like Accord.NET (which is great) tend to be a complete set of packages in their own right (which is also great). Other packages like ML.NET are a little too early to include. So in general I agree

Interop packages like RProvider, Python, MATLAB provider, Excel provider etc. are tricky too. On the one hand these are incredibly useful when they are needed and work, and can benefit from regular integration and use with other components. They are also sometimes painful to get working first time and people sometimes shy away from them. On the other hand they are a source of considerable complexity and documenting them can be tricky.

Note that one approach would be to abandon FsLab as an "integrated" package and simply document the choices and how to get started with them

Finally, FsLab today takes a very strong approach to literate programming, and I agree with @zyzhu that FSharp.Formatting is a bit of a weak link. I need to understand better where we should end up here.

jackfoxy commented 6 years ago

I think I'm close to having an XPlot netstandard2.0 PR https://github.com/jackfoxy/XPlot/tree/magicmode It builds in VS, but I'm getting a strange error with Newtonsoft.Json not being recognized in the build target of the build script.

dsyme commented 6 years ago

@jackfoxy That's great. I did a couple of updates to XPlot to fix the paket bootstrapper and documentation generation, you'll want to integrate those

jackfoxy commented 6 years ago

If it's already in master, I'll merge. I also implemented paket magic mode, which is possibly what you did @dsyme

sebhofer commented 6 years ago

Thanks to @zyzhu and @dsyme for starting this discussion. I'm really happy to hear about these developments! I agree with most of zyzhu's points; still, as one of your goals seems to be to attract new users to FsLab, I feel that providing my 2¢ of opinion could be helpful. In my experience, starting out with F# for data processing can be quite tough for a newcomer (I'm coming from a science background) for several (some non-technical) reasons. Some thoughts:

First and foremost: getting to know what's available in FsLab is really tough. Try clicking through the FsLab website and the project sites. Frankly, it's quite messy. Just looking at the projects listed on the top right is confusing; this top bar lists anything from 2 to 5 different packages, and hardly ever the same ones. To this day I don't know what exactly is "part" of FsLab. Also, some (but not all) pages link to http://fsharp.github.io/FSharp.Charting/, which is dead.
Starting with Deedle was surprisingly hard for me coming from a dynamic world. Although the documentation is quite extensive, I still needed a lot of time to figure out seemingly simple tasks. I'm not quite sure how one could alleviate this. Maybe a list of common patterns in pandas and their translation to Deedle would help. I also thought about doing a Deedle cheat sheet along the lines of the pandas one, but I never got around to it.
Notebook interface: I completely agree with @zyzhu that these are really useful, and I think it's crucial to have a notebook interface which nicely integrates display of dataframes and plotting without too much fiddling around.
Interop with R and python would certainly be nice, and I think would attract many people who just can't afford to give up using a certain package for some reason or another.
What I would also enjoy having is a data science template similar to this. I'm not sure if FsLab is the place for it, but on the other hand, there are already 2 templates...

dsyme commented 6 years ago

@sebhofer I agree with all those points, thanks. My first aim here is to get FsLab "clean" and spark a round of work on fundamentals like .NET Standard support. But we can also reassess its whole construction - I'm still not sure it should be anything but a template of the kind you propose (does it even need to be a combined nuget package?)

FsLab is, at the moment:

Deedle
XPlot
Math.NET Numerics
RProvider
FSharp.Charting
Some literate programming support

plus a template. These seem reasonable (though FSharp.Charting should I think be dropped now). I think each is quite well documented (once links are all fixed). But the centrality of the literate programming support is questionable in the world of notebooks.

There are also transient dependencies on

Suave
Newtonsoft.Json
Google.DataTable.Net.Wrapper

the first two of which are questionable, and also optional dependencies on:

Google charts
Plotly
R

dsyme commented 6 years ago

@jackfoxy Could you send a PR for your xplot .NET Standard 2.0 work, even if not yet quite complete? Then we can discuss and others can help get it over the line? thanks

dsyme commented 6 years ago

Starting with Deedle was surprisingly hard for me coming from a dynamic world. Although the documentation is quite extensive, I still needed a lot of time to figure out seemingly simple tasks.

@sebhofer I agree with this and I'm concerned by aspects of the Deedle design. It's possible there are also just better data frame libraries emerging for .NET as well, especially with regard to simplicity and discoverability. We need to reassess this.

jackfoxy commented 6 years ago

@dsyme https://github.com/fslaborg/XPlot/pull/75 not merged with latest master

cgravill commented 6 years ago

I've merged a change to IfSharp to target .NET 4.7.1 to ease interaction with .NET Standard 2.0 https://github.com/fsprojects/IfSharp/issues/181 There is some odd behaviour but with that I'm able to use ML.NET 0.3 in the context of a Jupyter Notebook.

It'd be great to have improved support for FsLab. There was some initial work on this in https://github.com/fsprojects/IfSharp/issues/156 but more would be great. The helper script approach does have discovery issues but it's meant we can keep the core cleaner.

sebhofer commented 6 years ago

@dsyme I'm certainly in no position to judge the Deedle design, but I experienced that it's quite easy (for a beginner) to get bad performance if one is not careful. In my case I had to hack my own merge (or join?) function, because the built-in one would just not finish in reasonable time. (The reason was that the built-in version was too general for my problem and could be simplified considerably.) This is in principle not bad, but certainly slows you down in your day-to-day work. So there is certainly some room for improvement.

To finish, I also have to say that it's just great that @tpetricek is so responsive on stackoverflow with respect to any Deedle issues that crop up (or any F# related problem for that matter :)!

nhirschey commented 6 years ago

Deedle

The work on Deedle is tremendous, but (coming from R, SQL) I unfortunately found it complicated to understand the programming model and gave up on it. I found much more success using base F# data structures. It was far simpler.

That said, saving frames to files and using frames to pass data to/from RProvider are fantastic.

The time series join stuff is also great, but I just end up using Array.find for inequality searches or maps for equality searches.
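A minimal sketch of the two lookups described above, using only FSharp.Core. The `Quote` type, the sample data and the helper names are hypothetical and only serve the illustration; `Array.tryFindBack` is used instead of `Array.find` so that a missing match returns `None` rather than throwing.

```fsharp
// Hypothetical quote data, assumed to be sorted by Time (ascending).
type Quote = { Time: int; Price: float }

let quotes =
    [| { Time = 1; Price = 10.0 }
       { Time = 5; Price = 11.0 }
       { Time = 9; Price = 12.5 } |]

// Inequality search: the latest quote at or before time t ("as-of" lookup).
// Array.tryFindBack scans from the end, so each lookup is O(n).
let asOf (t: int) (sorted: Quote[]) : Quote option =
    sorted |> Array.tryFindBack (fun q -> q.Time <= t)

// Equality search: build a Map once, then look keys up directly.
let byTime : Map<int, Quote> =
    quotes |> Array.map (fun q -> q.Time, q) |> Map.ofArray

let priceAsOf6 = asOf 6 quotes |> Option.map (fun q -> q.Price)            // Some 11.0
let priceAt9 = byTime |> Map.tryFind 9 |> Option.map (fun q -> q.Price)    // Some 12.5
let priceAsOf0 = asOf 0 quotes                                             // None
```

For large sorted arrays a binary search would be the usual replacement for the linear scan; the linear version is kept here because it matches the approach described in the comment.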

FSharp.Data

No record collection -> Csv file function is a weak point for saving intermediate results. Hand mapping 30 column records to a CSV row type is not practical or type safe (easy to accidentally transpose two neighboring columns of the same type). So I resort to a version of this: https://stackoverflow.com/questions/25086198/list-of-string-in-a-record-to-csv
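One way to sketch such a "record collection -> CSV" helper is with the reflection API in FSharp.Core, so that the header and the column order always come from the record definition itself. This is an illustration under stated assumptions, not an FSharp.Data feature: `ReturnRow`, `escape` and `toCsvLines` are hypothetical names, values are formatted with the invariant culture, and option-typed or nested fields are not treated specially.

```fsharp
open System
open System.Globalization
open Microsoft.FSharp.Reflection

// Hypothetical record type standing in for a wide results table.
type ReturnRow = { Ticker: string; Year: int; Ret: float }

// Quote a field when it contains a comma, a double quote, or a line break.
let escape (s: string) =
    if s.IndexOfAny([| ','; '"'; '\n'; '\r' |]) >= 0 then
        "\"" + s.Replace("\"", "\"\"") + "\""
    else
        s

// Build CSV lines for any F# record type: the header comes from the field
// names and each row from the field values, in declaration order.
// Throws an ArgumentException if 'T is not a record type.
let toCsvLines<'T> (rows: seq<'T>) : string list =
    let fields = FSharpType.GetRecordFields(typeof<'T>)
    let header = fields |> Array.map (fun f -> escape f.Name) |> String.concat ","
    let line (row: 'T) =
        FSharpValue.GetRecordFields(box row)
        |> Array.map (fun v -> escape (Convert.ToString(v, CultureInfo.InvariantCulture)))
        |> String.concat ","
    header :: [ for row in rows -> line row ]

let rows =
    [ { Ticker = "ABC"; Year = 2018; Ret = 0.125 }
      { Ticker = "X, Inc"; Year = 2019; Ret = -0.5 } ]

let csv = toCsvLines rows
// To save: System.IO.File.WriteAllLines("results.csv", csv)
```

Because the columns are derived from the record type, two neighboring fields of the same type cannot be transposed by hand, which is the type-safety concern raised above.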

formatting

My current workflow is to do calculations in F#, save to CSV, then do literate programming in R Markdown documents for tables and figures. The blocking issue for using F# formatting is automatic LaTeX table formatting of fancy regressions. I guess integration with R LaTeX formatting via RProvider is possible, but I haven't tried it.

I think it will be hard to make a lot of progress here, because the first step is to have the statistical models, then second formatting for it. The holdup is the statistical models.

packages used most often

FSharp.Data
MathNet.Numerics
PSeq

Overall

The real limiting factor is easy integration with statistical models. The .NET way is weird and lacks a lot of stuff in R or Stata or SAS; the DSL work by Matthias would have the most impact, coupled with modern standard error functions. But I know the only way for this to happen is contributors. RProvider would be fine for models, except that I want literate formatting too so I might as well just use R.
The proposed (I think) "#r paket FSharp.Data " syntax would make it far easier for beginners in scripts.
Figuring out project/solution files is still the thing that took me the longest to get. My only purpose is to put common code used across multiple .fsx files in xxx.fs files. Leaving in .fsx is problematic if A.fsx has common code used in B.fsx and C.fsx, but C.fsx also needs to load B.fsx. There probably needs to be documentation showing how to go from a simple script file to a larger project. Simple, but important.

aolney commented 6 years ago

At the risk of piling on, since I was mentioned in an earlier post, I thought I'd give an update on Beaker for polyglot notebook programming. If this is an unfamiliar concept, basically it means you have a computational notebook that is simultaneously connected to multiple language kernels, and accordingly you can program in any of the corresponding languages across cells. So you can munge some data in F# and then in the next cell do some statistical modeling in R.

I've been using Beaker for several years with real workloads, and it works very well. The project has recently pivoted towards supporting Jupyter, with a fairly huge loss of functionality during the pivot. The best current polyglot alternative within Jupyter seems to be the SoS kernel, which can also be used in JupyterLab. So far I've only used SoS for small workloads, but it seems very solid.

In Beaker I've had notebooks that use F#, R, Scala, Groovy, Java, and Javascript, using each where it works best (and potentially has a library dependency that I need). From my perspective, this is far better than trying to bring libraries developed in other languages into F# because:

Native libraries are always current
Native libraries have the best documentation/support
Converted libraries can be more difficult/less fluent to use than native (sorry RProvider)

Polyglot notebooks can have some issues, but these seem to mostly be self-inflicted. For example, Beaker had specific kernel connection code for each supported language, making it difficult to maintain dozens of kernels. Also autotranslation (passing data structures between kernels through the notebook) is a cool and often touted feature that can be difficult to implement well with many edge cases. If autotranslation had a very basic implementation, then many of the associated problems would disappear. In practice I've found it's not really that useful except for passing configuration information between cells (e.g. file paths) because data of non-trivial size needs to hit the disk anyways, where it can be read by other kernels.

Anyways, it seems that polyglot notebooks are here to stay. I habitually use F# within this context (favorite language naturally) but use other languages in the notebook when their native support is a more natural fit. As far as F# kernels are concerned, since Jupyter has replaced Beaker, the ifSharp kernel is the best F# kernel to keep moving forward. I've used ifSharp with SoS/Jupyter and it works great.

zyzhu commented 6 years ago

@aolney Thanks for sharing your experience. Your points clarify my confusion about Beaker and BeakerX. I took a quick look at SoS. It seems that it requires setting up a language module similar to https://github.com/vatlab/sos-r/tree/21883327750a1089066e8933843131d6271bfd74 so that SoS can interop F# with other languages. I found the documentation here https://vatlab.github.io/sos-docs/doc/documentation/Language_Module.html

Is that how you get started on using IfSharp on SoS? Any possibility to create a pull request to share your language module to SoS so that it can support IfSharp kernel out of box? That will help F# community get started on SoS notebook.

aolney commented 6 years ago

I've been using a Jupyterlab installation but I think the process is similar for Jupyter notebook.

I believe this is all it takes:

pip install jupyterlab
jupyter labextension install jupyterlab-sos
jupyter kernelspec install ifsharp

In other words there's no need for a new language module b/c ifSharp works with Jupyter.

My understanding (could be wrong) is that the links you provided are only needed for certain functionalities like autotranslation and syntax coloring. They may also be needed for future capabilities like intellisense and linting.

zyzhu commented 6 years ago

@aolney Thanks for clarifying. Yes. I was interested in autotranslation so that variables between python and R can be used in F# and vice versa. I already got F# kernel working on SoS.

One step further is to autotranslate between pandas dataframe with Deedle. But it requires Feather format support as that's how it's done between R and python right now. https://github.com/vatlab/sos-r/blob/21883327750a1089066e8933843131d6271bfd74/src/sos_r/kernel.py#L83

You mentioned data needs to be dumped to file before consumed by other language. I see that's how it's done between pandas and matlab now https://github.com/vatlab/sos-matlab/blob/d818cb93b8988bb8ecf9e4910c12fe7ab9538e73/src/sos_matlab/kernel.py#L102 We can do that instead to support Deedle and pandas dataframe interaction.

Guess this would be another long-term project. At least the path looks clear.

siavash-babaei commented 4 years ago

So, say everything existing is updated and working nicely. Well, a proper RProvider would give you access to whatever is missing in F# and just about anything in python and then some. RProvider though has not worked for quite some time since R 3.5.

Missing, Missing, Missing:

Data Frames: For R and Python (through pandas), data frames are the primary core data structure that one works with, especially for in-memory data, and almost every library, package, and algorithm is aware of them and uses them in one way or another. Many big data tools even simply chunk large data into manageable data frames and take it from there. On the other hand, no matter how good and effective Deedle is at handling data frames, Accord.NET, Math.NET Numerics, ML.NET, etc. are not aware of Deedle data frames and cannot directly consume them (please correct me if I am wrong ...), which severely limits their usage.
Visualization: In terms of visualizations, both R and python have superb capabilities in ggplot and Matplotlib. FSharp.Charting is not nearly as good. In addition, R has Shiny and python has Bokeh to handle interactive visualization and dashboard, etc similar to what Microsoft Power BI and Tableau offer in terms of preparing interactive reports and dashboards - although obviously less commercially polished ... So far as I know, F# has nothing similar.
Data Retrieval: FSharp.Data is great but it could be expanded upon perhaps incorporating some other TypeProviders as well maybe even supporting data file formats from R, Python, MATLAB, HDFS, SAS, SPSS. The ability to communicate with databases - Whether SQL or NoSQL - and various data sources, is of course of paramount importance. I have not seen TypeProviders for say MongoDB, Cosmos DB, HBase ... As an example, the last update on MongoDB.FSharp which is one of few such projects is 6+ years old and the link on MongoDB website is 8 years old: https://www.mongodb.com/blog/post/enhancing-the-f-developer-experience-with-mongodb.
Big Data: I believe a significant proportion of data analytics workflows are still in-memory, using a standard panel of models for regression/classification/clustering tasks, and do not yet involve stuff like big data, Spark, Deep Learning, and the like. However, not having those capabilities is a deal-breaker. The standard tools of the trade for that are Spark, Keras and Tensorflow; probably a couple of others would complete the list of necessary tooling. We have Spark for .NET, and ML.NET is supposed to offer Tensorflow support, so I suppose the best hope lies with Microsoft and then maybe F# API/wrappers in time. It's a pity, though, that ML.NET is written in C#; had it been F#, maybe it would have helped propel F# similar to what Spark did for Scala, not to mention that F# would have been a lot more suitable at the core than OO C#. Nonetheless, ML.NET is an obvious candidate for inclusion in FsLab and for bigger involvement from F# lads. It offers a very welcome, more unified approach to doing data science, plus support for Tensorflow and ONNX.
Light Intuitive Syntax: Since most of the time, you are doing exploratory analysis and prototyping, quick turn-around is a very important feature. What we have in the likes of ML.NET and Accord.NET is very C#, too awkward and verbose for quick and dirty hacking. In R, you would do:
```
      model <- lm(data = scores, score ~ age * sex)
```
and then, from this model object, you can extract whatever you need, including statistics, coefficients and confidence intervals, error estimates, etc., even diagnostic plots, with some pretty intuitive names. To me, an almost perfect F# equivalent of the above would look like:
```
      let model = 
          let data = scores
          let response = [ "score" ]
          let predictors = [ "age"; "sex" ]
          (data, response, predictors)
          |> linearModel ModelType.OLS CrossEffects.Multiplicative
```
with model object perhaps being a record type with fields corresponding to coefficients table, error estimates, basic statistics, etc.
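To make the "model object as a record" idea concrete, here is a much smaller sketch: ordinary least squares with a single numeric predictor, computed in closed form. It is not the `linearModel` function proposed above (no interaction terms, no categorical predictors, no standard errors); `LinearFit` and `simpleOls` are hypothetical names, and a constant predictor (zero variance in `xs`) is not handled.

```fsharp
// Hypothetical result type: fields named for what an analyst would look up.
type LinearFit =
    { Intercept: float
      Slope: float
      RSquared: float
      Residuals: float[] }

// Ordinary least squares for one predictor: y = intercept + slope * x.
let simpleOls (xs: float[]) (ys: float[]) : LinearFit =
    if xs.Length <> ys.Length || xs.Length < 2 then
        invalidArg "xs" "need two arrays of equal length with at least 2 points"
    let meanX = Array.average xs
    let meanY = Array.average ys
    // Sum of cross-deviations and sum of squared deviations of x.
    let sxy = Array.fold2 (fun acc x y -> acc + (x - meanX) * (y - meanY)) 0.0 xs ys
    let sxx = xs |> Array.sumBy (fun x -> (x - meanX) * (x - meanX))
    let slope = sxy / sxx
    let intercept = meanY - slope * meanX
    let residuals = Array.map2 (fun x y -> y - (intercept + slope * x)) xs ys
    let ssRes = residuals |> Array.sumBy (fun r -> r * r)
    let ssTot = ys |> Array.sumBy (fun y -> (y - meanY) * (y - meanY))
    { Intercept = intercept
      Slope = slope
      RSquared = 1.0 - ssRes / ssTot
      Residuals = residuals }

// Exactly linear toy data, y = 1 + 2x.
let fit = simpleOls [| 1.0; 2.0; 3.0; 4.0 |] [| 3.0; 5.0; 7.0; 9.0 |]
// fit.Slope is 2.0, fit.Intercept is 1.0 and fit.RSquared is 1.0 for this data
```

Returning an immutable record keeps every result discoverable through dot-completion (`fit.Slope`, `fit.Residuals`), which is the kind of light, intuitive access the comment asks for.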

siavash-babaei commented 4 years ago

For whatever product, you would require a few killer features that would make it indispensable, and for F#, it could easily be the entire data analytics and data science workloads. The same thing that greatly helped propel python to the front. The user base, especially, being more mathematically inclined and comfortable with the syntax (I just love/adore it but dunno why makes lots of people uncomfortable), ideas of immutability and the core of language being input -> function -> output, would be much better adopters than say, developers active in GUI or web. There are other areas I am sure, for example, business applications that fit nicely with Domain-Driven Design. But data science workloads - incidentally, a perfect match for DDD - are certainly worth the investment, especially as they seem to be exponentially growing both in volume and utilisation. If you think about it, one of the most active open source big data projects, Spark, is only 7 years old - with many users adopting a difficult language like Scala just to use full Spark capabilities and performance. The community as a whole seems to be more-so accepting of learning and new tech that makes their life easier. FsLab could be that unified environment for data analytics pipelines with a comprehensive suite of up-to-date tools accessible from whatever OS, with pieces that have the necessary awareness of each other. Kind of pointless to have a data frame that cannot be readily consumed within the tools that you use to analyse your data: ML.NET has its own data frame and btw, it seems very inferior to that of Deedle; and, Accord.NET has its own extremely horrible way of consuming data in the form of arrays. It is going to be an involved process though starting from selecting a set of standard features the community and more importantly, the language requires in this regard - a lot of input needed from developers and more importantly, users. 
Further steps could even involve attracting corporate support and money. Ideally, you would end up in an environment like MATLAB, R, or Julia, where you can readily hack quick-and-dirty, just as well as develop polished applications (very clumsy and difficult to do in R/MATLAB and unsound/non-performant in python).

siavash-babaei commented 4 years ago

Corporate support could be subtle, could be a lot of things from adoption and critique to code contribution to money, marketing, etc. For example,

make F# code usable directly from within SQL Server and Power BI (remembering that a big component there, M Language, is inspired by F#) the same as is currently the case with R and Python; and/or,
a backend (or some form of help with one) for ML.NET and SPARK.NET in carefully designed stable idiomatic F# 5.0 with superb documentation, tutorials, etc.
encouragement of more visibility and publicised utilisation: if you are doing something in F#, make a note of it somewhere on your website ... publish a link to that in some forums/blogs, yadi-yada-yada ...

dsyme commented 4 years ago

Adding comments from https://github.com/fslaborg/RProvider/issues/209

It's a good time to finally address this issue. There are many questions being discussed here. Let's just deal with the question of FsLab and its pieces.

Here are my opinions:

The whole idea of a curated, unified collection like FsLab has turned out to be suspect as it doesn't really allow for change, evolution and deprecation unless very actively curated.
The curation stopped because FsLab as a collection was based on .NET Framework, and some parts of the collection suffered badly in the transition to .NET Core. It only took one part to be still stuck on Mono or .NET Framework to render the whole thing stuck. That's what happened.
With mono out of the loop things are easier once we establish a reasonable landing point
MSFT is active in the parts of FsLab we directly care about - XPlot, fsdocs (which now generate .NET Interactive notebooks), .net interactive, F# literate scripting. It is also active in many related technologies. Other companies also contribute
Reorienting to join forces with SciSharp, .net interactive and similar seems much more practical.
FsLab certainly needs to be taken down and/or revamped on .NET Core only and/or wound up as a "one-stop shop technology". That will create space for better approaches I think. I'm open to suggestions but we need to rethink things.
Note I'm not so interested in discussing this from a "future of F#" perspective (this has nothing to do with F# and web programming, for example) but rather just practical steps to get things cleaned up on a good, sustainable, coherent basis going forward

Looking forward to making some progress here....

dsyme commented 4 years ago

BTW as can be seen from the discussion above I took a crack at modernizing FsLab to .NET Core in 2018. I was honestly shocked how hard it was back then. It will be much easier now.

FsLab had dependencies on FSharp.Formatting for literate scripting, including Razor for templating, and relied on mono. At the time, FSharp.Formatting was barely functioning on .NET Core, and we only removed the Razor dependency earlier this year.

Anyway, in 2020 I finally went through and added a usable .NET command-line tool, "fsdocs", to FSharp.Formatting, which includes the literate scripting functionality of FsLab.

Another major factor is VS and VSCode. VSCode is now the obvious place to centralise all such work.

siavash-babaei commented 4 years ago

ML.NET plus a feature-complete SciSharp with idiomatic F# support would solve most - if not all - problems: while C# syntax is awkward for data science pipelines, F# syntax sits very nicely with it, but maybe F# syntax would be too much an ask or for a much longer horizon. The current project appears to be a work-in-progress though with some ways to go before all features are delivered.
You would still definitely need a simple, easy, and quick way to interop with R or Python for cases not covered in SciSharp, etc. Julia offers python interop I think, and F# used to do the same with RProvider but that hasn't worked for a long time.
Interactive data visualisation and the ability to build reports and dashboards are also very important, something akin to R Shiny. Adding F# to the roster of languages you can utilise in Power BI for scripting could alleviate the matter to some extent. Python Bokeh is not nearly as good as R Shiny and anyway it is not in the list of SciSharp projects.
@dsyme, A "unified approach" in the sense of something similar to SAFE Stack or SciSharp Stack actually works brilliantly. However, we should note a few things:
1. The size of the F# Community as a whole doesn't allow for very active curation in many cases.
2. The size of the F# Data Science Community is even smaller. More importantly, plenty of us have little to no experience in software development, and specifically in large-scale projects, and as a result, even when we occasionally do develop software, it leaves much to be desired in terms of sound design, documentation, etc. Both R and Python are littered with libraries that, whilst working, have no proper documentation and are badly designed, inefficient, and slow, because they were written by statisticians or mathematicians more concerned with getting it done fast than by properly experienced developers.
3. These issues make it very rational to rely on readily available tools as much as possible rather than heavily maintain native projects. Do away with Deedle, FSharp.Charting, MBrace, etc. for starters. Add, and in time build on, tools like SciSharp Stack, ML.NET, and Spark.NET, providing F#-specific backends that simplify the syntax and make it more productive to code for data analytics. Once we settle on projects to include, we would have the added benefit that we can devise ways to have a meaningful impact on the underlying projects: the sooner the community can affect something like SciSharp, the better.
@dsyme, FsLab becoming a "one-stop-shop technology" is actually the ideal approach - an "FsLab Stack" maybe. Who does not like convenience? More importantly, you have to consider the users/customers of that tech, who are a different breed of programmers/developers than the ones you would see in your usual software development industry, with different needs. Coding is not their primary purpose or activity per se; rapidly going through vast exploratory steps, discarding and selecting, and communicating results is.

WalternativE commented 3 years ago

I'm by no means a solid practitioner (currently trying to get into the field from my background as a software engineer) so I'm talking more opinion than knowledge here.

I think @siavash-babaei already lays down a lot of valid points. I'd just like to add that it would be great to get an idea of where we currently are as a community with regard to data science/machine learning. Take the people at https://github.com/CSBiology, who are using F# in their research and mainly fell back to writing a lot of things themselves. SciSharp doesn't really have F# on their radar, at least it seems that way to me.

Many FsLab projects are indeed a bit 'stale'. I've been having good success working with Deedle, but there are a lot of things one could improve (especially in writing docs). From a technology standpoint the new Microsoft data frame is really fancy (being built on Arrow), but I wouldn't have felt as productive with it as I did with Deedle. ML.NET works well enough in combination with it, and even though the API is a bit odd it is quite fun to use - in the sense that you feel quite productive. If you want to really tweak it, it gets kind of tedious.

The thing I'm missing most in comparison to a mature environment like the R Tidyverse is the ease of going from exploration to wrangling to modelling to validation and back again. In my current work with F# I really feel that the different libraries I use were built by entirely different teams with little to no regard for each other. Yeah you can somehow plot the data in your Deedle frame...you just have to 'un-deedle' it first. Same for modelling in ML.NET. It would be nice having a set of abstractions, that make this easier. At least some common ground with other projects. Like...I'm currently not even sure how many implementations of linear algebra libraries (and libraries that build upon them) are around. I just know, that they most likely aren't compatible. SAFE is great because it is a set of nice defaults. If I don't like the defaults it is trivial for me to swap them out.
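To make the "set of abstractions" idea a little more concrete, here is a minimal sketch of what a shared contract for rectangular data could look like. Everything in it (`IColumnSource`, `ofRecords`, `scatterPoints`, `columnMean`) is hypothetical and exists in none of the libraries mentioned; it only illustrates how one small interface could be consumed by both a plotting step and a modelling step without "un-deedling" in between.

```fsharp
// Hypothetical shared abstraction: nothing here is an existing library API.

/// A minimal "rectangular data" contract: named numeric columns of equal length.
type IColumnSource =
    abstract ColumnNames : string list
    abstract Column : string -> float[]

/// Adapter: expose a list of records as an IColumnSource via column projections.
let ofRecords (columns: (string * ('T -> float)) list) (rows: 'T list) =
    let cached = List.toArray rows
    { new IColumnSource with
        member _.ColumnNames = columns |> List.map fst
        member _.Column name =
            // Throws KeyNotFoundException if the column name is unknown.
            let _, project = columns |> List.find (fun (n, _) -> n = name)
            cached |> Array.map project }

/// A "plotting" consumer: returns the (x, y) points a chart library would draw.
let scatterPoints (src: IColumnSource) (xName: string) (yName: string) =
    Array.zip (src.Column xName) (src.Column yName)

/// A "modelling" consumer: a column mean, standing in for a real model fit.
let columnMean (src: IColumnSource) (name: string) =
    src.Column name |> Array.average

type Measurement = { Height: float; Mass: float }

let measurements =
    [ { Height = 172.0; Mass = 77.0 }
      { Height = 167.0; Mass = 75.0 }
      { Height = 96.0; Mass = 32.0 } ]

let data =
    ofRecords
        [ "height", (fun (m: Measurement) -> m.Height)
          "mass", (fun (m: Measurement) -> m.Mass) ]
        measurements

// Both consumers work against the same contract, with no conversion step.
let points = scatterPoints data "height" "mass"
let meanMass = columnMean data "mass"
```

The design choice worth noting is that the contract is deliberately tiny: a library would only need to provide an adapter like `ofRecords` for its own frame type to become usable by every consumer written against the interface.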

Hope this added something other than more confusion to the discussion. 🙇‍♂️

siavash-babaei commented 3 years ago

Thank you @WalternativE. Actually, Tidyverse can act as a blueprint for similar capabilities in F#. The syntax is very nice and functional, providing a sort of how-to. The whole pipeline, from importing to cleaning, transformations, visualisation, modelling, and communicating results, must be handled in ONE unified framework and, just as importantly, with a simplified, compatible, efficient, and intuitive syntax. No matter how good Deedle is, what's the point if it is not directly consumable in ML.NET or whatever other tool? MBrace might be excellent, but what's the point while you don't have access to established tools for the job? Also, if we want it to take off, there should be ports in C#, R, Python, Java, etc., to attract a sufficient userbase and gain a foothold... In any case:

Data. Web HTML, JSON, XML, Text, Databases SQL/NoSQL, etc. We need an expanded FSharp.Data maybe even encompassing StringProvider just to have everything in one place with the important bits all nicely tucked into a single packaging.
Wrangling. Clean Up, Tidy, Transformations. Deedle is good, but what's the point if you cannot utilize it elsewhere? We need concepts of data frames, data cubes, etc. with similar capabilities and ease of use to those in R Tidyverse and Python, that can be readily consumed. Having Spark and its cohort is of paramount importance here; with a nice F# backend for Spark.NET, we can get that over with.
Visualization. Something akin to R ggplot or Python Matplotlib, and something like R Shiny for interactive visualization, dashboards, and reports. Plus addition of F# to roster of languages usable from within SQL Server and Power BI for scripting.
Modeling and Validation. Be it ML.NET or SciSharp Stack or something else, we have to pick - with a long-term view on maintenance, adoption, etc. - one or two that are more or less reliable, stable, and comprehensive, giving us the ability to utilize cloud, GPU, etc., like Spark, TensorFlow, Keras, ... We can always add some neat F# wrappers around them for syntactic sugar. In any case, developing stuff from the ground up is mostly too expensive for our community; better to stick to adopting existing cornerstones of tech for handling workloads.

Once we have made the adoption and added F# backends, we can begin to affect the underlying projects bit by bit, provided we are quick enough in adoption and not too late to the party. Caution: with Python's BDFL, GvR, moving to Microsoft, will they just can ML.NET, which was supposed to be a comprehensive framework, in favor of SciSharp?! And will this move sideline F# if we are not careful?!

zyzhu commented 3 years ago

The visions discussed above sound wonderful. However, to be realistic, these visions need a proper business plan and sponsorship for long-term sustainability.

RStudio is behind the huge push to modernize R development since Hadley Wickham joined RStudio in 2013. https://insights.stackoverflow.com/trends?tags=dplyr%2Cggplot2%2Ctidyverse If you check a few tag trends on stackoverflow, Hadley's ggplot2 got popular since 2009, but tags such as tidyverse and dplyr only took off after 2016.

We already got clear indication from Don about what Microsoft team will focus on. I think the community shall just organically improve other individual components or create something from scratch. It is unrealistic to expect a holistic approach to build the whole data science ecosystem until a company like RStudio shows up. Maybe it will never show up.

Many magics in tidyverse rely on the flexibility of dynamic typing. I always feel bad seeing some dplyr samples like the following. To achieve similar result, it will be super verbose in Deedle https://dplyr.tidyverse.org/index.html

starwars %>% 
  mutate(name, bmi = mass / ((height / 100)  ^ 2)) %>%
  select(name:mass, bmi)

Where do name, bmi, mass, height come from? You don't need to declare anything but in R it just works as they are column names in starwars. Or this https://genomicsclass.github.io/book/pages/dplyr_tutorial.html

msleep %>% 
    group_by(order) %>%
    summarise(avg_sleep = mean(sleep_total), 
              min_sleep = min(sleep_total), 
              max_sleep = max(sleep_total),
              total = n())

This kind of exploratory work is very productive for statisticians. Though the above lines are impossible to maintain, statisticians do not care, as productivity in exploration is the priority. I do not see any statically typed language beating this.

My point is that dynamic typing language and static typing language have their own pros/cons to exist in the world and we have to use both if we are in data science field.

WalternativE commented 3 years ago

Thanks for your insights @zyzhu. Yeah, I remember working in a small lab, that used R heavily before the RStudio people came along. It really wasn't as convenient as it is now (pretty chaotic if I remember correctly).

Regarding Microsoft's involvement, I see that interactive programming is getting a good push (both FSI and dotnet interactive, which I've been using daily since it was stable enough to do so). FSharp.Formatting - or fsdocs if that's the better name now - is also coming along, which is super important for documentation (one of the most important factors for me - I can go through source code but it will take me a while to get everything in my head). XPlot is already a library where I see a bit of friction. The CSBiology team already has a project which is currently living in the official plotly organization https://github.com/plotly/Plotly.NET. One can discuss the API, but it appears (at least from the Plotly side) more complete than XPlot. I'm also eagerly watching the strides DiffSharp makes - one of the most exciting movements in the .NET space right now I'd say - and am all for backing the project. If there is a good ONNX story for DiffSharp I can totally see using interactive environments to work on models, special computing environments for training and ML.NET as a deployment target whenever a model goes into production (at least that is how it would work in my head - the usual disclaimer: I'm an enthusiast and no expert). Microsoft has its own ideas about how to work with rectangular data in .NET. I'm not entirely sure how that's going to play out, I'm not even sure if Microsoft knows.

All of these projects are pretty disconnected from SciSharp. Basically every project I mentioned (apart from interactive programming and plotting) has one (or multiple) alternatives in this organization. From what I've experienced so far they aren't really compatible with most of it. If there were some abstractions we'd share I could see this change. In R, for example, it is really easy to work with most libraries because the notions of a vector, a matrix and a dataframe are ingrained in everything. In the SciPy stack basically everything is built on top of NumPy, and whenever they use rectangular data they try to offer some sort of pandas interop. Right now - as I said before - we have quite a lot of linear algebra implementations, five - at least that's the number I'm aware of - major interpretations of rectangular data (three of them from Microsoft) and widely scattered statistics libraries. At least in the space of classical ML we're down to a narrow field because some of the old contenders simply died out. Compatibility between all of those components isn't really there, so committing to one takes away developer mindshare from the others. I love the F# community but we're simply not big enough to play zero-sum games.

Relating to R being more flexible because it is a dynamic language. Yeah, that's right. The API is slicker due to not having to look out for types. As you mentioned, that makes working with it faster and the resulting analyses more brittle. I'm still pretty convinced, that it can be possible to get to a sweet spot where the API is - for the most part - statically typed and nice to work with. F# is capable of doing really impressive stuff with types but I'm totally with you, that it is not really possible to make it as convenient as the same API in a dynamic language. Still, I'm gladly taking the burden of writing more explicitly typed code if I can get some guarantees, that my code actually works (and keeps on working in the future). I don't want to bash R and/or Python but reproducing an environment using any of both languages is just so much pain. .NET is really good at working reliably (yeah - still my point even after the never ending story of moving to .NET Core, One .NET, whatever name they're going with today).

So, to finally get to a point. I can live with not having a one-stop-shop. I am happy with there only being a core of very well maintained libraries and tools, that help me to do DS/ML in .NET. I just think it would be necessary to have a group of people with skin-in-the-game, that talk about some foundational parts and interop. Some place to look to if you want to align with the rest of the "scientifically minded" .NET community. Everything, that enables individual contributors to build something without reinventing the whole stack from the ground up.

I hope my "ramblings" make some sense to you all and maybe contribute a bit to the greater discussion. 🙇‍♂️

zyzhu commented 3 years ago

Just to pile on a bit more. R can work with Python through reticulate package backed by RStudio. RStudio acknowledges the benefits of integrating two complementary ecosystems. https://blog.rstudio.com/2019/12/17/r-vs-python-what-s-the-best-for-language-for-data-science/

"For individual data scientists, some common points to consider:

Python is a great general programming language, with many libraries dedicated to data science.

Many (if not most) general introductory programming courses start teaching with Python now.
Python is the go-to language for many ETL and Machine Learning workflows.
Many (if not most) introductory courses to statistics and data science teach R now.
R has become the world’s largest repository of statistical knowledge with reference implementations for thousands, if not tens of thousands, of algorithms that have been vetted by experts. The documentation for many R packages includes links to the primary literature on the subject.
R has a very low barrier to entry for doing exploratory analysis, and converting that work into a great report, dashboard, or API.
R with RStudio is often considered the best place to do exploratory data analysis.

For organizations with Data Science teams, some additional points to keep in mind:

For some organizations, Python is easier to deploy, integrate and scale than R, because Python tooling already exists within the organization. On the other hand, we at RStudio have worked with thousands of data teams successfully solving these problems with our open-source and professional products, including in multi-language environments.
R has a great community of supportive data scientists from diverse backgrounds. For example, R-Ladies is a global organization dedicated to promoting gender diversity in the R Community.
Most interfaces for novel machine learning tools are first written and supported in Python, while many new methods in statistics are first written in R.
Trying to enforce one language to the exclusion of the other, perhaps out of vague fears of complexity or costs to support both, risks excluding a huge potential pool of Data Scientist candidates either way.
Advice on building Data Science teams often stresses the importance of having a diverse team bringing a variety of viewpoints and complementary skills to the table, to make it more likely to efficiently find the “best” solution for a given problem. In this vein, R users tend to come from a much more diverse range of domain expertise (ecology, economics, psychology, bioinformatics, policy analysis, etc.)."

https://rstudio.com/solutions/r-and-python/ I heard from @cartermp's video that the roadmap of dotnet/interactive might consider interop with Python. I just hope Microsoft has an even grander vision.

If I were in their shoes to push ML and data science, Microsoft shall just acquire RStudio and integrate more with the wide community, similar to the acquisition of GitHub to push OSS, acquisition of Xamarin to push cross-platform development. Anyway, this is a bit off topic.

siavash-babaei commented 3 years ago

Dear @zyzhu, you are obviously right that there certainly are a great many serious hurdles, exacerbated by limitations of community size and corporate support. Then again, it is the question of the Egg and the Chicken: to expand the community and attract support, you would need a working blueprint with bells and whistles and features to get some meaningful investment. The issues surely need carefully thought-out technical, business, and implementation plans but they are not by any means far-fetched or unattainable. An effort similar to the one that went into SAFE Stack could just as easily and usefully turn into an FsLab Stack. A lot of components are there, like in SciSharp Stack or ML.NET, but could definitely do with F# sugar, which however you cut it is much less involved than developing from scratch. While it is true that RStudio and Hadley Wickham, with the addition of what is essentially the Tidyverse dialect, were a big boost for R, R was the biggest name in data science long before that, although the syntax was rubbish compared to now. In comparison, we have @dsyme and @tpetricek, and then we have Microsoft, which dwarfs RStudio by any measure. Microsoft has recently shown great willingness and initiative in data science, from the purchase of Revolution Analytics and turning it into Microsoft R, to the addition of Python more or less as a first-class citizen, and now bringing over GvR. So they are willing to spend the money and contribute to development. We just need to carefully craft an approach and message, utilize current assets within the company and take it from there. I do not expect things to be as easy and concise in F# as with dynamic languages, but F# syntax is inherently more mathematical and less verbose than Python or R in many ways, and the trade-offs in sound design, performance, implementation, and so on are well worth it. Frankly, I really liked Deedle, and it offered a lot of tidyr, dplyr, and lubridate functionality, etc.
I used Deedle to do a project ingesting, cleaning, and tidying large text files containing sensor data, resulting in ready-for-analysis CSV files. I did not experience problems, although I needed to write a few operations myself, but overall it was very nice and usable, with an easy and fluid syntax that actually reminded me a lot of R Tidyverse. For this project it was OK since the output was essentially a set of data frames, or rather a data cube, exported to files. It would have been problematic, though, had I needed to do analysis in F#, since nothing else understands Deedle data frames. All in all, my experience is that with F# we can have a one-stop shop that, even though strongly typed, with its naturally concise and mathematical syntax, can still be very viable and pleasant when exploring data, as well as much more performant, safe, and easily maintainable when producing final results. This is actually something Deedle demonstrates beautifully: that we can have strongly typed, smooth, concise code that lends itself perfectly to data science and exploratory steps. And frankly, no, we don't have to know both static and dynamic languages. What you get in reality is most people using R and Python because they can handle the entire data workload, whether it is basic data analysis, visualization/communication/presentation, or advanced modelling, be it in-memory or in-cloud, on CPU or GPU, both languages having the necessary tools and delivering results with quick turn-around. We should keep in mind that while both R (bad core language design with horrible syntax, also slow) and Python (very OO and slow) are inherently poorly suited to modern data science, the whole design and philosophy of F# fits just perfectly with data workloads that are always about immutable data structures, Input->Function->Output, with perhaps only visualisations as major side effects.

nhirschey commented 3 years ago

On Don's main point about FsLab as an organization: agreed that a curated collection of packages doesn't work well when the underlying packages are not actively maintained. Deedle is the only one actively maintained. Even the main FSharp.Data is suffering and relying on Don for maintenance.
We need active maintainers, especially ones in this thread coming from the engineering -> data perspective. This is the type of person who could help with things like getting the FSharp.Data build working again.
tidyverse, what specifically is better? Tidyverse consists mostly of
- Magrittr pipe operator %>% which was inspired by |> from F#. F# compares fine in this dimension.
- dplyr filter/mutate/summarise verbs map very closely to the Seq/List/Array collections. Personally, I find base F# easier to understand than Deedle. Deedle is nice because of special functions for inequality joins, stats helpers that @zyzhu has added in, and you don't need to create as many record types as you iteratively reshape data (though anonymous records now help the type proliferation concern). But take @zyzhu's example, to my eye it is only slightly more verbose in base F#, though I admit being able to programmatically select record fields (as in name:mass) would be nice.
```
// mutate + select: add a bmi column next to the original fields
starwars
|> Seq.map (fun x ->
    {| name = x.name; height = x.height; mass = x.mass; bmi = x.mass / ((x.height / 100.0) ** 2.0) |})

// group_by + summarise: one summary record per order
msleep
|> Seq.groupBy (fun x -> x.order)
|> Seq.map (fun (order, xs) ->
    let sleep_total = xs |> Seq.map (fun x -> x.sleep_total) |> Seq.toArray
    {| order = order
       avg_sleep = Array.average sleep_total
       min_sleep = Array.min sleep_total
       max_sleep = Array.max sleep_total
       n = Array.length sleep_total |})
```
- ggplot2 is great for publication quality graphics, but the plotly APIs seem fine with base F# collections (the equivalent of R vector, matrix, data.frame). Better control for saving Plotly images to files (like ggsave) would be useful. There's a 4 year old issue about this on XPlot (https://github.com/fslaborg/XPlot/issues/43).
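For anyone who wants to try the two base F# pipelines above end to end, here is a self-contained sketch. The record types and the handful of sample rows are made up for illustration (they merely stand in for the R `starwars` and `msleep` datasets), and lists are used instead of sequences so the results can be inspected directly.

```fsharp
// Hypothetical record types and sample rows standing in for the R datasets.
type Character = { name: string; height: float; mass: float }
type Mammal = { species: string; order: string; sleep_total: float }

let starwars =
    [ { name = "Luke Skywalker"; height = 172.0; mass = 77.0 }
      { name = "R2-D2"; height = 96.0; mass = 32.0 } ]

let msleep =
    [ { species = "Cheetah"; order = "Carnivora"; sleep_total = 12.1 }
      { species = "Dog"; order = "Carnivora"; sleep_total = 10.1 }
      { species = "Cow"; order = "Artiodactyla"; sleep_total = 4.0 } ]

// mutate + select: add a bmi column, keeping the original fields.
let withBmi =
    starwars
    |> List.map (fun x ->
        {| name = x.name
           height = x.height
           mass = x.mass
           bmi = x.mass / ((x.height / 100.0) ** 2.0) |})

// group_by + summarise: one anonymous record per order,
// in order of first appearance in the input.
let sleepByOrder =
    msleep
    |> List.groupBy (fun x -> x.order)
    |> List.map (fun (order, xs) ->
        let sleepTotals = xs |> List.map (fun x -> x.sleep_total)
        {| order = order
           avg_sleep = List.average sleepTotals
           min_sleep = List.min sleepTotals
           max_sleep = List.max sleepTotals
           total = List.length sleepTotals |})

for row in sleepByOrder do
    printfn "%s: avg %.2f, min %.1f, max %.1f, n = %d"
        row.order row.avg_sleep row.min_sleep row.max_sleep row.total
```

Unlike the dplyr version, a misspelled field such as `x.sleep_totl` is rejected by the compiler rather than failing at run time, which is the trade-off being discussed in this thread.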
Missing statistical modelling and model reporting: this is the biggest issue for me. F#/C# has a good story for machine learning (due to Microsoft interest), but little in terms of statistics. This infrastructure is needed, but it's unclear to me whether to coordinate on extending ML.NET, Math.NET Numerics, the new Deedle regression interface, DiffSharp, etc. My sense is that there's little interest in this from the F# community.
- There is no way to do panel data analysis (https://cran.r-project.org/web/packages/lfe/index.html or https://www.statsmodels.org/stable/index.html).
- There is no way to do robust standard error estimation (https://cran.r-project.org/web/packages/sandwich/index.html or https://www.statsmodels.org/stable/index.html).
- There is no way to generate publication quality model summaries (https://vincentarelbundock.github.io/modelsummary/).
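To give a feel for the kind of building block the robust standard error item refers to, here is a minimal sketch for the simplest possible case: one regressor plus an intercept, with classical and White (HC0) standard errors for the slope. The names `SimpleOls` and `fitSimpleOls` are hypothetical, this is plain F# rather than any existing library's API, and it is nowhere near what `sandwich` or `statsmodels` provide (no multiple regressors, no clustering, no small-sample corrections).

```fsharp
/// Result of a one-regressor OLS fit. Hypothetical type, for illustration only.
type SimpleOls =
    { Intercept: float
      Slope: float
      SlopeSeClassical: float   // assumes homoskedastic errors
      SlopeSeRobust: float }    // White (HC0), heteroskedasticity-robust

/// Fits y = intercept + slope * x by ordinary least squares.
/// Note: if all x values are identical, Sxx is zero and the results are NaN.
let fitSimpleOls (x: float[]) (y: float[]) =
    if x.Length <> y.Length || x.Length < 3 then
        invalidArg "x" "x and y need equal length and at least 3 observations"
    let n = float x.Length
    let xBar = Array.average x
    let yBar = Array.average y
    let sxx = x |> Array.sumBy (fun xi -> (xi - xBar) ** 2.0)
    let sxy = Array.fold2 (fun acc xi yi -> acc + (xi - xBar) * (yi - yBar)) 0.0 x y
    let slope = sxy / sxx
    let intercept = yBar - slope * xBar
    let residuals = Array.map2 (fun xi yi -> yi - intercept - slope * xi) x y
    // Classical variance of the slope: s^2 / Sxx, with s^2 = RSS / (n - 2).
    let rss = residuals |> Array.sumBy (fun e -> e * e)
    let seClassical = sqrt (rss / (n - 2.0) / sxx)
    // HC0 variance of the slope: sum((x_i - xBar)^2 * e_i^2) / Sxx^2.
    let meat =
        Array.fold2 (fun acc xi e -> acc + (xi - xBar) ** 2.0 * e * e) 0.0 x residuals
    let seRobust = sqrt (meat / (sxx * sxx))
    { Intercept = intercept
      Slope = slope
      SlopeSeClassical = seClassical
      SlopeSeRobust = seRobust }

let olsResult = fitSimpleOls [| 1.0; 2.0; 3.0; 4.0 |] [| 1.0; 3.0; 2.0; 4.0 |]
printfn "slope = %.3f (classical se %.3f, HC0 se %.3f)"
    olsResult.Slope olsResult.SlopeSeClassical olsResult.SlopeSeRobust
```

The arithmetic itself is short; the harder part is agreeing on which library such functionality should live in and what the result type should look like.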

matthewcrews commented 3 years ago

I would love to help with this project. I spent a lot of time in the R/tidyverse before becoming a full time F# dev. Having RStudio and a set of recommended libraries was a huge boost for early adoption. CRAN has a huge number of libraries that exist outside of the tidyverse that are still widely used (Ex: data.table). Having a prescription at the beginning streamlined people entering the ecosystem.

I am looking into creating a Developer Advocate role at my company so I can spend more of my time on accelerating adoption of F# for Machine Learning, Optimization, and Engineering. I know that having someone focus on easing the on-ramp to a language will go a long way in growing the community.

matthewcrews commented 3 years ago

I completely agree with @zyzhu when it comes to the productivity of Python/R. Dynamically typed languages have an "advantage" in that it is easy to start whipping together code quickly. Part of the reason I am so excited about the idea of Erased Discriminated Unions is that it addresses one of the pain points of a statically typed language compared to a dynamically typed one. My hope is that we can build on this, and potentially an additional enhancement to Computation Expressions, to "re-capture" the productivity of R.

F# is a great language for Data Science/Analytics. I think with a few key enhancements, it could match the productivity of Python/R.

The Anaconda Python distribution solved the problem of "How do I get started?" I don't think F# should go that far, but having a curated list of recommended libraries and walkthroughs would go a long way towards streamlining adoption. The key metric I look at is, "How long does it take me to ingest an arbitrary CSV file, plot the data, and perform some kind of model fit?" Tightening that loop will be critical.
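As a rough baseline for that loop, this is roughly what the "ingest and take a first look" part costs with nothing but the .NET base library. It is only a sketch under simplifying assumptions: the CSV text is inline rather than read from a file, the parser is naive (comma-separated, header row, no quoted fields), and `parseCsv`, `numericColumn` and `correlation` are hypothetical helpers, not functions from FSharp.Data or any other package. Plotting is left out because it needs a charting library.

```fsharp
open System
open System.Globalization

/// Naive CSV parsing: comma-separated, header row first, no quoted fields.
let parseCsv (text: string) =
    let lines =
        text.Split([| '\n' |], StringSplitOptions.RemoveEmptyEntries)
        |> Array.map (fun (l: string) -> l.Trim())
        |> Array.filter (fun l -> l <> "")
    let splitRow (l: string) = l.Split(',') |> Array.map (fun c -> c.Trim())
    let header = splitRow lines.[0]
    let rows = lines.[1..] |> Array.map splitRow
    header, rows

/// Returns a named column as floats; throws on unknown names or non-numeric cells.
let numericColumn (header: string[], rows: string[][]) (name: string) =
    let i = Array.IndexOf(header, name)
    if i < 0 then invalidArg "name" (sprintf "no column named %s" name)
    rows |> Array.map (fun r -> Double.Parse(r.[i], CultureInfo.InvariantCulture))

/// Pearson correlation, as a stand-in for "some kind of model fit".
let correlation (xs: float[]) (ys: float[]) =
    let xBar = Array.average xs
    let yBar = Array.average ys
    let cov = Array.fold2 (fun acc x y -> acc + (x - xBar) * (y - yBar)) 0.0 xs ys
    let sx = xs |> Array.sumBy (fun x -> (x - xBar) ** 2.0) |> sqrt
    let sy = ys |> Array.sumBy (fun y -> (y - yBar) ** 2.0) |> sqrt
    cov / (sx * sy)

// Made-up sample data.
let csvText = "hours,score\n1,2\n2,4\n3,6\n"
let table = parseCsv csvText
let hours = numericColumn table "hours"
let score = numericColumn table "score"
let r = correlation hours score
printfn "rows = %d, correlation = %.3f" hours.Length r
```

A curated starter set would replace each of these hand-written helpers with a recommended package, which is exactly where a prescription at the beginning saves newcomers time.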

There are some "ergonomic" issues that could be improved with some key features that I think will make F# a more powerful language and better suited for Analytics/Modeling/Data Science.

cartermp commented 3 years ago

@WalternativE regarding this:

XPlot is already a library where I see a bit of friction. The CSBiology team already has a project which is currently living in the official plotly organization plotly/Plotly.NET. One can discuss about the API but it appears (at least from the Plotly side) more complete than XPlot

Since I'm the current maintainer of XPlot I'll say that I intend on people eventually moving over to Plotly.NET. We spoke with the team over at plotly and they're helping fund the effort. I think that XPlot is pretty good, but anything involving plotly should ultimately just use official bindings when a commercial entity like Plotly is interested in long-term maintenance. They were quite pleased with F# community activity in OSS and felt like it was a good investment.

So, where does that leave XPlot? Unsure, since the plotly package is the most-used and the most feature complete, and I'm likely to encourage people to move to Plotly.NET once it reaches 2.0.0. Its charter of having a consistent-ish API across different charting APIs remains unchanged and it's probably still a fine choice for several tasks.

Anyways, that's how I'd consider the issue of charting with Plotly as a backend moving forward.

WalternativE commented 3 years ago

@cartermp thanks for the update and thanks for your work in XPlot (and your work as our dearest PM while we're at it 🧡🧡🧡). It's wonderful to see, that there are multiple parties working in unison to make a stable data visualization library for .NET.

cartermp commented 3 years ago

@nhirschey and others - how is FSharp.Stats - https://github.com/CSBiology/FSharp.Stats - when it comes to being a good statistical package? I agree that this is one of the biggest gaps now, and I'm also not aware of any plans for Microsoft to publish anything in this space.

siavash-babaei commented 3 years ago

Dear @cartermp . See this example from

// get coefficients of 3rd order regression polynomial 
let regressionCoefficients = Fitting.LinearRegression.OrdinaryLeastSquares.Polynomial.coefficient 3 x_Data y_Data

// get fitting function of 3rd order regression polynomial
let regressionFitFunc = Fitting.LinearRegression.OrdinaryLeastSquares.Polynomial.fit 3 regressionCoefficients

A better design in my opinion, for example, would yield a single object from which one would then extract various bits per need.

Furthermore, why would one unnecessarily expose users to technical underpinnings - like the algorithm used for estimating model coefficients - that they would not need most of the time? I think people generally couldn't care less whether coefficients have been estimated using OLS or Maximum Likelihood or some form of Gradient Boosting or some entirely different beast.

let regModel = Fitting.LinearRegression.Polynomial.fit 3 x_Data y_Data

let regCoeffs = regModel.Coefficients
let regDiags  = regModel.DiagnosticPlots
let regEstms  = regModel.Estimates
...

Regression is a staple machine learning technique. Better be part of ML load. I think ML.NET already offers similar capabilities. Otherwise seems good.
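To illustrate the "single object" design proposed above, here is a minimal sketch in plain F#. It is not the FSharp.Stats API: `PolynomialModel`, `solve` and `fit` are hypothetical names, only a few of the suggested fields are included (no diagnostic plots), and the fit uses the textbook normal equations, which is fine for small, low-order problems but numerically fragile for high orders.

```fsharp
/// Everything a caller might want from one fit, bundled in a single value.
/// Hypothetical type: a sketch of the API shape, not an existing library.
type PolynomialModel =
    { Order: int
      Coefficients: float[]      // c0 + c1 * x + c2 * x^2 + ...
      Predict: float -> float
      Estimates: float[]         // fitted values at the training x
      Residuals: float[] }       // observed minus fitted

/// Solves A x = b by Gaussian elimination with partial pivoting (A is n x n).
let solve (a: float[,]) (b: float[]) =
    let n = b.Length
    let a = Array2D.copy a
    let b = Array.copy b
    for col in 0 .. n - 1 do
        // Swap in the row with the largest pivot for numerical stability.
        let pivot = [ col .. n - 1 ] |> List.maxBy (fun r -> abs a.[r, col])
        if pivot <> col then
            for k in 0 .. n - 1 do
                let t = a.[col, k]
                a.[col, k] <- a.[pivot, k]
                a.[pivot, k] <- t
            let t = b.[col]
            b.[col] <- b.[pivot]
            b.[pivot] <- t
        // Eliminate the entries below the pivot.
        for row in col + 1 .. n - 1 do
            let f = a.[row, col] / a.[col, col]
            for k in col .. n - 1 do
                a.[row, k] <- a.[row, k] - f * a.[col, k]
            b.[row] <- b.[row] - f * b.[col]
    // Back substitution.
    let x : float[] = Array.zeroCreate n
    for row in n - 1 .. -1 .. 0 do
        let s = seq { for k in row + 1 .. n - 1 -> a.[row, k] * x.[k] } |> Seq.sum
        x.[row] <- (b.[row] - s) / a.[row, row]
    x

/// Least-squares polynomial fit via the normal equations.
let fit (order: int) (xData: float[]) (yData: float[]) =
    let m = order + 1
    let a = Array2D.init m m (fun i j -> xData |> Array.sumBy (fun x -> x ** float (i + j)))
    let b =
        Array.init m (fun i ->
            Array.fold2 (fun acc x y -> acc + y * x ** float i) 0.0 xData yData)
    let coeffs = solve a b
    let predict x = coeffs |> Array.mapi (fun i c -> c * x ** float i) |> Array.sum
    let estimates = xData |> Array.map predict
    { Order = order
      Coefficients = coeffs
      Predict = predict
      Estimates = estimates
      Residuals = Array.map2 (-) yData estimates }

// Made-up data generated from y = 1 + 2x + 3x^2.
let regModel = fit 2 [| 0.0; 1.0; 2.0; 3.0 |] [| 1.0; 6.0; 17.0; 34.0 |]
printfn "coefficients: %A" regModel.Coefficients
printfn "prediction at 4.0: %f" (regModel.Predict 4.0)
```

The caller makes one call and then picks what it needs from the result, and the estimation method stays an implementation detail that could be swapped without changing the calling code.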

cartermp commented 3 years ago

Regression is a staple machine learning technique.

Oh for sure, and I think that the lines can sometimes get a little blurred between what is machine learning and what is plain old statistics. But I'm more curious if the library looks as if it could service some of the needs that @nhirschey is speaking towards. The shape of the API could change over time, especially if there is feedback indicating that it's conceptually challenging. I think the key thing is that there are APIs available, though, no matter how strange they may feel.

lqdev commented 3 years ago

Apologies ahead of time for the brain dump.

Love this discussion and would like to contribute in any way to help move this forward. Some of my (biased) thoughts:

Within .NET, F# is the language for analytical workflows. I, as I'm sure many of you do, strongly believe this to be the case.
ML.NET has a strong value proposition when it comes to modeling and deployment. Many classical machine learning algorithms are implemented and performance especially on large datasets is something that's difficult to get in the Python/R ecosystem. Regarding deep learning, there is more limited support, but the extensibility of the framework through TensorFlow.NET and ONNX help overcome some of the limitations. From a deployment standpoint, it runs (almost) anywhere .NET runs. This to me is extremely important. For various reasons, model deployment is still a problem that's hard to solve. The typical approach is, let's wrap the model in a Flask web service and call it a day. I'm sure @matthewcrews may have a thought or two on the topic :). In .NET, even if you're deploying your model as a web service, from a performance standpoint, you get so much for free with ASP.NET core. Not to mention using something like Giraffe or Saturn on top of that and you get additional benefits. In .NET you can take deployments a step further and also deploy to Desktop, WASM, and hopefully Mobile / IoT devices (when ARM support is made available). More importantly, it's the same model you're deploying across all these devices and deployment targets. Not a different version/format per device.
With that said, ML.NET has its faults. Reconciling the first and second points, the ML.NET API is verbose and is weird to use from an F# standpoint. Not unusable, but also not fluid. There are a lot of things that happen before modeling from a data loading, wrangling and visualization standpoint. These are all areas where ML.NET falls short, although many of the components are there from a .NET standpoint. There is the DataFrame API, .NET for Apache Spark and IDataView to represent data, but there's little to no interop between them. Same with visualizations. There is little to no interop that allows for seamless visualizations coming from these formats. I'm sure @WalternativE has plenty of thoughts on this :) . The missing parts of ML.NET and better F# support are areas that the ML.NET team is actively looking into and would love guidance around. I think this thread helps illustrate a lot of the pain points and is great source material on areas of improvement. Again though, I'm not advocating ML.NET as the one-stop shop for all things ML. However, it should be seamless to move back and forth between Spark & Pandas DataFrames or similar types of data representation and visualization formats to make the handoff between preprocessing and modeling smoother.
Keeping in mind maintainability and resources, abstractions and extensibility may be the path of least resistance, as others have mentioned, particularly in areas where these efforts are built around standards. When it comes to data representation, there's Apache Arrow. To a certain extent, Arrow is supported by .NET for Apache Spark DataFrames, the DataFrame API, and I believe IDataView as well. Greater interop between these in the .NET ecosystem, as well as extensibility to libraries from other languages, would make it easier to move back and forth between them. Likewise, from a visualization standpoint, using a standard data representation format may help interop not only within .NET but more broadly with other Python / R libraries. A bit off topic, but .NET Interactive has shown that these polyglot types of environments may work well together. From a modeling and deployment standpoint, the same applies: build around standards. ONNX is one of those standards. ONNX itself has some shortcomings, though, mainly around not fully delivering on the interop promise, because in the end it's up to each framework to implement and maintain the ONNX operations. TensorFlow is one of the biggest offenders when it comes to not making it easy to use with ONNX. This is for inferencing purposes. However, for training, I believe there are some investments being made (see "Announcing accelerated training with ONNX Runtime: train models up to 45% faster").

I'm not sure that a one-stop-shop solution can address all of the individual steps in the workflow. For example, in Python, you do your data wrangling with Pandas or Spark, but maybe model with SparkML/Scikit/PyTorch. But having something like FsLab, or maybe now SciSharp, as a collection or proposed SAFE stack for ML is definitely worth doing, so long as these libraries embrace interop and extensibility built around standards, both within the .NET ecosystem and in the overall DS/ML ecosystem. I'm not sure if this is the best way to go about it, but I've created a Gist listing the different steps in the data science / ML workflow and the .NET libraries that support each of them. I'm happy to take suggestions on where something like that may live, and to help add to it to get the curation going for actively maintained libraries.

matthewcrews commented 3 years ago

@lqdev I really appreciate hearing your thoughts and they echo many of the concerns that have come to my own mind. I think F# is well positioned to be a "full-workflow" language for Data Science/ML. This includes data wrangling, transformations, modeling, testing, and finally deployment. Python/R have optimized the Data Scientist workflow (from data import to validated model) but they are poor when it comes to putting a model into production. Python/R are great languages, for sure, but they were not built with production deployments in mind first and foremost (I'm sure I will catch grief for this). The jump from development to production is non-trivial (runtime and library versions can be a real nightmare, not to mention debugging, logging, error reporting, refactoring, etc.).

I realize that my initial comments were off topic from what @dsyme was originally asking for, so my apologies. I believe Python/R have benefitted from a "standard" set of beginner libraries. 90% of an analyst's work can be done with those initial libraries. If a beginner's needs are not met by those initial libraries, it is easy to reach out to the community to find a more apt library (e.g. data.table for data munging). I think it would be worthwhile to have a curated list of "basic" libraries which cover the basic workflows for Data Science/Analytics in the F# stack.

Long term, I think that we should consider having someone advocate for the "ergonomics" of F# for Data Science/Analytics/Statistics. It goes beyond just having the necessary libraries. The workflow of an analyst is fundamentally different from that of a software developer. I don't think these things need to be at odds though. I think there can be a synthesis of needs. Much of what I do at Quicken these days is taking the needs of disparate departments and finding the solution that integrates these requirements. There are a couple of key features which could really close the gap with Python/R when it comes to analyst productivity.

I would love to help ease the onboarding of new F# developers however I can. I'm working with my company to try to create an F# advocate position internally. We will see how that goes. We benefit a lot from the F# language, so I hope to make the case that investing in the community should be a first-class investment.

nhirschey commented 3 years ago

@cartermp, I wasn't familiar with FSharp.Stats before this thread. Looking through it, it has nice F# syntax for descriptive statistics like Seq.stdDev, etc. That is useful and they're building out a good foundation. I applaud their work.

However, from what I can tell, as with ML.NET, MathNet.Numerics, Accord, etc., the regression modelling is currently inadequate for research purposes. There are no facilities for panel data models (https://en.wikipedia.org/wiki/Panel_data) or time series models with appropriate standard errors. I had hoped that ML.NET would develop these, but it still only has facilities for plain-vanilla standard error calculations in regressions. This is why, despite preferring F#, I always have to drop back into R whenever I get to the modelling stage of analysis.

I know this is a chicken/egg problem. And I'm thrilled with the F# 5.0 scripting and notebook improvements. I only point it out in the spirit of "what are some things holding back F# in the data science area."

kMutagene commented 3 years ago

Very interesting thread, lucky that I found out about it via Twitter 😄. This post may get quite long; I hope that it can add some value to the discussion.

Seems like the time where we from @CSBiology can chime in and try to contribute to this conversation.

Within .NET, F# is the language for analytical workflows. I, as I'm sure many of you do, strongly believe this to be the case.

This has been our central dogma for 5+ years now. We are at our core a bioinformatics workgroup at a German university. We are using F# for data analysis and research. We love its syntax, immutability by default, and that it has a functional-first approach. We believe that for these reasons it is perfect for the general workflows that we see every day, where you basically have several (complex) mapping and folding steps of data to produce results. This is also the reason why we prefer it over both R and Python, which are the de facto 'standard' in our field due to majority vote.

When we started using F# extensively (and now almost exclusively), we saw a lack of, let's call them standard libraries for our kind of application, the most important ones being plotting and (descriptive) statistics.

Therefore, we were often forced to write our own functionality for specific tasks in our analysis scripts, which eventually resulted in libraries like BioFSharp (our main bioinformatics library), FSharp.Stats, and FSharp.Plotly (which is now Plotly.NET directly maintained under the plotly organization). I am aware that alternatives for some of these libraries were developed in parallel, but the general workflow for data analysis that we are using is the following:

Read raw data either via format-specific readers or Deedle (which is an awesome library by the way)
Aggregate and transform the data based on the question at hand, either with specific functions from BioFSharp or any kind of statistical functionality of FSharp.Stats. This happens on the Deedle frame when possible, but there may be the need to get the numeric columns from Deedle, process the 2D array, and then go back into the Frame environment.
Visualize analysis performance indicators, data distribution, and results via Plotly.NET at any point of the analysis, use the interactive plot to get insights, and repeat the analysis cycle accordingly.
Write the result files and annotate the computational workflow via ontology annotations (admittedly WIP)

The reason why these libraries may be lacking functionality in specific regards or cover more than the name might suggest is that we improve them on a use-case basis, meaning as long as we don't need new functionality in our analysis pipelines, those features keep a low priority. FSharp.Stats is a good example here, as it contains e.g. pretty extensive fitting and signal processing features you might not expect from just the name, but may be lacking statistical features as indicated by other posts here.

What all of this basically boils down to is the following: we gladly contribute to the F# OSS community, but may not have the time and resources to make these libraries feature complete. We would however love to have more contributors and help shape the data science landscape in F# and .NET. This may also be by integrating our projects into existing ones.

bvenn commented 3 years ago

@cartermp @nhirschey Our idea for FSharp.Stats, as also laid out by @kMutagene, emerged for the reason you already mentioned: there was a lack of statistical packages compatible with our F# data analysis workflows. Step by step, we aim to spot and implement basic as well as higher-level statistical functions/algorithms applicable in scientific data analysis. As @kMutagene already mentioned, the project is extended when further functionality is required in order to analyze our data.

FSharp.Stats aims to cover topics such as:

fitting and interpolation
signal processing and clustering
linear algebra
and of course descriptive statistics and statistical testing.

Regarding data fitting, FSharp.Stats covers simple linear regression with confidence and prediction band determination, polynomial regression, and cubic/smoothing spline regression. Models for nonlinear regression cover several growth curve models and other common models such as Gaussian, exponential or logistic functions.
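For readers who want to see what the simplest of these cases involves, here is a minimal sketch of an ordinary least squares fit of a straight line, written in plain F#. It deliberately does not use the FSharp.Stats API: the helper `fitLine` and the sample data are made up for illustration, and confidence and prediction bands are left out.

```fsharp
// Ordinary least squares fit of y = intercept + slope * x.
// Plain F#, no library calls; `fitLine` is a hypothetical helper,
// not an FSharp.Stats function.
let fitLine (xs: float[]) (ys: float[]) =
    if xs.Length <> ys.Length || xs.Length < 2 then
        invalidArg "xs" "need two or more paired observations"
    let meanX = Array.average xs
    let meanY = Array.average ys
    // Sum of cross-deviations and sum of squared x-deviations.
    let sxy = Array.fold2 (fun acc x y -> acc + (x - meanX) * (y - meanY)) 0.0 xs ys
    let sxx = xs |> Array.sumBy (fun x -> (x - meanX) ** 2.0)
    let slope = sxy / sxx
    let intercept = meanY - slope * meanX
    intercept, slope

// Made-up data lying exactly on y = 1 + 2x.
let intercept, slope = fitLine [| 0.0; 1.0; 2.0; 3.0 |] [| 1.0; 3.0; 5.0; 7.0 |]
printfn "intercept = %g, slope = %g" intercept slope
```

The sketch does not guard against all x values being identical, in which case the slope is undefined.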

If you have suggestions for panel data regression strategies or any others, we would be happy to consider them for future extensions of the FSharp.Stats library.

nhirschey commented 3 years ago

@bvenn and @kMutagene, sounds good. I'll do some work with your library to get familiar and then raise some issues in your dev repo where we can discuss how to collaboratively improve the feature set and api ergonomics.

kMutagene commented 3 years ago

@cartermp

Regarding all things plotly (I maintain both Plotly.NET and Dash.NET):

So, where does that leave XPlot? Unsure, since the plotly package is the most-used and the most feature complete, and I'm likely to encourage people to move to Plotly.NET once it reaches 2.0.0.

That sounds like a good plan. Contributions from our side follow the same rules as I pointed out above, so help in reaching that 2.0.0 milestone is definitely wanted. I think the strength of XPlot also comes from it offering additional charting APIs.

Becoming feature complete in Plotly.NET is ultimately a question of metaprogramming, as it is impossible for me alone (or a small group of OSS contributors) to keep up with the changes coming from plotly.js by implementing them by hand. So maybe tagging @Shmew here, as generating the Plotly Feliz bindings seems to be automated already. Plotly offers a JSON schema for all charts and style parameters, so it is in theory possible to auto-generate Plotly.NET bindings from that, maybe via something like Myriad.
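As a rough illustration of the schema-driven idea, the sketch below reads a tiny JSON fragment and emits one F# setter per attribute as source text. The schema fragment, the type mapping, and the shape of the generated functions are all assumptions made up for this example; the real plotly.js schema is far larger, and this is not how Plotly.NET, Feliz.Plotly, or Myriad actually work.

```fsharp
open System.Text.Json

// Hypothetical schema fragment; the real plotly.js schema is far larger
// and its exact layout is not reproduced here.
let schemaJson = """
{
  "opacity": { "valType": "number" },
  "name":    { "valType": "string" },
  "visible": { "valType": "boolean" }
}
"""

// Map a schema value type to an F# type name (assumed mapping).
let fsharpType (valType: string) =
    match valType with
    | "number" -> "float"
    | "string" -> "string"
    | "boolean" -> "bool"
    | _ -> "obj"

// Emit one setter function per attribute as F# source text.
let generateSetters (json: string) =
    use doc = JsonDocument.Parse(json)
    [ for prop in doc.RootElement.EnumerateObject() do
        let valType = prop.Value.GetProperty("valType").GetString()
        yield sprintf "let %s (value: %s) = (\"%s\", box value)" prop.Name (fsharpType valType) prop.Name ]

generateSetters schemaJson |> List.iter (printfn "%s")
```

Building source text as strings is the crudest option; a real generator would more likely produce a syntax tree, which is where a tool like Myriad would come in.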

@nhirschey regarding static image export: Plotly now uses Kaleido for static image export from charts, which I already wrapped in a quick-and-dirty POC here, but I just have not had the time to finish it and incorporate it into Plotly.NET and Kaleido.

Then there is Dash.NET, which I am currently working on. The Python version of Dash offers Jupyter integration, which is also the ultimate goal for the .NET version with regard to dotnet interactive. I think when that project matures to this state, we will have a pretty nice toolbox for visualization in notebooks.

In general I think that the base work has been done on many ends, and now we need community effort to bring all these ends together, be it a curated package, or just a loose set of components that work well together.

Shmew commented 3 years ago

Yeah, Feliz.Plotly generates about 95% of the bindings from the JSON schema. Currently it's all done via building strings so I could get something released. If the list-style API isn't considered too foreign for normal F# code, I think it should be possible to build a generation library that's implementation independent. We've discussed it a little bit in an issue in the XPlot repo. I am quite confident in the end result, as I'm currently using it in a production Fable app, and it has all the plotly.js examples implemented in the docs.

siavash-babaei commented 3 years ago

Ok, so this has taken a sharp turn towards serious Plotly discussion. Maybe we steer it back towards FsLab and its present and future.

dsyme commented 3 years ago

SciSharp doesn't really have F# on their radar, at least it seems to me that way.

I have discussed this with Haiping, who is one of the technical originators and guiding forces behind SciSharp.

He is very open to aligning F# work into SciSharp, by

adding F# samples/documentation
changing the SciSharp messaging to be about "C# + F#" or "F# + C#" as documentation becomes available
bringing some F# community projects under that umbrella where appropriate

He has made me co-admin on https://github.com/SciSharp to help facilitate changes. Of course I can't make all these changes myself - but it helps in adjusting messaging, getting multiple perspectives etc.

So basically the door is open on the SciSharp side to work towards merging some efforts here towards a win-win for .NET, F# and C# in this space.

This of course includes continued cooperation and communication with other projects both in and out of the SciSharp umbrella. For example, I'm working on DiffSharp, which is not under the SciSharp umbrella.

This is not a full solution, but it gives us options which change the status quo.

matthewcrews commented 3 years ago

I'm all for merging efforts with SciSharp. My main concern will be whether the SciSharp libraries will have APIs that are F# friendly. One of the pain points of ML.NET has been that the API does not play well with F# paradigms, at least that has been my experience. I believe wrapping OO method calls in F# functions is frowned upon so I am unsure of the best way to address the style differences of C# and F#.
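For concreteness, the wrapping approach in question looks roughly like the sketch below. `PipelineBuilder` is a made-up stand-in for a fluent, method-chaining C#-style API (it is not ML.NET or any SciSharp library), and the `Pipeline` module shows thin curried wrappers that take the object as the last argument so they compose with `|>`. Whether this is the right way to bridge the two styles is exactly the open question.

```fsharp
// Hypothetical fluent, method-chaining builder of the kind common in
// C#-first libraries. It stands in for a real API and is not ML.NET.
type PipelineBuilder(steps: string list) =
    new() = PipelineBuilder([])
    member _.Normalize(column: string) = PipelineBuilder(steps @ [ "normalize " + column ])
    member _.Train(label: string) = PipelineBuilder(steps @ [ "train on " + label ])
    member _.Steps = steps

// Thin wrappers: curried functions with the builder as the last argument,
// so they compose with the |> operator.
module Pipeline =
    let empty = PipelineBuilder()
    let normalize (column: string) (builder: PipelineBuilder) = builder.Normalize(column)
    let train (label: string) (builder: PipelineBuilder) = builder.Train(label)
    let steps (builder: PipelineBuilder) = builder.Steps

let result =
    Pipeline.empty
    |> Pipeline.normalize "Price"
    |> Pipeline.train "Label"
    |> Pipeline.steps

printfn "%A" result
```

The cost of this style is that every method needs a hand-written wrapper that must be kept in sync with the underlying library.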