fslaborg / zzarchive-FsLab

A collection of packages for data science with F#
http://fslab.org

Plan for updating FsLab #137

Closed dsyme closed 3 years ago

dsyme commented 6 years ago

Now that @zyzhu has started updating Deedle to netstandard 2.0, we should look at updating the whole FsLab collection.

@zyzhu - do you use FsLab as an entire collection or just the individual pieces?

dsyme commented 6 years ago

Here are some stream-of-consciousness thoughts on steps to update FsLab:

Also:

Links to various parts of this as they get done:

zyzhu commented 6 years ago

@dsyme I only use Deedle and XPlot. Data is consumed via Dapper. I prototype all my research via FSI. I do hope dotnet core support for FSI will come soon, as I refresh the issue every day :)

I used to rely heavily on the SqlClient type provider, until it turned out not to be supported on dotnet core. I do use the latest SignalR, so Dapper is how I mitigate the problem.

I feel we should discuss the bigger picture of F# in data science, rather than restricting ourselves to updating the current libraries to netstandard. @tpetricek

Among the FsLab packages, I found FSharp.Formatting to be a weak link. I experimented with it before, but found it not very productive to use, as it requires some boilerplate and templates in html/razor. The documentation and samples were not in good shape either.

For a data scientist or a quant in finance, I would generalize the workflow as data retrieval -> computation -> visualization. A Jupyter notebook is the de facto place to start, as almost zero boilerplate is required. The notebook can easily be rerun to replicate research step by step. On top of that, a dynamically typed language is very easy to get started with for people with a stats/math background. That's partly why Python has built up an amazing ecosystem.

Though I strongly prefer statically typed languages, I cannot deny the productivity of the Python ecosystem. That productivity comes at a cost in speed, scalability, type robustness and integration. In practice, though, most research teams are willing to accept these long-term costs in exchange for a short-term productivity boost at the idea generation stage.

In order to grow a community using F# in data science, avoiding boilerplate is the first prerequisite for boosting productivity. The productivity gained from the FSharp.Data type providers cannot make up for the cost of preparing boilerplate to plot results. I found IfSharp quite useful, as I can run all my F# scripts and visualize results using XPlot easily. The scripting and debugging experience is not as great as in Visual Studio and FSI, but I just need to copy and paste my script into IfSharp and use it as a notebook to record and visualize results. Most of the time it works right away.

We should improve the experience of FsLab on IfSharp as a priority by providing more documentation. It took me a while to dig through various issues to get a Deedle frame printed. Having to dig through these issues will discourage newcomers to F#. More sample Azure notebooks on various topics to educate the community would be very useful. @cgravill

Another promising path is a polyglot Jupyter notebook such as BeakerX by Two Sigma. It has just released version 1.0 with built-in two-way autotranslation. There was an old Beaker F# kernel. Maybe it could be ported so that F# can coexist with all the other languages and ecosystems in the same notebook. Then more libraries and visualizations would be at hand. @aolney https://github.com/twosigma/beakerx/issues/5039

Including an ML library such as Accord.Net/TensorFlowSharp is a good idea. But I am not an expert on it. Maybe @mathias-brandewinder has some good suggestions.

I would also include an optimization library, Google.OrTools. They plan to release its FSharp library targeting netstandard in the next version soon. It would serve another group of users, those with linear optimization problems. I've compiled its F# library for use in production and found its F# examples very elegant. https://github.com/google/or-tools/issues/722

TonyHenrique commented 6 years ago

I don't know if I missed something, but I feel that F# needs to support generating XML from complex XSD schemas. This is used heavily here in Brazil by the government for sales and medical data, and it would be good to have an easy way to get type safety when generating XML from our data using the XSD schemas provided by the government.

See https://github.com/fsprojects/FSharp.Data.Xsd/issues/26#issue-336825128

dsyme commented 6 years ago

It's a great discussion, please continue, all the comments are enlightening.

Note the list of work items above is not meant to be comprehensive and is a bit stream-of-consciousness.

My take is that FsLab should be a collection of packages which "work together and you don't regret". That is, the packages should

Basically you want to be able to "add an FsLab reference" and do some data-science workbook programming, whether that be in Visual Studio, VS Code, Jupyter notebooks or whatever.

Equally you should be able to back out of using FsLab and just use individual packages with the same effect.

Machine learning packages for .NET are a little tricky for FsLab. The more complete ones like Accord.NET (which is great) tend to be a complete set of packages in their own right (which is also great). Other packages like ML.NET are a little too early to include. So in general I agree

Interop packages like RProvider, Python, MATLAB provider, Excel provider etc. are tricky too. On the one hand these are incredibly useful when they are needed and work, and can benefit from regular integration and use with other components. They are also sometimes painful to get working the first time, and people sometimes shy away from them. On the other hand they are a source of considerable complexity and documenting them can be tricky.

Note that one approach would be to abandon FsLab as an "integrated" package and simply document the choices and how to get started with them.

Finally, FsLab today takes a very strong approach to literate programming, and I agree with @zyzhu that FSharp.Formatting is a bit of a weak link. I need to understand better where we should end up here.

jackfoxy commented 6 years ago

I think I'm close to having an XPlot netstandard2.0 PR https://github.com/jackfoxy/XPlot/tree/magicmode It builds in VS, but I'm getting a strange error with Newtonsoft.Json not being recognized in the build target of the build script.

dsyme commented 6 years ago

@jackfoxy That's great. I did a couple of updates to XPlot to fix the paket bootstrapper and documentation generation; you'll want to integrate those.

jackfoxy commented 6 years ago

If it's already in master, I'll merge. I also implemented paket magic mode, which is possibly what you did @dsyme

sebhofer commented 6 years ago

Thanks to @zyzhu and @dsyme for starting this discussion. I'm really happy to hear about these developments! I agree with most of zyzhu's points; still, as one of your goals seems to be to attract new users to FsLab, I feel that providing my 2¢ of opinion could be helpful. From my experience, starting out with F# for data processing can be quite tough for a newcomer (I'm coming from a science background) for several (some non-technical) reasons. Some thoughts:

dsyme commented 6 years ago

@sebhofer I agree with all those points, thanks. My first aim here is to get FsLab "clean" and spark a round of work on fundamentals like .NET Standard support. But we can also reassess its whole construction - I'm still not sure it should be anything but a template of the kind you propose (does it even need to be a combined nuget package?)

FsLab is, at the moment:

plus a template. These seem reasonable (though FSharp.Charting should I think be dropped now). I think each is quite well documented (once links are all fixed). But the centrality of the literate programming support is questionable in the world of notebooks.

There are also transitive dependencies on

the first two of which are questionable, and also optional dependencies on:

dsyme commented 6 years ago

@jackfoxy Could you send a PR for your XPlot .NET Standard 2.0 work, even if it's not yet quite complete? Then we can discuss and others can help get it over the line. Thanks

dsyme commented 6 years ago

"Starting with Deedle was surprisingly hard for me coming from a dynamic world. Although the documentation is quite extensive, I still needed a lot of time to figure out seemingly simple tasks."

@sebhofer I agree with this and I'm concerned by aspects of the Deedle design. It's possible there are also just better data frame libraries emerging for .NET as well, especially with regard to simplicity and discoverability. We need to reassess this.

jackfoxy commented 6 years ago

@dsyme https://github.com/fslaborg/XPlot/pull/75 not merged with latest master

cgravill commented 6 years ago

I've merged a change to IfSharp to target .NET 4.7.1 to ease interaction with .NET Standard 2.0 https://github.com/fsprojects/IfSharp/issues/181 There is some odd behaviour but with that I'm able to use ML.NET 0.3 in the context of a Jupyter Notebook.

It'd be great to have improved support for FsLab. There was some initial work on this in https://github.com/fsprojects/IfSharp/issues/156 but more would be great. The helper script approach does have discovery issues but it's meant we can keep the core cleaner.

sebhofer commented 6 years ago

@dsyme I'm certainly in no position to judge the Deedle design, but I experienced that it's quite easy (for a beginner) to get bad performance if one is not careful. In my case I had to hack my own merge (or join?) function, because the built-in one would just not finish in reasonable time. (The reason was that the built-in version was too general for my problem and could be simplified considerably.) This is in principle not bad, but certainly slows you down in your day-to-day work. So there is certainly some room for improvement.

To finish, I also have to say that it's just great that @tpetricek is so responsive on stackoverflow with respect to any Deedle issues that crop up (or any F# related problem for that matter :)!

nhirschey commented 6 years ago

Deedle

The work on Deedle is tremendous, but (coming from R, SQL) I unfortunately found it complicated to understand the programming model and gave up on it. I found much more success using base F# data structures. It was far simpler.

That said, saving frames to files and using frames to pass data to/from RProvider are fantastic.

The time series join stuff is also great, but I just end up using Array.find for inequality searches or maps for equality searches.

FSharp.Data

The lack of a function that writes a collection of records to a CSV file is a weak point when saving intermediate results. Hand-mapping 30-column records to a CSV row type is neither practical nor type safe (it is easy to accidentally transpose two neighboring columns of the same type). So I resort to a version of this: https://stackoverflow.com/questions/25086198/list-of-string-in-a-record-to-csv
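The idea behind that workaround, deriving the CSV columns from the record's field names instead of mapping them by hand, can be sketched as follows. This is a minimal illustration in Python rather than F# (in F# the analogous approach would go through reflection over record fields); the `Trade` type, its fields and the sample values are hypothetical and not taken from the thread.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Iterable, TextIO


@dataclass
class Trade:  # hypothetical record type, for illustration only
    symbol: str
    price: float
    quantity: float  # same type as price: easy to transpose when mapped by hand


def write_records(records: Iterable, record_type: type, out: TextIO) -> None:
    """Write dataclass records as CSV, taking column names from the field names."""
    names = [f.name for f in fields(record_type)]
    writer = csv.DictWriter(out, fieldnames=names)
    writer.writeheader()
    for record in records:
        # Values are matched to columns by name, so field order cannot be mixed up.
        writer.writerow(asdict(record))


buffer = io.StringIO()
write_records([Trade("ABC", 10.5, 200.0)], Trade, buffer)
csv_text = buffer.getvalue()
```

Because each value is looked up by field name, adding or reordering fields in the record changes the header and the rows together.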

formatting

My current workflow is to do calculations in F#, save to CSV, then do literate programming in R Markdown documents for tables and figures. The blocking issue for using F# formatting is automatic LaTeX table formatting of fancy regressions. I guess integration with R LaTeX formatting via RProvider is possible, but I haven't tried it.

I think it will be hard to make a lot of progress here, because the first step is to have the statistical models, then second formatting for it. The holdup is the statistical models.

packages used most often

Overall

  1. The real limiting factor is easy integration with statistical models. The .NET way is weird and lacks a lot of stuff in R or Stata or SAS; the DSL work by Matthias would have the most impact, coupled with modern standard error functions. But I know the only way for this to happen is contributors. RProvider would be fine for models, except that I want literate formatting too so I might as well just use R.

  2. The proposed (I think) "#r paket FSharp.Data " syntax would make it far easier for beginners in scripts.

  3. Figuring out project/solution files is still the thing that took me the longest to get. My only purpose is to put common code used across multiple .fsx files in xxx.fs files. Leaving in .fsx is problematic if A.fsx has common code used in B.fsx and C.fsx, but C.fsx also needs to load B.fsx. There probably needs to be documentation showing how to go from a simple script file to a larger project. Simple, but important.

aolney commented 6 years ago

At the risk of piling on, since I was mentioned in an earlier post, I thought I'd give an update on Beaker for polyglot notebook programming. If this is an unfamiliar concept, basically it means you have a computational notebook that is simultaneously connected to multiple language kernels, and accordingly you can program in any of the corresponding languages across cells. So you can munge some data in F# and then in the next cell do some statistical modeling in R.

I've been using Beaker for several years with real workloads, and it works very well. The project has recently pivoted towards supporting Jupyter, with a fairly huge loss of functionality during the pivot. The best current polyglot alternative within Jupyter seems to be the SoS kernel, which can also be used in JupyterLab. So far I've only used SoS for small workloads, but it seems very solid.

In Beaker I've had notebooks that use F#, R, Scala, Groovy, Java, and Javascript, using each where it works best (and potentially has a library dependency that I need). From my perspective, this is far better than trying to bring libraries developed in other languages into F# because:

Polyglot notebooks can have some issues, but these seem to mostly be self-inflicted. For example, Beaker had specific kernel connection code for each supported language, making it difficult to maintain dozens of kernels. Also autotranslation (passing data structures between kernels through the notebook) is a cool and often touted feature that can be difficult to implement well with many edge cases. If autotranslation had a very basic implementation, then many of the associated problems would disappear. In practice I've found it's not really that useful except for passing configuration information between cells (e.g. file paths) because data of non-trivial size needs to hit the disk anyways, where it can be read by other kernels.
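The pattern described here, where only a small piece of configuration such as a file path crosses between cells and the data itself goes through the disk, can be sketched roughly as below. Both "cells" are plain Python functions standing in for two different kernels; the function names and the file name are made up for illustration.

```python
import csv
import os
import tempfile


def producer_cell(rows, directory):
    """Stand-in for the first kernel: write rows to disk and return only the path."""
    path = os.path.join(directory, "shared_frame.csv")  # hypothetical file name
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["key", "value"])
        writer.writerows(rows)
    return path


def consumer_cell(path):
    """Stand-in for the second kernel: it receives the path and reads the data itself."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


with tempfile.TemporaryDirectory() as scratch:
    shared_path = producer_cell([("a", 1), ("b", 2)], scratch)
    loaded = consumer_cell(shared_path)
```

Note that the integers come back as strings after the round trip through CSV. Recovering types is one of the edge cases that makes richer autotranslation harder than passing a path.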

Anyways, it seems that polyglot notebooks are here to stay. I habitually use F# within this context (favorite language naturally) but use other languages in the notebook when their native support is a more natural fit. As far as F# kernels are concerned, since Jupyter has replaced Beaker, the ifSharp kernel is the best F# kernel to keep moving forward. I've used ifSharp with SoS/Jupyter and it works great.

zyzhu commented 6 years ago

@aolney Thanks for sharing your experience. Your points clarify my confusion about Beaker and BeakerX. I took a quick look at SoS. It seems that it requires setting up a language module similar to https://github.com/vatlab/sos-r/tree/21883327750a1089066e8933843131d6271bfd74 so that SoS can interop F# with other languages. I found the documentation here https://vatlab.github.io/sos-docs/doc/documentation/Language_Module.html

Is that how you got started using IfSharp on SoS? Is there any possibility of creating a pull request to share your language module with SoS, so that it can support the IfSharp kernel out of the box? That would help the F# community get started with SoS notebooks.

aolney commented 6 years ago

I've been using a Jupyterlab installation but I think the process is similar for Jupyter notebook.

I believe this is all it takes:

pip install jupyterlab
jupyter labextension install jupyterlab-sos
jupyter kernelspec install ifsharp 

In other words there's no need for a new language module b/c ifSharp works with Jupyter.

My understanding (could be wrong) is that the links you provided are only needed for certain functionalities like autotranslation and syntax coloring. They may also be needed for future capabilities like intellisense and linting.

zyzhu commented 6 years ago

@aolney Thanks for clarifying. Yes, I was interested in autotranslation so that variables from Python and R can be used in F# and vice versa. I already got the F# kernel working on SoS.

One step further would be autotranslation between pandas dataframes and Deedle frames. But that requires Feather format support, as that's how it's done between R and Python right now. https://github.com/vatlab/sos-r/blob/21883327750a1089066e8933843131d6271bfd74/src/sos_r/kernel.py#L83

You mentioned that data needs to be dumped to a file before being consumed by another language. I see that's how it's done between pandas and MATLAB now https://github.com/vatlab/sos-matlab/blob/d818cb93b8988bb8ecf9e4910c12fe7ab9538e73/src/sos_matlab/kernel.py#L102 We could do that instead to support interaction between Deedle and pandas dataframes.

Guess this would be another long-term project. At least the path looks clear.

siavash-babaei commented 3 years ago

So, say everything existing is updated and working nicely. Well, a proper RProvider would give you access to whatever is missing in F# and just about anything in Python and then some. RProvider, though, has not worked for quite some time, since R 3.5.

Missing, Missing, Missing:

siavash-babaei commented 3 years ago

For whatever product, you would require a few killer features that would make it indispensable, and for F# that could easily be the entire data analytics and data science workload, the same thing that greatly helped propel Python to the front. The user base, being more mathematically inclined and comfortable with the syntax (I just love/adore it but dunno why it makes lots of people uncomfortable), with ideas of immutability, and with the core of the language being input -> function -> output, would be much better adopters than, say, developers active in GUI or web. There are other areas, I am sure, for example business applications that fit nicely with Domain-Driven Design. But data science workloads (incidentally, a perfect match for DDD) are certainly worth the investment, especially as they seem to be growing exponentially in both volume and utilisation. If you think about it, one of the most active open source big data projects, Spark, is only 7 years old, with many users adopting a difficult language like Scala just to use the full Spark capabilities and performance. The community as a whole seems to be more accepting of learning and of new tech that makes their life easier.

FsLab could be that unified environment for data analytics pipelines, with a comprehensive suite of up-to-date tools accessible from whatever OS, with pieces that have the necessary awareness of each other. It is kind of pointless to have a data frame that cannot be readily consumed by the tools you use to analyse your data: ML.NET has its own data frame and, btw, it seems very inferior to that of Deedle; and Accord.NET has its own extremely horrible way of consuming data in the form of arrays. It is going to be an involved process, though, starting from selecting a set of standard features that the community and, more importantly, the language requires in this regard; a lot of input is needed from developers and, more importantly, users.

Further steps could even involve attracting corporate support and money. Ideally, you would end up in an environment like MATLAB, R, or Julia, where you can readily hack quick-and-dirty, just as well as develop polished applications (very clumsy and difficult to do in R/MATLAB and unsound/non-performant in Python).

siavash-babaei commented 3 years ago

Corporate support could be subtle, could be a lot of things from adoption and critique to code contribution to money, marketing, etc. For example,

dsyme commented 3 years ago

Adding comments from https://github.com/fslaborg/RProvider/issues/209

It's a good time to finally address this issue. There are many questions being discussed here. Let's just deal with the question of FsLab and its pieces.

Here are my opinions:

Looking forward to making some progress here....

dsyme commented 3 years ago

BTW as can be seen from the discussion above I took a crack at modernizing FsLab to .NET Core in 2018. I was honestly shocked how hard it was back then. It will be much easier now.

FsLab had dependencies on FSharp.Formatting for literate scripting, including Razor for templating, and relied on mono. At the time, FSharp.Formatting was barely functioning on .NET Core, and we only removed the Razor dependency earlier this year.

Anyway, in 2020 I finally went through and added a usable .NET command-line tool, "fsdocs", to FSharp.Formatting, which includes the literate scripting functionality of FsLab.

Another major factor is VS and VSCode. VSCode is now the obvious place to centralise all such work.

siavash-babaei commented 3 years ago
WalternativE commented 3 years ago

I'm by no means a solid practitioner (currently trying to get into the field from my background as a software engineer) so I'm talking more opinion than knowledge here.

I think @siavash-babaei already lays down a lot of valid points. I'd just like to add that it would be great to get an idea of where we currently are as a community with regard to data science/machine learning. Take the people at https://github.com/CSBiology, who are using F# in their research and mainly fell back to writing a lot of things themselves. SciSharp doesn't really have F# on their radar, at least it seems that way to me.

Many FsLab projects are indeed a bit 'stale'. I've been having good success working with Deedle, but there are a lot of things one could improve (especially in writing docs). From a technology standpoint the new Microsoft data frame is really fancy (being built on Arrow), but I wouldn't have felt as productive with it as I did with Deedle. ML.NET works well enough in combination with it, and even though the API is a bit odd it is quite fun to use, in the sense that you feel quite productive. If you want to really tweak it, it gets kind of tedious.

The thing I'm missing most in comparison to a mature environment like the R Tidyverse is the ease of going from exploration to wrangling to modelling to validation and back again. In my current work with F# I really feel that the different libraries I use were built by entirely different teams with little to no regard for each other. Yeah you can somehow plot the data in your Deedle frame...you just have to 'un-deedle' it first. Same for modelling in ML.NET. It would be nice having a set of abstractions, that make this easier. At least some common ground with other projects. Like...I'm currently not even sure how many implementations of linear algebra libraries (and libraries that build upon them) are around. I just know, that they most likely aren't compatible. SAFE is great because it is a set of nice defaults. If I don't like the defaults it is trivial for me to swap them out.

Hope this added something other than more confusion to the discussion. 🙇‍♂️

siavash-babaei commented 3 years ago

Thank you @WalternativE. Actually, Tidyverse can act as a blueprint for similar capabilities in F#. The syntax is very nice and functional, providing a sort of HowTo. The whole pipeline from importing, to cleaning, transformations, visualisation, modelling, and communicating results must be handled in ONE unified framework and, just as importantly, with a simplified, compatible, efficient, and intuitive syntax. No matter how good Deedle is, what's the point if it is not directly consumable in ML.NET or whatever other tool? MBrace might be excellent, but what's the point while you don't have access to established tools for the job? Also, if we want it to take off, there should be ports in C#, R, Python, Java, etc., to attract a sufficient userbase and gain a foothold... In any case:

Once we have made the adoption and added F# backends, we can begin to influence the underlying projects bit by bit, provided we are quick enough in adoption and not too late to the party. Caution: with Python BDFL, GVR, moving to Microsoft, will they just can ML.NET, which was supposed to be a comprehensive framework, in favor of SciSharp?! And will this move sideline F# if we are not careful?!

zyzhu commented 3 years ago

The visions discussed above sound wonderful. However, to be realistic, these visions need a proper business plan and sponsorship for long-term sustainability.

RStudio is behind the huge push to modernize R development since Hadley Wickham joined RStudio in 2013. https://insights.stackoverflow.com/trends?tags=dplyr%2Cggplot2%2Ctidyverse If you check a few tag trends on stackoverflow, Hadley's ggplot2 got popular since 2009, but tags such as tidyverse and dplyr only took off after 2016.

We already have a clear indication from Don about what the Microsoft team will focus on. I think the community should just organically improve the other individual components or create something from scratch. It is unrealistic to expect a holistic approach to building the whole data science ecosystem until a company like RStudio shows up. Maybe it will never show up.

Much of the magic in tidyverse relies on the flexibility of dynamic typing. I always feel bad seeing dplyr samples like the following. Achieving a similar result in Deedle would be super verbose https://dplyr.tidyverse.org/index.html

starwars %>% 
  mutate(name, bmi = mass / ((height / 100)  ^ 2)) %>%
  select(name:mass, bmi)

Where do name, bmi, mass, height come from? You don't need to declare anything; in R it just works, as they are column names in starwars. Or this https://genomicsclass.github.io/book/pages/dplyr_tutorial.html

msleep %>% 
    group_by(order) %>%
    summarise(avg_sleep = mean(sleep_total), 
              min_sleep = min(sleep_total), 
              max_sleep = max(sleep_total),
              total = n())

This kind of exploratory work is very productive for a statistician. Though the above lines are impossible to maintain, statisticians do not care, as productivity in exploration is the priority. I do not see any statically typed language beating this.

My point is that dynamically typed and statically typed languages each have their own pros and cons, and we have to use both if we are in the data science field.
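For contrast, here is a rough sketch of the same grouped summary with every column looked up explicitly by name, in plain Python without a data frame library. The rows are made up for illustration and are not values from the msleep dataset; only the column names `order` and `sleep_total` are taken from the dplyr snippet.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows for illustration; these are not values from the msleep dataset.
rows = [
    {"order": "Carnivora", "sleep_total": 12.0},
    {"order": "Carnivora", "sleep_total": 10.0},
    {"order": "Rodentia", "sleep_total": 14.0},
]

# group_by(order): collect the sleep_total values for each order.
groups = defaultdict(list)
for row in rows:
    groups[row["order"]].append(row["sleep_total"])

# summarise(...): one record of aggregates per group.
summary = {
    order: {
        "avg_sleep": mean(values),
        "min_sleep": min(values),
        "max_sleep": max(values),
        "total": len(values),
    }
    for order, values in groups.items()
}
```

Every column reference is a quoted key here, which is the verbosity that dplyr's unquoted column names avoid.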

WalternativE commented 3 years ago

Thanks for your insights @zyzhu. Yeah, I remember working in a small lab, that used R heavily before the RStudio people came along. It really wasn't as convenient as it is now (pretty chaotic if I remember correctly).

Regarding Microsoft's involvement, I see that interactive programming is getting a good push (both FSI and dotnet interactive, which I've been using daily since it was stable enough to do so). FSharp formatting, or fsdocs if that's the better name now, is also coming along, which is super important for documentation (one of the most important factors for me: I can go through source code but it will take me a while to get everything in my head). XPlot is already a library where I see a bit of friction. The CSBiology team already has a project which is currently living in the official plotly organization https://github.com/plotly/Plotly.NET. One can argue about the API, but it appears (at least from the Plotly side) more complete than XPlot. I'm also eagerly watching the strides DiffSharp makes (one of the most exciting movements in the .NET space right now, I'd say) and am all for backing the project. If there is a good ONNX story for DiffSharp, I can totally see using interactive environments to work on models, special computing environments for training, and ML.NET as a deployment target whenever a model goes into production (at least that is how it would work in my head; the usual disclaimer: I'm an enthusiast and no expert). Microsoft has its own ideas about how to work with rectangular data in .NET. I'm not entirely sure how that's going to play out, I'm not even sure if Microsoft knows.

All of these projects are pretty disconnected from SciSharp. Basically every project I mentioned (apart from interactive programming and plotting) has one (or multiple) alternatives in this organization. From what I've experienced so far they aren't really compatible with most of it. If there were some abstractions we shared, I could see this change. In R, for example, it is really easy to work with most libraries because the notions of a vector, a matrix and a dataframe are ingrained in everything. In the SciPy stack basically everything builds on top of NumPy, and whenever they use rectangular data they try to offer some sort of pandas interop. Right now, as I said before, we have quite a lot of linear algebra implementations, five (at least that's the number I'm aware of) major interpretations of rectangular data (three of them from Microsoft) and widely scattered statistics libraries. At least in the space of classical ML we're down to a narrow field, because some of the old contenders simply died out. Compatibility between all of those components isn't really there, so committing to one takes away developer mindshare from the others. I love the F# community, but we're simply not big enough to play zero-sum games.

Relating to R being more flexible because it is a dynamic language: yeah, that's right. The API is slicker due to not having to look out for types. As you mentioned, that makes working with it faster and the resulting analyses more brittle. I'm still pretty convinced that it is possible to get to a sweet spot where the API is, for the most part, statically typed and nice to work with. F# is capable of doing really impressive stuff with types, but I'm totally with you that it is not really possible to make it as convenient as the same API in a dynamic language. Still, I'm gladly taking on the burden of writing more explicitly typed code if I can get some guarantees that my code actually works (and keeps on working in the future). I don't want to bash R and/or Python, but reproducing an environment using either language is just so much pain. .NET is really good at working reliably (yeah, still my point even after the never-ending story of moving to .NET Core, One .NET, whatever name they're going with today).

So, to finally get to a point. I can live with not having a one-stop-shop. I am happy with there only being a core of very well maintained libraries and tools, that help me to do DS/ML in .NET. I just think it would be necessary to have a group of people with skin-in-the-game, that talk about some foundational parts and interop. Some place to look to if you want to align with the rest of the "scientifically minded" .NET community. Everything, that enables individual contributors to build something without reinventing the whole stack from the ground up.

I hope my "ramblings" make some sense to you all and maybe contribute a bit to the greater discussion. 🙇‍♂️

zyzhu commented 3 years ago

Just to pile on a bit more. R can work with Python through reticulate package backed by RStudio. RStudio acknowledges the benefits of integrating two complementary ecosystems. https://blog.rstudio.com/2019/12/17/r-vs-python-what-s-the-best-for-language-for-data-science/

"For individual data scientists, some common points to consider:

Python is a great general programming language, with many libraries dedicated to data science.

For organizations with Data Science teams, some additional points to keep in mind:

https://rstudio.com/solutions/r-and-python/ I heard from @cartermp's video that the roadmap of dotnet/interactive might consider interop with Python. I just hope Microsoft has an even grander vision.

If I were in their shoes and wanted to push ML and data science, Microsoft should just acquire RStudio and integrate more with the wider community, similar to the acquisition of GitHub to push OSS and the acquisition of Xamarin to push cross-platform development. Anyway, this is a bit off topic.

siavash-babaei commented 3 years ago

Dear @zyzhu, you are obviously right that there certainly are a great many serious hurdles, exacerbated by limitations of community size and corporate support. Then again, it is the question of the Egg and the Chicken: to expand the community and attract support, you would need a working blueprint with bells and whistles and features to get some meaningful investment. The issues surely need carefully thought-out technical, business, and implementation plans, but they are not by any means far-fetched or unattainable. An effort similar to the one that went into SAFE Stack could just as easily and usefully turn into an FsLab Stack. A lot of components are already there, as in SciSharp Stack or ML.NET, but could definitely do with F# sugar, which, however you cut it, is much less involved than developing from scratch. While it is true that RStudio and Hadley Wickham, with the addition of what is essentially the Tidyverse dialect, were a big boost for R, R was the biggest name in data science long before that, although the syntax was rubbish compared to now. In comparison, we have @dsyme and @tpetricek, and then we have Microsoft, which dwarfs RStudio by any measure. Microsoft has recently shown great willingness and initiative in data science, from the purchase of Revolution Analytics and turning it into Microsoft R, to the addition of Python more or less as a first-class citizen, and now bringing over GVR. So they are willing to spend the money and contribute to development. We just need to carefully craft an approach and message, utilize current assets within the company and take it from there. I do not expect things to be as easy and concise in F# as with dynamic languages, but F# syntax is inherently more mathematical and less verbose than Python or R in many ways, and the trade-offs in sound design, performant implementation and so on are well worth it. Frankly, I really liked Deedle, and it offered a lot of tidyr, dplyr and lubridate functionality, etc.
I used Deedle for a project ingesting, cleaning, and tidying large text files containing sensor data, resulting in ready-for-analysis CSV files. I did not experience problems, although there were a few operations I needed to write myself; overall it was very nice and usable, with an easy and fluid syntax that actually reminded me a lot of the R Tidyverse. For this project it was OK, since the output was essentially a set of data frames, or rather a data cube, perhaps exported to files. It would have been problematic, though, had I needed to do the analysis in F#, since nothing else understands Deedle data frames. All in all, my experience is that with F# we can have a one-stop-shop: even though it is strongly typed, with its naturally concise and mathematical syntax F# can still be very viable and pleasant when exploring data, as well as much more performant, safe, and easily maintainable when producing final results. This is actually something Deedle demonstrates beautifully: we can have strongly typed, smooth, concise code that lends itself perfectly to data science and exploratory steps. And frankly, no, we don't have to know both static and dynamic languages. What you get in reality is most people using R and Python because they can handle the entire data workload, whether it is basic data analysis, visualization/communication/presentation, or advanced modelling, be it in-memory or in-cloud, on CPU or GPU, with both languages having the necessary tools and delivering results with quick turn-around. We should keep in mind that while both R (bad core language design with horrible syntax, also slow) and Python (very OO and slow) are inherently poorly suited to modern data science, the whole design and philosophy of F# fits perfectly with data workloads, which are always about immutable data structures and Input->Function->Output, with perhaps only visualisations as major side effects.

nhirschey commented 3 years ago
matthewcrews commented 3 years ago

I would love to help with this project. I spent a lot of time in the R/tidyverse before becoming a full time F# dev. Having RStudio and a set of recommended libraries was a huge boost for early adoption. CRAN has a huge number of libraries that exist outside of the tidyverse that are still widely used (Ex: data.table). Having a prescription at the beginning streamlined people entering the ecosystem.

I am looking into creating a Developer Advocate role at my company so I can spend more of my time on accelerating adoption of F# for Machine Learning, Optimization, and Engineering. I know that having someone focus on easing the on-ramp to a language will go a long way in growing the community.

matthewcrews commented 3 years ago

I completely agree with @zyzhu when it comes to the productivity of Python/R. Dynamically typed languages have an "advantage" in that it is easy to start whipping together code quickly. Part of the reason I am so excited about the idea of Erased Discriminated Unions is that it addresses one of the pain points of a statically typed language compared to dynamically typed ones. My hope is that we can build on this, and potentially on an additional enhancement to Computation Expressions, to "re-capture" the productivity of R.

F# is a great language for Data Science/Analytics. I think with a few key enhancements, it could match the productivity of Python/R.

The Anaconda Python distribution solved the problem of "How do I get started?" I don't think F# should go that far, but having a curated list of recommended libraries and walkthroughs would go a long way toward streamlining adoption. The key metric I look at is, "How long does it take me to ingest an arbitrary CSV file, plot the data, and perform some kind of model fit?" Tightening that loop will be critical.

There are some "ergonomic" issues that could be improved with some key features that I think will make F# a more powerful language and better suited for Analytics/Modeling/Data Science.

cartermp commented 3 years ago

@WalternativE regarding this:

XPlot is already a library where I see a bit of friction. The CSBiology team already has a project which is currently living in the official plotly organization plotly/Plotly.NET. One can discuss about the API but it appears (at least from the Plotly side) more complete than XPlot

Since I'm the current maintainer of XPlot I'll say that I intend on people eventually moving over to Plotly.NET. We spoke with the team over at plotly and they're helping fund the effort. I think that XPlot is pretty good, but anything involving plotly should ultimately just use official bindings when a commercial entity like Plotly is interested in long-term maintenance. They were quite pleased with F# community activity in OSS and felt like it was a good investment.

So, where does that leave XPlot? Unsure, since the plotly package is the most-used and the most feature complete, and I'm likely to encourage people to move to Plotly.NET once it reaches 2.0.0. Its charter of having a consistent-ish API across different charting APIs remains unchanged and it's probably still a fine choice for several tasks.

Anyways, that's how I'd consider the issue of charting with Plotly as a backend moving forward.

WalternativE commented 3 years ago

@cartermp thanks for the update and thanks for your work on XPlot (and your work as our dearest PM while we're at it 🧡🧡🧡). It's wonderful to see that there are multiple parties working in unison to make a stable data visualization library for .NET.

cartermp commented 3 years ago

@nhirschey and others - how is FSharp.Stats - https://github.com/CSBiology/FSharp.Stats - when it comes to being a good statistical package? I agree that this is one of the biggest gaps now, and I'm also not aware of any plans for Microsoft to publish anything in this space.

siavash-babaei commented 3 years ago

Dear @cartermp . See this example from

// get coefficients of 3rd order regression polynomial 
let regressionCoefficients = Fitting.LinearRegression.OrdinaryLeastSquares.Polynomial.coefficient 3 x_Data y_Data

// get fitting function of 3rd order regression polynomial
let regressionFitFunc = Fitting.LinearRegression.OrdinaryLeastSquares.Polynomial.fit 3 regressionCoefficients 

A better design in my opinion, for example, would yield a single object from which one would then extract various bits per need.

Furthermore, why would one unnecessarily expose users to technical underpinnings, like the algorithm used for estimating model coefficients, that they would not need most of the time? I think people generally couldn't care less whether coefficients have been estimated using OLS or Maximum Likelihood or some form of Gradient Boosting or some entirely different beast.

let regModel = Fitting.LinearRegression.Polynomial.fit 3 x_Data y_Data

let regCoeffs = regModel.Coefficients
let regDiags  = regModel.DiagnosticPlots
let regEstms  = regModel.Estimates
...

Regression is a staple machine learning technique. It would be better as part of the ML workload. I think ML.NET already offers similar capabilities. Otherwise the library seems good.
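To make the "single object" idea a bit more concrete, here is a minimal sketch in plain F#. Everything in it is an assumption made for illustration: the `PolynomialModel` record, its field names, and `Polynomial.fit` are hypothetical and are not part of FSharp.Stats or ML.NET. It solves the normal equations directly, which is only reasonable for low polynomial orders and small, well-scaled data.

```fsharp
// Hypothetical sketch: none of these names come from FSharp.Stats or ML.NET.

/// Everything a caller might want from one fit, bundled in a single value.
type PolynomialModel =
    { Order: int
      Coefficients: float[]      // c0 + c1*x + c2*x^2 + ...
      Predict: float -> float
      Residuals: float[]
      RSquared: float }

module Polynomial =

    /// Solve A x = b by Gaussian elimination with partial pivoting.
    let private solve (a: float[,]) (b: float[]) =
        let n = b.Length
        let a = Array2D.copy a
        let b = Array.copy b
        for col in 0 .. n - 1 do
            // Swap in the row with the largest pivot for numerical stability.
            let pivot = [ col .. n - 1 ] |> List.maxBy (fun r -> abs a.[r, col])
            if pivot <> col then
                for k in 0 .. n - 1 do
                    let t = a.[col, k]
                    a.[col, k] <- a.[pivot, k]
                    a.[pivot, k] <- t
                let t = b.[col]
                b.[col] <- b.[pivot]
                b.[pivot] <- t
            // Eliminate the entries below the pivot.
            for row in col + 1 .. n - 1 do
                let f = a.[row, col] / a.[col, col]
                for k in col .. n - 1 do
                    a.[row, k] <- a.[row, k] - f * a.[col, k]
                b.[row] <- b.[row] - f * b.[col]
        // Back substitution.
        let x = Array.zeroCreate<float> n
        for row = n - 1 downto 0 do
            let mutable s = b.[row]
            for k in row + 1 .. n - 1 do
                s <- s - a.[row, k] * x.[k]
            x.[row] <- s / a.[row, row]
        x

    /// Least-squares polynomial fit via the normal equations (X^T X) c = X^T y.
    let fit (order: int) (xs: float[]) (ys: float[]) : PolynomialModel =
        let m = order + 1
        let xtx = Array2D.init m m (fun i j -> xs |> Array.sumBy (fun x -> pown x (i + j)))
        let xty = Array.init m (fun i -> Array.map2 (fun x y -> pown x i * y) xs ys |> Array.sum)
        let coeffs = solve xtx xty
        // Horner evaluation of the fitted polynomial.
        let predict x = Array.foldBack (fun c acc -> c + x * acc) coeffs 0.0
        let residuals = Array.map2 (fun x y -> y - predict x) xs ys
        let meanY = Array.average ys
        let ssRes = residuals |> Array.sumBy (fun r -> r * r)
        let ssTot = ys |> Array.sumBy (fun y -> (y - meanY) * (y - meanY))
        { Order = order
          Coefficients = coeffs
          Predict = predict
          Residuals = residuals
          RSquared = 1.0 - ssRes / ssTot }

// Usage: data generated from y = 1 + 2x + 3x^2, so the fit should recover those coefficients.
let xData = [| 0.0; 1.0; 2.0; 3.0; 4.0 |]
let yData = xData |> Array.map (fun x -> 1.0 + 2.0 * x + 3.0 * x * x)
let regModel = Polynomial.fit 2 xData yData
printfn "coefficients: %A, R^2: %f" regModel.Coefficients regModel.RSquared
```

The point is the shape of the result rather than the solver: the caller asks for a fit once and then pulls coefficients, predictions, or diagnostics off the same value, without having to know which estimation method sits behind it.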

cartermp commented 3 years ago

Regression is a staple machine learning technique.

Oh for sure, and I think that the lines can sometimes get a little blurred between what is machine learning and what is plain old statistics. But I'm more curious if the library looks as if it could service some of the needs that @nhirschey is speaking towards. The shape of the API could change over time, especially if there is feedback indicating that it's conceptually challenging. I think the key thing is that there are APIs available, though, no matter how strange they may feel.

lqdev commented 3 years ago

Apologies ahead of time for the brain dump.

Love this discussion and would like to contribute in any way to help move this forward. Some of my (biased) thoughts:

I'm not sure that a one-stop-shop solution can address all of the individual steps in the workflow. For example, in Python, you do your data wrangling with Pandas or Spark, but maybe model with SparkML/Scikit/PyTorch. But having something like FsLab, or maybe now SciSharp, as a collection, or the proposed SAFE stack for ML, is definitely worth doing, so long as these libraries embrace interop and extensibility built around standards within the .NET ecosystem as well as the overall DS/ML ecosystem. I'm not sure if this is the best way to go about it, but I've created a Gist listing the steps in the data science / ML workflow and the .NET libraries that support each of them. I'm happy to take suggestions on where something like that should live, and to help add to it to get the curation going for actively maintained libraries.

matthewcrews commented 3 years ago

@lqdev I really appreciate hearing your thoughts, and they echo many of the concerns that have come to my own mind. I think F# is well positioned to be a "full-workflow" language for Data Science/ML. This includes data wrangling, transformations, modeling, testing, and finally deployment. Python/R have optimized the Data Scientist workflow (from data import to validated model), but they are poor when it comes to putting a model into production. Python/R are great languages, but they were not built with production deployments in mind first and foremost (I'm sure I will catch grief for this). The jump from development to production is non-trivial (runtime and library versions can be a real nightmare, not to mention debugging, logging, error reporting, refactoring, etc.).

I realize that my initial comments were off topic from what @dsyme was originally asking for, so my apologies. I believe Python/R have benefitted from a "standard" set of beginner libraries. 90% of an analyst's work can be done with those initial libraries. If a beginner's needs are not met by those initial libraries, it is easy to reach out to the community to find a more apt library (Ex: data.table for data munging). I think it would be worthwhile to have a curated list of "basic" libraries which cover the basic workflows for Data Science/Analytics in the F# stack.

Long term, I think we should consider having someone advocate for the "ergonomics" of F# for Data Science/Analytics/Statistics. It goes beyond just having the necessary libraries. The workflow of an analyst is fundamentally different from that of a software developer. I don't think these things need to be at odds, though. I think there can be a synthesis of needs. Much of what I do at Quicken these days is taking the needs of disparate departments and finding the solution that integrates these requirements. There are a couple of key features which could really close the gap with Python/R when it comes to analyst productivity.

I would love to help ease the onboarding of new F# developers however I can. I'm working with my company to try to create an F# advocate position internally. We will see how that goes. We benefit a lot from the F# language, so I hope to make the case that investing in the community should be a first-class investment for us.

nhirschey commented 3 years ago

@cartermp, I wasn't familiar with FSharp.Stats before this thread. Looking through it, it has nice F# syntax for descriptive statistics like Seq.stdDev, etc. That is useful and they're building out a good foundation. I applaud their work.

However from what I can tell, like ML.NET, MathNet.Numerics, Accord, etc., the regression modelling is currently inadequate for research purposes. There are no facilities for panel data models (https://en.wikipedia.org/wiki/Panel_data) or time series models with appropriate standard errors. I had hoped that ML.NET would develop these, but it still only has facilities for plain-vanilla standard error calculations in regressions. This is why despite preferring F# I always have to drop back into R whenever I get to the modelling stage of analysis.

I know this is a chicken/egg problem. And I'm thrilled with the F# 5.0 scripting and notebook improvements. I only point it out in the spirit of "what are some things holding back F# in the data science area."

kMutagene commented 3 years ago

Very interesting thread, lucky that I found out about it via Twitter 😄. This post may get quite long; I hope that it can add some value to the discussion.

Seems like the time where we from @CSBiology can chime in and try to contribute to this conversation.

Within .NET, F# is the language for analytical workflows. I, as I'm sure many of you do, strongly believe this to be the case.

This has been our central dogma for 5+ years now. We are at our core a bioinformatics workgroup at a German university. We use F# for data analysis and research. We love its syntax, immutability by default, and functional-first approach. We believe that for these reasons it is perfect for the general workflows that we see every day, where you basically have several (complex) mapping and folding steps over data to produce results. This is also the reason why we prefer it over both R and Python, which are the de facto 'standard' in our field due to majority vote.

When we started using F# extensively (and now almost exclusively), we saw a lack of, let's call them standard libraries for our kind of application, the most important ones being plotting and (descriptive) statistics.

Therefore, we were often forced to write our own functionality for specific tasks in our analysis scripts, which eventually resulted in libraries like BioFSharp (our main bioinformatics library), FSharp.Stats, and FSharp.Plotly (which is now Plotly.NET directly maintained under the plotly organization). I am aware that alternatives for some of these libraries were developed in parallel, but the general workflow for data analysis that we are using is the following:

The reason why these libraries may be lacking functionality in specific regards or cover more than the name might suggest is that we improve them on a use-case basis, meaning as long as we don't need new functionality in our analysis pipelines, those features keep a low priority. FSharp.Stats is a good example here, as it contains e.g. pretty extensive fitting and signal processing features you might not expect from just the name, but may be lacking statistical features as indicated by other posts here.

What all of this basically boils down to is the following: we gladly contribute to the F# OSS community, but may not have the time and resources to make these libraries feature complete. We would however love to have more contributors and help shape the data science landscape in F# and .NET. This may also be by integrating our projects into existing ones.

bvenn commented 3 years ago

@cartermp @nhirschey Our idea for FSharp.Stats, as also laid out by @kMutagene, emerged for the reason you already mentioned: there was a lack of statistical packages compatible with our F# data analysis workflows. Step by step, we aim to identify and implement basic as well as higher-level statistical functions/algorithms applicable to scientific data analysis. As @kMutagene already mentioned, the project is extended whenever further functionality is required to analyze our data.

FSharp.Stats aims to cover topics such as:

Regarding data fitting, FSharp.Stats covers simple linear regression with confidence and prediction band determination, polynomial regression, and cubic/smoothing spline regression. Models for nonlinear regression cover several growth curve models and other common models such as Gaussian, exponential, or logistic functions.

If you have suggestions for panel data regression strategies or any other, we would be happy to consider it for future extension of the FSharp.Stats library.

nhirschey commented 3 years ago

@bvenn and @kMutagene, sounds good. I'll do some work with your library to get familiar and then raise some issues in your dev repo where we can discuss how to collaboratively improve the feature set and api ergonomics.

kMutagene commented 3 years ago

@cartermp

Regarding all things plotly (I maintain both Plotly.NET and Dash.NET):

So, where does that leave XPlot? Unsure, since the plotly package is the most-used and the most feature complete, and I'm likely to encourage people to move to Plotly.NET once it reaches 2.0.0.

That sounds like a good plan. Contributions from our side follow the same rules as I pointed out above, so help in reaching that 2.0.0 milestone is definitely wanted. I think the strength of XPlot also comes from it offering additional charting APIs.

Becoming feature complete in Plotly.NET is ultimately a question of metaprogramming, as it is impossible for me alone (or a small group of OSS contributors) to keep up with the changes coming from plotly.js by implementing them by hand. So maybe tagging @Shmew here, as generation of the Plotly Feliz bindings seems to be automated already. Plotly offers a JSON schema for all charts and style parameters, so it is in theory possible to auto-generate Plotly.NET bindings from that, maybe via something like Myriad.
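As a rough illustration of what schema-driven generation means here, the sketch below reads a small JSON description and emits F# source text as strings. It is a toy under stated assumptions: the schema literal is made up and heavily simplified (it is not the real plotly.js schema), the `fsharpType` mapping and the `...Style` record naming are hypothetical, and this is neither how Feliz.Plotly nor Myriad actually work. Only `System.Text.Json` calls from the .NET base library are used.

```fsharp
open System.Text.Json

// Toy schema for illustration only; NOT the real plotly.js schema.
let schemaJson = """
{
  "scatter": {
    "opacity": { "valType": "number" },
    "name":    { "valType": "string" },
    "visible": { "valType": "boolean" }
  }
}"""

/// Upper-case the first character so JSON keys become F# type and field names.
let capitalize (s: string) =
    if s.Length = 0 then s
    else string (System.Char.ToUpperInvariant(s.[0])) + s.Substring(1)

/// Map a schema value type to an F# type name (hypothetical mapping).
let fsharpType (valType: string) =
    match valType with
    | "number" -> "float"
    | "string" -> "string"
    | "boolean" -> "bool"
    | _ -> "obj"

/// Emit F# source text: one record type per top-level trace in the schema.
let generate (json: string) =
    use doc = JsonDocument.Parse(json)
    [ for trace in doc.RootElement.EnumerateObject() do
        yield sprintf "type %sStyle =" (capitalize trace.Name)
        let fields =
            [ for prop in trace.Value.EnumerateObject() do
                let valType = prop.Value.GetProperty("valType").GetString()
                yield sprintf "%s: %s option" (capitalize prop.Name) (fsharpType valType) ]
        yield sprintf "    { %s }" (String.concat "; " fields) ]
    |> String.concat "\n"

printfn "%s" (generate schemaJson)
```

A real generator would also have to deal with nested attributes, enumerations, and documentation strings, which is where most of the effort goes.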

@nhirschey regarding static image export: Plotly now uses Kaleido for static image export from charts, which I already wrapped in a quick-and-dirty POC here, but I just have not had the time to finish it and incorporate it into Plotly.NET and Kaleido.

Then there is Dash.NET, which I am currently working on. The Python version of Dash offers Jupyter integration, which is also the ultimate goal for the .NET version with regard to dotnet interactive. I think when that project matures to this state, we will have a pretty nice toolbox for visualization in notebooks.

In general I think that the base work has been done on many ends, and now we need community effort to bring all these ends together, be it a curated package, or just a loose set of components that work well together.

Shmew commented 3 years ago

Yeah, Feliz.Plotly generates about 95% of the bindings from the JSON schema. Currently it's all done by building strings, so that I could get something released. If the list-style API isn't considered too foreign for normal F# code, I think it should be possible to build a generation library that's implementation independent. We've discussed it a little bit in an issue in the XPlot repo. I am quite confident in the end result, as I'm currently using it in a production Fable app, and it has all the plotly.js examples implemented in the docs.

siavash-babaei commented 3 years ago

Ok, so this has taken a sharp turn towards serious Plotly discussion. Maybe we steer it back towards FsLab and its present and future.

dsyme commented 3 years ago

SciSharp doesn't really have F# on their radar, at least it seems to me that way.

I have discussed this with Haiping, who is one of the technical originators and guiding forces behind SciSharp.

He is very open to aligning F# work into SciSharp, by

  1. adding F# samples/documentation
  2. changing the SciSharp messaging to be about "C# + F#" or "F# + C#" as documentation becomes available.
  3. bringing some F# community projects under that umbrella where appropriate

He has made me co-admin on https://github.com/SciSharp to help facilitate changes. Of course I can't make all these changes myself - but it helps in adjusting messaging, getting multiple perspectives etc.

So basically the door is open on the SciSharp side to work towards merging some efforts here towards a win-win for .NET, F# and C# in this space.

This of course includes continued cooperation and communication with other projects both in and out of the SciSharp umbrella. For example, I'm working on DiffSharp, which is not under the SciSharp umbrella.

This is not a full solution, but it gives us options which change the status quo.

matthewcrews commented 3 years ago

I'm all for merging efforts with SciSharp. My main concern will be whether the SciSharp libraries will have APIs that are F# friendly. One of the pain points of ML.NET has been that the API does not play well with F# paradigms, at least that has been my experience. I believe wrapping OO method calls in F# functions is frowned upon so I am unsure of the best way to address the style differences of C# and F#.
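For anyone unsure what "wrapping OO method calls in F# functions" would look like in practice, here is a minimal sketch. The `PipelineBuilder` class is a hypothetical stand-in for a C#-style fluent API (it is not an ML.NET or SciSharp type), and the `Pipeline` module is just one possible shape for a thin wrapper layer, not an established convention.

```fsharp
// Stand-in for a C#-style fluent API; hypothetical, not an ML.NET or SciSharp type.
type PipelineBuilder(steps: string list) =
    new() = PipelineBuilder([])
    member this.Normalize(column: string) =
        PipelineBuilder(steps @ [ sprintf "normalize(%s)" column ])
    member this.Concatenate(output: string, inputs: string[]) =
        PipelineBuilder(steps @ [ sprintf "concat(%s <- %s)" output (String.concat "," inputs) ])
    member this.Steps = steps

/// Thin wrappers: curried, with the builder as the last argument, so they compose with |>.
module Pipeline =
    let empty = PipelineBuilder()

    let normalize (column: string) (p: PipelineBuilder) =
        p.Normalize(column)

    let concatenate (output: string) (inputs: string list) (p: PipelineBuilder) =
        p.Concatenate(output, List.toArray inputs)

// The OO calls are still there underneath, but call sites read like ordinary F#.
let pipeline =
    Pipeline.empty
    |> Pipeline.normalize "Age"
    |> Pipeline.concatenate "Features" [ "Age"; "Income" ]

printfn "%A" pipeline.Steps
```

The cost is an extra layer to keep in sync with the underlying library, which is presumably part of why the approach gets mixed reactions.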