dotnet / runtime

DataFrame Discussion #24920

Closed · MgSam closed this 4 years ago

MgSam commented 6 years ago

With work underway on Tensor, and the new perf optimizations available via Span, I think it's time the .NET team seriously considered adding a DataFrame type. Working with large amounts of data has become critical in many applications these days, and is the reason libraries like Pandas for Python have been so successful. It's time to bring similar capabilities to .NET, and it needs to start with the framework itself adding a DataFrame type.

There are DataFrame libraries available for .NET, but the problem is that each has its own DataFrame implementation, minimally compatible with the BCL or with any of the other libraries out there. In my experience, support for many of these libraries is also pretty weak.

I think the BCL team implementing its own DataFrame is the first, most important step to improving the state of working with data in .NET. All we have right now is DataTable, a powerful but ancient type that is not well optimized for many scenarios and which I've variously seen .NET team members refer to as "legacy".

I'm creating this issue to garner opinions on the topic. I don't have a specific API in mind to propose at this time, as gauging interest first is probably more important.

Let's make .NET as great for data analysis as any other platform.

danmoseley commented 6 years ago

Great to get community interest. cc @eerhardt @ericstj who have been thinking about this.

mgravell commented 6 years ago

Could you define "DataFrame" for the purpose of this discussion? Because that is a vague, overloaded, and open-ended term - it would be good to know the intent as it relates to this discussion. For example, pandas defines it as:

Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects.

which sounds ... awful :) I can't think of many scenarios where that would be preferable to, say, a List<T> or Span<T>...

eerhardt commented 6 years ago

which sounds ... awful :) I can't think of many scenarios where that would be preferable to, say, a List or Span...

I think the scenarios come from a more "data scientist" and Machine Learning angle. Take a look at https://github.com/ageron/handson-ml/blob/master/02_end_to_end_machine_learning_project.ipynb starting with the `In [5]: housing = load_housing_data()` line.

Imagine trying to explore a data set (loaded from a .csv or otherwise) like this in .NET code, and how much more code you would need to write to accomplish something like:

housing["ocean_proximity"].value_counts()

mgravell commented 6 years ago

that sounds like something that LINQ would excel at, with little more than a few helper extension methods - i.e.

var counts = housing.ValueCounts<string>("ocean_proximity");
// the <string> here just makes the return type stronger, could be done without

with an extension method something like below (completely untested). Note: I'm totally not a fan of DataSet for most purposes - I'm just trying to understand what problem this suggestion is trying to solve, and how it would be different to DataTable. I'm very cautious of the idea of adding another type that is basically the same as DataTable to represent ad-hoc data without a very good reason. If the reason is that DataTable is too heavy and has a lot of things (like change tracking, etc) that never apply, then that's great - I just think having a very clear statement of the problem is a prerequisite to having a good solution.

using System.Data;
using System.Linq;

// Counts occurrences of each distinct value in the named column.
public static Dictionary<TKey, int> ValueCounts<TKey>(this DataTable table, string columnName)
{
    var col = table.Columns[columnName];
    return table.Rows.Cast<DataRow>()
        .GroupBy(row => row.Field<TKey>(col))
        .ToDictionary(g => g.Key, g => g.Count());
}

Edit: personally, I'd much rather use a class HousingData POCO that looks like the shape of the data, i.e. has a Longitude, Latitude etc - along with great tools for loading that; but I acknowledge that not all scenarios fit the "we know the shape ahead of time" world.

MgSam commented 6 years ago

@mgravell I think the Pandas definition is as good as any.

You're right, it is very similar to a DataTable. However, DataTables are heavy, not generic (and thus box any value types), and do not support many of the useful properties of DataFrames.

I love LINQ, but it is a different kind of abstraction. Common operations with DataFrames include aligning columns, performing operations between columns, joining frames, etc. Writing ad-hoc types or using LINQ for this is much more verbose and time consuming. A key difference between DataFrames and LINQ is that you work with whole columns or frames of data, rather than with individual items as you do in LINQ.

I strongly suggest looking at some example usage in Pandas or checking out Deedle. (Deedle has a pretty good implementation of a DataFrame, but it is written in F# and thus uses a lot of F#-isms that are problematic for other .NET languages; it also appears to be essentially abandoned.)

ghost commented 6 years ago

but it is written in F# and thus uses a lot of F#-isms that are problematic for other .NET languages

Can you explain a bit what an F#-ism is and what those problems are? If I am relying on a NuGet package with a .NET assembly, I shouldn't care about the source language the IL was produced from. Their usage docs have C# examples: http://bluemountaincapital.github.io/Deedle/csharpseries.html

MgSam commented 6 years ago

@kasper3 F# adds a lot of its own types and conventions on top of the .NET Framework, in FSharp.Core.dll. You absolutely can use it from C#, and it is very powerful, but you quickly run into a lot of annoying impedance mismatches.

The biggest one is that F# has an Option<T> type, so Deedle is written to work with Option<T> rather than nullables. This means that for a double? you have three different ways to say "null": Double.NaN, null, and Option<double?>.None. It makes analysis super annoying and error prone.
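A rough sketch of the mismatch (illustrative only; the variable names are made up, and only FSharp.Core needs to be referenced):

using System;
using Microsoft.FSharp.Core; // from the FSharp.Core NuGet package

// Three different ways a "missing" double can show up when mixing
// Deedle/F# code with C#:
double? fromBcl = null;                         // BCL nullable: no value
double sentinel = double.NaN;                   // NaN used as a data sentinel
FSharpOption<double> fromFSharp = FSharpOption<double>.None; // F# option

Console.WriteLine(fromBcl.HasValue);            // False
Console.WriteLine(double.IsNaN(sentinel));      // True
Console.WriteLine(fromFSharp == null);          // True: None is null at runtime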

Though I digress. Really I'm just trying to show why there is a need for a DataFrame type to be baked into the core framework.

mattdot commented 6 years ago

I've been trying to build a UWP app which reads data from IoT sensors, and I've been banging my head against this exact problem. I can't use pandas or R because I'm in UWP-land, and C# doesn't have a decent DataFrame API to use MathNet with the way I'd use SciPy with Pandas. I'm pretty close to just abandoning .NET and building the whole thing in Python with the desktop bridge. Unfortunate, but there's really a gap in this area where Python and R excel. This would obviously be most useful in server-side ML pipelines, but I wanted to highlight how it is also useful to client developers, particularly when they want to pre-process data they're feeding into ONNX models for WindowsML.

voltcode commented 6 years ago

@mattdot check out the issue I referenced; it seems there's a new dataframe-ish abstraction called IDataView in the just-open-sourced ML.NET library. Maybe it can be useful to you?

ghost commented 6 years ago

I'm pretty close to just abandoning .NET and building the whole thing

If there is some API shape proposal that you have in mind that would help your use case, please do share; that will increase the chances of moving this discussion forward.

veikkoeeva commented 6 years ago

@MgSam So DataFrame is column oriented instead of row oriented? Do chime in to https://github.com/dotnet/machinelearning to share your views. Why not @mattdot also, maybe that's doable already.

MgSam commented 6 years ago

@veikkoeeva I'm not sure column-oriented vs row-oriented is the right description. It's a tabular data structure. It definitely makes column-based operations easy and much more readable. This definition for an R data frame is pretty good.

As soon as I get some free time I intend to play around with the IDataView.

danmoseley commented 6 years ago

@glebuk @TomFinley

glebuk commented 6 years ago

ML.NET has the concept of an "IDataView" (IDV). We are porting documentation for it in PR 173, which we hope provides some context for this discussion. It is similar to a DataFrame but with a significant difference: a DataFrame is a fully materialized, schematized data table in memory that supports random access, while an IDV is more like a SQL view. It is also a schematized view of data, but it is accessed with cursors that allow streaming. While a DataFrame is very good for ad-hoc data exploration, IDV is optimized to be a very efficient, scalable (infinite rows by billions of features) data plane for machine learning pipelines.
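A minimal sketch of that cursor model, written against the released Microsoft.ML package (the shape below follows the current public API and may differ in detail from the documentation PR):

using System;
using Microsoft.ML;
using Microsoft.ML.Data;

public class Row { public float Value { get; set; } }

class Program
{
    static void Main()
    {
        var ml = new MLContext();
        IDataView data = ml.Data.LoadFromEnumerable(new[]
        {
            new Row { Value = 1f }, new Row { Value = 2f }
        });

        // No random access: rows are consumed through a forward-only cursor.
        DataViewSchema.Column column = data.Schema["Value"];
        using var cursor = data.GetRowCursor(new[] { column });
        var getter = cursor.GetGetter<float>(column);
        float value = default;
        while (cursor.MoveNext())
        {
            getter(ref value);
            Console.WriteLine(value);
        }
    }
}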

bryanrcarlson commented 6 years ago

Please excuse my extreme naivete, but does a .NET Core implementation of Apache Arrow relate to this?

Wes McKinney recently posted on his blog announcing Ursa Labs, which would open .NET up to a whole new crowd if there were collaboration.

voltcode commented 6 years ago

@bryanrcarlson I mentioned Apache Arrow as a means for ML.NET to integrate with R and Python - check out https://github.com/dotnet/machinelearning/issues/69 if you're interested

mungojam commented 6 years ago

I've been a C# programmer for 10 years, but in the last few years have learnt R and now Python. I love C#, but it is really lacking in this area when compared with Pandas in Python and dplyr/tidyr in R.

Those tools make it so easy to work with tabular data, and for people we want to move out of Excel for LoB data tools, they are a much easier sell than C# at present.

Both R and Python benefit here during the exploration stage by being dynamically typed. They also benefit from having a really good REPL story.

I think a neat C# implementation would start in the dynamic world using ExpandoObject or similar and would need good REPL runtime intellisense. It would then let coders generate the statically typed classes and code when they are ready to build an actual tool. The readr package in R applies a similar approach when reading CSV files: it can auto-detect the column data types when you first import a file, but then prints out the R code you should put in to specify those types explicitly (and encourages you to do so).

The column vs. row thing is important for performance. Columnar data is generally the way things are going for performance reasons, e.g. the Parquet data format. A lot of the operations people do on data tables benefit from using SIMD along a column of contiguous memory data.
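As a small illustration of that point, here is a sum over a double column stored as one contiguous array, using System.Numerics.Vector<T> (a sketch, not a benchmark):

using System;
using System.Numerics;

static class ColumnOps
{
    // Processes Vector<double>.Count lanes per iteration over the column.
    public static double Sum(double[] column)
    {
        var acc = Vector<double>.Zero;
        int width = Vector<double>.Count;
        int i = 0;
        for (; i <= column.Length - width; i += width)
            acc += new Vector<double>(column, i);         // one wide load + add
        double sum = Vector.Dot(acc, Vector<double>.One); // reduce the lanes
        for (; i < column.Length; i++)                    // scalar tail
            sum += column[i];
        return sum;
    }
}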

MovGP0 commented 6 years ago

I agree that a DataFrame would be useful, especially in the Data Science and Machine Learning area.

It's basically an n-dimensional tensor (could be a scalar, vector, matrix, or higher level tensor) which has label-vectors for the individual dimensions.

The labels should be preserved as far as possible when mathematical operations are applied.

MovGP0 commented 6 years ago

I guess a DataFrame type need not (but could) be part of the core library. Using something like the Deedle NuGet package should do the job nicely, especially when you can convert from and to the Tensor type.

totalgit74 commented 6 years ago

I arrived at this discussion as I too am searching for a suitable data analysis framework. Deedle is OK but suffers very bad performance on even moderately large datasets. Extreme Optimization has an implementation that is very fast from what I have played around with. I mainly use R for data analysis, particularly the data.table class, which has extremely good performance scaling into the hundreds of millions of rows.

The problem I'm finding is that I want the performance and flexibility of data.table, but in a language like C# where I can implement financial instruments (from simple swaps to structured deal types) much better: OO, extension methods, interfaces, abstract classes, etc. The lack of a suitable implementation in C# means I am forced into using S4 classes and OO in a functional language. Another issue is the lack of a decent IDE; I still think nothing comes close to the abilities of the MS IDEs.

danmoseley commented 6 years ago

Adding to ML.NET "project" just to keep track (no decision implied but this is a good discussion)

hrichardlee commented 6 years ago

I am extremely in favor of this! Here goes a long post; thank you to anyone who gets through it all.

API shape proposal-ish

To try to move the discussion forward, I'll put forward something that looks kind of like an API shape proposal. It's actually just a few core operations from pandas, but hopefully it gives a sense of what operations a real API shape proposal would have to cover. A real API shape proposal would need to be much more C#-like and cover a much wider set of scenarios. I'm using pandas here, but Matlab and R basically smell the same, the way that C/C++, Java, and C# all smell the same.

First I'll define a dataframe to make all of the examples actually runnable.

import pandas as pd
df = pd.DataFrame([[1, 2, 'a', pd.Timestamp('2016-01-01')],
                   [3, 4, 'b', pd.Timestamp('2016-01-02')],
                   [5, 6, 'c', pd.Timestamp('2016-01-03')],
                   [7, 8, 'a', pd.Timestamp('2016-01-04')],
                   [9, 10, 'b', pd.Timestamp('2016-01-05')],
                   [11, 12, 'c', pd.Timestamp('2016-01-06')]],
                  columns=['int1', 'int2', 'strs', 'dts'])

Here are the core operations that a dataframe needs to be able to support:

Selection:

# this selects just the rows where int1 is even (selecting a subset of columns isn't shown here)
df[df['int1'] % 2 == 0]

Indexes for fast selection:

# efficiently select just the rows between these two dates
df = df.set_index('dts')
df.loc[pd.Timestamp('2016-01-01'):pd.Timestamp('2016-01-04')]

Projection and Append:

# for each row, this multiplies int1 by 2 and adds int2 (i.e. a projection)
# and puts that new value in the int3 column (i.e. appending a column)
# appending a row is not shown
df['int3'] = (df['int1'] * 2) + df['int2']

Assignment:

# updates the int1 column in row 2 to be 100
df.loc[2, 'int1'] = 100

(this seems a little contrived, but this is the building block for common operations like replacing all NaNs with 0)

GroupBy:

df2 = df.groupby('strs', as_index=False)['int1'].sum()

> strs int1
> a       8
> b      12
> c      16

Row/column labels:

# columns and rows have names that can be changed
df2 = df2.rename(columns={'int1': 'int1_summed'})

Join:

pd.merge(df, df2, how='inner', on='strs')

Sort:

df.sort_values(by=['strs', 'int1'])

There are other things, like reading/writing CSVs and SQL tables, resampling and other time-related functions, rolling window calculations, more advanced selection, pivoting and transposing (e.g. see the table of contents at https://pandas.pydata.org/pandas-docs/stable/), but I think those will fall into place fairly easily once the core operations described above are defined. No comment here on dataframes being mutable/immutable (some of the syntax I showed here strongly suggests mutability because that's how pandas/Matlab/R work, but I don't think that has to be the case).

How is this different from LINQ on a List<T> of POCOs?

// e.g. a hand-rolled full outer join on a key column ("on"), sketched in LINQ:
var ysDict = ys.ToDictionary(y => y[on], y => y);
var xsDict = xs.ToDictionary(x => x[on], x => x);
xs.Select(x => (x, ysDict.TryGetValue(x[on], out var y) ? y : null))
    .Concat(ys.Where(y => !xsDict.ContainsKey(y[on])).Select(y => (null, y)));

(and this hasn't even solved the appending columns problem!)

How is this different from DataTable?

In addition to the good points above by @mgravell (heavy, i.e. change tracking) and @MgSam (boxes all value types), I'll also point out that DataTable doesn't have the required richness. I'm not super familiar with DataTable, so I might make some mistakes here, but I believe it doesn't support joins, renaming columns, indexes, assignment to a single cell, or any of the "bonus" things I mentioned above as non-core functionality, like resampling and other time-related functions, pivoting and transposing.
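The boxing point, at least, is easy to demonstrate: every cell that crosses DataTable's public API goes through object.

using System;
using System.Data;

var table = new DataTable();
table.Columns.Add("x", typeof(double));
table.Rows.Add(1.5);                  // boxed into the params object[] here

object cell = table.Rows[0]["x"];     // the row indexer returns object
double x = (double)cell;              // so every read has to unbox and cast
Console.WriteLine(x);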

Why should dataframes be added to .NET?

A full example of how dataframes are used in real life

I'll go through a real-ish example here in pandas to demonstrate how dataframes are used and how C#/.NET as it stands are inadequate for the use case. Let's say you have a financial dataset as a CSV that has an "id" column (which might represent a stock or a bond), a "date" column, and a "value1" and "value2" column. Let's say you have a second dataset that contains returns for "id"s and "date"s. The first thing you need to do is read the CSVs, but I'm going to start with some synthetic data for convenience:

import random
import string
import numpy as np
import pandas as pd
import sklearn.linear_model

def random_string(n):
    return ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(n))

ids = np.array([random_string(6) for _ in range(900)])
dates = pd.date_range('2013-01-01', '2018-06-01')
df1 = pd.DataFrame({'id': np.tile(ids, len(dates)),
                    'date': np.repeat(dates, len(ids)),
                    'value1': np.random.rand(len(ids) * len(dates)),
                    'value2': np.random.rand(len(ids) * len(dates))})

df2 = pd.DataFrame({'id': np.tile(ids, len(dates)),
                    'date': np.repeat(dates, len(ids)),
                    'returns': np.random.rand(len(ids) * len(dates))})

Here's what df1 looks like:

>>>               date      id    value1    value2
>>> 0       2013-01-01  132ZXV  0.769004  0.237304
>>> 1       2013-01-01  HLX7J7  0.241722  0.273988
>>>             ...     ...       ...       ...
>>> 1780198 2018-06-01  N5EIUS  0.655888  0.220536
>>> 1780199 2018-06-01  MF8YVT  0.458046  0.118847             

Let's say you want to replace any NaNs in the data with 0.0, multiply value1 and value2 together, take a 5-observation rolling sum of that per id, take the zscore of that for each date, join that against the returns, and do a rolling 250-day linear regression on each day. Here's what that would look like in pandas:

df1 = df1.fillna(0.0)
df1['value3'] = df1['value1'] * df1['value2']
# there's some ugly cruft here--no one said pandas is perfect!
df1['value3_rolling_sum'] = df1.groupby('id')['value3'].rolling(5).sum() \
    .reset_index().drop('id', axis=1).sort_values(by='level_1').set_index('level_1')['value3']
df1['value3_zscore'] = df1.groupby('date')['value3_rolling_sum'].apply(lambda gr: (gr - gr.mean()) / gr.std())
merged = pd.merge(df1, df2, how='left', on=['id', 'date'])
merged = merged.set_index('date')
models = []
for d in merged.index.unique():
    regression_inputs = merged.loc[d - pd.DateOffset(250):d]
    model = sklearn.linear_model.LinearRegression()
    models.append(model.fit(regression_inputs[['value3_rolling_sum']], regression_inputs['returns']))

This is very long, but I wrote out the roughly equivalent thing in C# to try to bring home my point:

// prepare synthetic data
var random = new Random();

var alphanumeric = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
string RandomString(int n)
    => new string(Enumerable.Range(0, n).Select(_ => alphanumeric[random.Next(alphanumeric.Length)]).ToArray());

var ids = Enumerable.Range(0, 900).Select(_ => RandomString(6)).ToList();
var dates = Enumerable.Range(0, 1978).Select(i => new DateTimeOffset(2013, 1, 1, 0, 0, 0, TimeSpan.Zero) + TimeSpan.FromDays(i)).ToList();

var df1 = dates.SelectMany(date => ids.Select(id => (Id: id, Date: date, Value1: random.NextDouble(), Value2: random.NextDouble()))).ToList();
var df2 = dates.SelectMany(date => ids.Select(id => (Id: id, Date: date, Returns: random.NextDouble()))).ToList();

// replace NaNs and multiply Value1 and Value2
var value3s = df1.Select(row => (double.IsNaN(row.Value1) ? 0.0 : row.Value1) * (double.IsNaN(row.Value2) ? 0.0 : row.Value2));
// Within each id, rolling 5-observation sum
// (note: this assumes the group-flattened order still lines up with df1,
// which LINQ does not guarantee; a correct version would carry row indices)
var value3RollingSums = df1.Zip(value3s, (origRow, value3) => (origRow, value3))
    .GroupBy(row => row.origRow.Id, row => row.value3)
    .SelectMany(gr =>
    {
        var values = gr.ToList();
        var result = new double[values.Count];
        for (var i = 0; i < values.Count; i++)
            for (var j = Math.Max(0, i - 4); j <= i; j++)
                result[i] += values[j];
        return result;
    });
// Within each date, compute a zscore (same ordering caveat as above)
var value3Zscores = df1.Zip(value3RollingSums, (origRow, value3RollingSum) => (origRow, value3RollingSum))
    .GroupBy(row => row.origRow.Date, row => row.value3RollingSum)
    .SelectMany(gr =>
    {
        var values = gr.ToList();
        var average = values.Average(); // assume an implementation
        var standardDeviation = values.StandardDeviation(); // assume an implementation
        return values.Select(v => (v - average) / standardDeviation);
    });
// merge with df2
var df2Dict = df2.ToDictionary(row => (row.Id, row.Date));
var merged = df1.Zip(value3Zscores, (origRow, value3Zscore) => (origRow, value3Zscore, df2Row: df2Dict[(origRow.Id, origRow.Date)])).ToList();
var models = new List<LinearRegressionModel>();
// do linear regression
foreach (var date in merged.Select(row => row.origRow.Date).Distinct().OrderBy(date => date))
{
    var startDate = date - TimeSpan.FromDays(250);
    var data = merged.Where(row => row.origRow.Date >= startDate && row.origRow.Date <= date).ToList();
    // assume an implementation of LinearRegressionModel
    var model = LinearRegressionModel.Fit(data.Select(row => row.value3Zscore).ToArray(), data.Select(row => row.df2Row.Returns).ToArray());
    models.Add(model);
}

If anything I would say that the C# version here looks better than it actually is in real life. In real life, this isn't the only computation we're doing. We're trying out other ways of normalizing the data, we're inspecting the dataframe after every step to double-check that our calculations are doing what we meant to do, we're trying other models like Lasso and Ridge and Random Forests, we have more than just one variable that we're trying to fit, we're trying rolling sums and regressions based on days rather than observations and vice versa. I would say that throwing all of these things into the mix make it even harder to do in C# and only trivially more difficult to do in Python/pandas.

totalgit74 commented 6 years ago

I think the choice of indexing will be interesting. The df[df['int1'] % 2 == 0] style of indexing can, at best, be described as fugly. It smells a bit MatLab-ish. The R data.table syntax of dt[int1 %% 2 == 0] is somewhat more elegant and a touch SQL/LINQ like. It also handles numeric references such as dt[1:20], column subsetting such as dt[1:20, list(int1, int2, int3)], as well as grouping within the brackets. Now some of these things may be a step too far, with a .group() method being more appropriate, for instance. The ability to reference from the end of the dataframe, much like MatLab's m[end-9:end], or an open request like m[end-9:] or m[-9:], would also be good.
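For what it's worth, C# range syntax already gives arrays exactly this kind of "from the end" reference, so a frame indexer could plausibly reuse the same notation:

using System;
using System.Linq;

int[] m = Enumerable.Range(1, 100).ToArray();
int[] lastTen = m[^10..];       // like MatLab's m[end-9:end]
int[] head = m[..20];           // first 20 elements, like dt[1:20]
Console.WriteLine(lastTen[0]);  // 91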

What pandas does bring to the table are the inbuilt interpolation, periodicity change, rolling etc calculations whereas in R you resort to other libraries such as zoo.

I think that, as .NET is a little late here, it is worth trying to utilise the Apache Arrow and Parquet work, as you don't want to be both late to the party and the odd one out. Both Python and R are headed down this road, and if .NET wants to be part of the data analysis / machine learning party, .NET should go that way too.

mungojam commented 5 years ago

Sorry to spam, but any update on whether this is likely to happen? Our department might come back from the brink of python if a pandas-like library was on the horizon

danmoseley commented 5 years ago

cc @shauheen @chris-lauren

joshnyce commented 5 years ago

Just to chime in, this would make all the difference. Pandas/DataFrames have made Python exceptional for working with timeseries. LINQ and SQL Server are both very weak for timeseries. I don't know about ML, but there are lots of uses for Pandas outside of ML. Take finance for example. We need tools for working with timeseries and operations like joins in-memory because performance is critical, but we also need code to be stable and performant. It doesn't make sense to use Python for executing trades, capturing data from APIs, etc, so we end up passing data to/from Python. Oracle and timeseries-based databases like KDB can be helpful, but performing operations in code is often critical. Python.NET can bridge Pandas and .NET to a degree, but there is no efficient way to pass data to Python as a dataframe or to convert C# arrays or dictionaries to dataframes in Python. Time keeps going by and Microsoft keeps falling further behind, and some of the comments in this thread give the impression that Microsoft still fails to understand why Pandas is so popular.

nh43de commented 5 years ago

> Just to chime in, this would make all the difference. Pandas/DataFrames have made Python exceptional for working with timeseries. […]

This is a great conversation.

Has anyone started working on a spec for what this might look like as part of .NET?

chris-lauren commented 5 years ago

We are beginning to dig into this but are still fairly early in the process. I can assure you all we love pandas. It's an amazing library, and we fully appreciate the need for an equivalent in .NET. We are simply resource-constrained at the moment and would love contributions / help spec'ing out and implementing it.

nh43de commented 5 years ago

Awesome! Any existing resources you can point us to? Would love to help out

bryanrcarlson commented 5 years ago

I'm interested in helping too. I'm not going to steer the ship but I can help with grunt work.

voltcode commented 5 years ago

Some time has passed since my first comment in this thread. I'd seriously recommend anyone considering specifying this to look into the Apache Arrow project (an in-memory column-store model) and the projects building on top of it (like https://rapids.ai/; also https://arrow.apache.org/powered_by/). It's a huge chance to catch up by plugging into the existing data science ecosystem. It's too late to try to win against the others (Python, R); the last chance is to join them.

In terms of the actual API design, .NET could again leverage Arrow implementers' experience and pick the best route. Being able to efficiently interop on large data sets with other processes would be a huge win for all data-processing programs. GPU-interop is another great boon.

To sum up, I think Apache Arrow .NET implementation could be a good foundation for .NET data frame.

veikkoeeva commented 5 years ago

As a complete layman, I find looking at Arrow very appealing. It took a while to dig the Arrow pieces out of this thread, so I gather them here to save some trouble for others:

https://github.com/dotnet/corefx/issues/26845#issuecomment-391793741
https://github.com/dotnet/corefx/issues/26845#issuecomment-406106187

And the posts just below these.

nh43de commented 5 years ago

Looks like someone just finished an initial C# implementation of Arrow

https://github.com/apache/arrow/tree/master/csharp

bryanrcarlson commented 5 years ago

Good find, @nh43de.

Is this the way to go then: to build upon RecordBatch and implement an API on top (with array and index notation similar to pandas, and/or data manipulation verbs similar to dplyr)? Or should we still consider defining a DataFrame type that can load data in the Arrow format?

PyArrow has a Table class that is a "tool to help with wrangling multiple record batches..." (see PyArrow Tables).
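For a flavor of the C# implementation, the fluent RecordBatch builder from the apache/arrow csharp examples looks roughly like this (the API is young and details may have changed since):

using System.Linq;
using Apache.Arrow;
using Apache.Arrow.Memory;

var allocator = new NativeMemoryAllocator(alignment: 64);
RecordBatch batch = new RecordBatch.Builder(allocator)
    .Append("ids", false,
        col => col.Int32(array => array.AppendRange(Enumerable.Range(0, 10))))
    .Build();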

veikkoeeva commented 5 years ago

One thing to note is that a lot of data is in netCDF or geoTIFF format. By a lot I mean many of the scientific data sets holding large amounts of data. It would be great if the .NET story for working with these formats were better. If nothing else, maybe reading something like geoTIFF (these are used as "raster data sets") and storing the data in Arrow format for other work. Right now the tools are fairly non-existent, and even the ones wrapping tools like gdal_translate (to give search engine fodder) could benefit from more people improving them. I think this is closely related to the issues with .NET and geographic types, for which there's some push currently.

(There's https://github.com/predictionmachines/SDSlite but it is abandoned.)

Oceania2018 commented 5 years ago

We're building the underlying project for machine learning, which implements ndarray and dataframe. There is some discussion of SciSharp STACK's Pandas.NET at https://github.com/SciSharp/Pandas.NET/issues/13.

danmoseley commented 5 years ago

@pgovind is now looking at this.

xhochy commented 5 years ago

As a short heads-up: the Arrow C# implementation has been getting some momentum again over the last few days: https://github.com/apache/arrow/tree/master/csharp. It would be nice to have some further eyes and contributors on this, as none of the core Arrow developers are (yet) .NET experts.

eerhardt commented 5 years ago

@xhochy - I've been looking at the Arrow C# implementation for the past couple of days, and I have a few proposed changes that I'm working on - mostly performance enhancements.

https://issues.apache.org/jira/browse/ARROW-4502 https://issues.apache.org/jira/browse/ARROW-4503

I hope to find more places I can help out there.

jackimburgia commented 5 years ago

I used R and Python data frames for a long time. But being primarily a C# developer, I longed for two things: type safety and LINQ. So a few years ago I created a C# data frame that stores typed data and inherits from a generic list. Data wrangling is done with the familiar LINQ methods Where, GroupBy, Select, etc.

Is it more verbose than R or Python? Yes. Do you approach data analysis the same way? Possibly, but probably not. Does it handle large datasets well? It's OK... the eyeball test says R wins.

But, I used this every day for a year to do data analysis and was pleased. Please tell me what you think; I'm interested in hearing thoughts / comments. If this has legs I might be interested in collaborating to improve it; I haven't worked on it in a while:

https://www.nuget.org/packages/Frames/

Here are some examples:

Creating frames

Frame employees = Frame.ReadCSV<string, DateTime, double, double, double, double, double>(@"http://www.spearing.com/files/Employees.csv");

Frame frame = new Frame();

frame["Names"] = new string[] { "Bob", "Mary", "Joe" };
frame["StartDate"] = new DateTime[] {
    new DateTime(2016, 10, 1),
    new DateTime(2016, 6, 8),
    new DateTime(2017, 9, 2)
};

frame["Ages"] = col(41.0, 28.0, 35.0);
frame["LowScore"] = col(78.0, 81.0, 85.0);

frame["HighScore"] = new double[] { 90.0, 92.0, 87.0 }.ToColumn();

Retrieving data:

double averageAge = frame["Ages"].As<double>().Average();
double totalScore = frame["HighScore"].As<double>().Sum();

double[] ages = frame["Ages"].As<double>().ToArray();

List<double> lowScores = frame["LowScore"].As<double>();

Column calculations:

 frame["ScoreDiff"] = frame["HighScore"].As<double>() - frame["LowScore"].As<double>();
 frame["HighPlus1"] = frame["HighScore"].As<double>() + 1.0;

 frame["Hours"] = new double[] { 25, 30, 38 };
 double[] hourlyRate = new double[] { 15, 20, 12 };

 frame["Pay"] = frame["Hours"].As<double>() * hourlyRate;

Generic collections:

 public class Employee
 {
     public string Name { get; set; }
     public int Age { get; set; }
     public double HighScore { get; set; }
 }

 List<Employee> employees = new List<Employee>()
 {
     new Employee() {Name = "Bob", Age = 40, HighScore = 90.0 },
     new Employee() {Name = "Mary", Age = 28, HighScore = 92.0 },
     new Employee() {Name = "Joe", Age = 35, HighScore = 87.0 }
 };

 Frame employeesFrame = employees.ToFrame();

Filtering:

 DateTime startDate = new DateTime(2016, 9, 1);
 Frame newEmployees = frame
     .Where(row => row.Get<DateTime>("StartDate") >= startDate)
     .ToFrame(); 

Grouping:

 Frame empYearSummary = frame
     .GroupBy(row => row.Get<DateTime>("StartDate").Year)
     .Select(grp => new
     {
         Year = grp.Key,
         AverageAge = grp.Average(row => row.Get<double>("Ages")),
         Count = grp.Count()
     })
     .ToFrame();

mungojam commented 5 years ago

@jackimburgia that looks like just the sort of API that is needed and is really promising. Is it open source? I couldn't see any links. Does it store data as columns or rows?

jackimburgia commented 5 years ago

@mungojam - thanks for the feedback. It stores data as columns (typed collections); but typically I interacted with the data in the context of a row (I hope that makes sense).

I haven't opened it up yet. I put it on NuGet about a year and a half ago to gauge interest, to see if this is something the .NET community wants or needs.

eerhardt commented 5 years ago

FYI to anyone who is following this discussion. We've started a prototype of a .NET DataFrame over in corefxlab. https://github.com/dotnet/corefxlab/pull/2656. Feedback, contributions, help, etc. would be greatly appreciated.

MgSam commented 5 years ago

@eerhardt Awesome! You're probably already aware of the work on Variant, but IMHO it seems like it could be a great fit for a DataFrame to avoid boxing in the common case when you're working with primitive value types.

jackimburgia commented 5 years ago

Good luck with the new effort. I open sourced the Data Frames project I worked on for reference:

https://github.com/jackimburgia/Frames/tree/master/src

MedAnd commented 5 years ago

Wonder what the guys over at the MS Trill project think of this initiative?

cc @pgovind @cybertyche @YohDeadfall @eerhardt

cybertyche commented 5 years ago

(small puff of smoke) I HAVE BEEN SUMMONED

A few notes from what I see in the (wonderful) discussion above.

Typed and untyped need not be mutually exclusive

There appears to be a fairly even split between people who want the more concise, untyped DataFrame type and those who want a boxing-free, type-safe DataFrame. I would like to suggest that we may be able to achieve a superposition of having and eating cake here.

For instance, we may be able to have a situation wherein:

public sealed class DataFrame<T> : DataFrame

Alternatively, we could have:

public sealed class DataFrame : DataFrameBase
public sealed class DataFrame<T> : DataFrameBase

But the former would be far better. I think of it as a parallel to the relationship between LambdaExpression and Expression.
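A minimal sketch of what that superposition could look like (all names hypothetical, with an abstract base standing in for the untyped frame): the base offers convenient ad-hoc access, while the typed subclass provides a boxing-free path.

using System;
using System.Collections.Generic;

// Hypothetical shapes only, to illustrate the typed/untyped split.
public abstract class DataFrame
{
    // Untyped, ad-hoc access: convenient, but boxes value types.
    public abstract object GetCell(int row, string column);
}

public sealed class DataFrame<T> : DataFrame
{
    private readonly List<T> _rows = new List<T>();
    private readonly Dictionary<string, Func<T, object>> _columns;

    public DataFrame(Dictionary<string, Func<T, object>> columns)
        => _columns = columns;

    public void Add(T row) => _rows.Add(row);

    public override object GetCell(int row, string column)
        => _columns[column](_rows[row]);

    // The typed path hands back whole rows with no boxing.
    public T Row(int index) => _rows[index];
}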

I would personally set an extremely high bar for having separate implementations for ML.Net and for DataFrame

The scenarios in ML.Net and those described above have significant overlap. ML.Net has not yet achieved a 1.0 release. As noted above, there is already a DataTable class (one that has an extraordinary number of shortcomings relative to the needs stated by people here), but if we also have a DataFrame class that's aaaaaaaalmost IDataView but not quite, then I think we have failed. To that end, I hope the corefx and ML.Net teams are talking like nowish about where this goes. It need not be that DataFrame and IDataView refer to literally the same class; it could be like the relationship between DataTable and DataView in the Fx. But there's too much overlap in scenario here for there to be divergent evolution.

Trill could join this ecosystem as well if we had first-class time support

The main difference between an abstraction like DataFrame and IStreamable (Trill's basic abstraction) is support for the progression of time. ML.Net may also be looking to Trill for some of its capabilities such as windowing, so this is likely a good discussion to keep going.

Some of the more untyped-friendly operations can be done in the typed world with a little imagination and some code generation

Adding a column was brought up as a great example of an operation that can be ridiculous to do in a typed environment but much easier over an untyped set. One of the things we did in Trill was implement a set of overloads to the .Select() method that take operations like adding a column and take it from ridiculous down to maybe a mild laugh. It's not as concise as in untyped-world, but still better than otherwise. For adding a column, the result looks something like:

public static IStreamable<TKey, TNew> Select<TKey, TOld, TNew, TField1>(
    this IStreamable<TKey, TOld> source,
    Expression<Func<TNew>> initializer,
    Expression<Func<TNew, TField1>> fieldSelector1,
    Expression<Func<TOld, TField1>> fieldInitializer1) where TNew : new()

It allows you to do something like this:

IStreamable<TKey, OldType> stream = …;
var newStream = stream.Select(
    () => new NewType(),
    val => val.NewField,
    old => NewFieldValue(old));

Again, not perfect, but better than the alternative. Any field that is the same between OldType and NewType in the above is carried over automatically because MAGIC er I mean code generation. Trill's got a few methods that contain magic like that; for instance, Trill streams have Pivot() and Unpivot() methods.

Anyway, those are my first thoughts on the matter. Feel free to conjure me again. (disappears in a smaller puff of smoke)

stephentoub commented 5 years ago

Closing, as work is actively proceeding in https://github.com/dotnet/corefxlab/tree/master/src/Microsoft.Data and further discussion should happen in corefxlab. cc: @eerhardt, @pgovind

allisterb commented 5 years ago

@mungojam @hrichardlee I've been working on a DataFrame library for .NET which implements many of the ideas in this thread, and I'd appreciate some feedback when you have time:

F#: https://notebooks.azure.com/allisterb/projects/sylvester/html/Sylvester.DataFrame.ipynb
C#: https://dotnetfiddle.net/1K9tdy

joshnyce commented 5 years ago

@allisterb This is great. Implementing IEnumerable (making your dataframe implementation LINQ-friendly) sets it apart from Deedle. I haven't tried the commercial dataframe products for .NET, so I don't know if they are LINQ-friendly, but it looks like the implementation in corefxlab doesn't implement IEnumerable yet.