dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

DataFrame (Microsoft.Data.Analysis) Tracking Issue #6144

Open luisquintanilla opened 2 years ago

luisquintanilla commented 2 years ago

Summary

This issue tracks priorities and discussions around DataFrame improvements based on issues and feedback.

Microsoft.Data.Analysis Open Issue Query: https://github.com/dotnet/machinelearning/issues?q=is%3Aopen+is%3Aissue+label%3AMicrosoft.Data.Analysis

Work Items

Related Issues

Create DataFrames

Data Formatting

Data Sources

Other

Reshape DataFrames

Filter / Sort DataFrames

Combine DataFrames

Group DataFrames

Summarize DataFrames

Handle Missing Data

DataTypes

Array / Vector / VBuffer

DateTime

Other

Misc

Bigwiggz commented 2 years ago

Can you please finish the dataframe? I realize it is not a high priority, but this would be a great addition to .NET, even just for data wrangling. Also, is there a way to convert the dataframe to a C# List of a custom class?

Bigwiggz commented 2 years ago

I was wondering if there could be a LoadExcel method on the dataframe for .xlsx or .xls files, like Pandas has?

luisquintanilla commented 2 years ago

Can you please finish the dataframe? I realize it is not a high priority, but this would be a great addition to .NET, even just for data wrangling. Also, is there a way to convert the dataframe to a C# List of a custom class?

Not built-in, but one way you might want to try doing it is by accessing the DF rows. Something like:

image
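
The screenshot is not reproduced here. As a minimal sketch of the row-based idea, assuming a DataFrame with a string column `Name` and an Int32 column `Age`, each `DataFrameRow` can be mapped to an object by hand. The `Person` type, the `DataFrameMapping` helper, and the column names are hypothetical, and indexing a row by column name assumes a reasonably recent Microsoft.Data.Analysis package:

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.Data.Analysis;

// Hypothetical target type; not part of Microsoft.Data.Analysis.
public record Person(string Name, int Age);

public static class DataFrameMapping
{
    // Walks df.Rows and casts each cell (returned as object) to the property type.
    // Assumes the columns "Name" and "Age" exist and contain no nulls.
    public static List<Person> ToPeople(DataFrame df) =>
        df.Rows
          .Select(row => new Person((string)row["Name"], (int)row["Age"]))
          .ToList();
}
```

A null cell would make the `(int)` cast throw, so real data probably needs a null check or nullable properties.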

Bigwiggz commented 2 years ago

Thanks!

nodakai commented 2 years ago

Can you please add first-class support for time series analysis? I see lots of outcry in the GH Issues... First of all, DateTime columns are unusable in VS Code due to #5698.

MikaelUmaN commented 2 years ago

Hey, This is a great summary and a good list of features.

I am not sure why this doesn't seem to have higher priority. I can tell you that from an industry perspective, a proper dataframe solution, especially one with time series functionality, is a prerequisite for dotnet to be competitive in the analysis space. I can't leave Python even if I wanted to.

Great list, hope to see this implemented!

artemiusgreat commented 2 years ago

It looks like neither DataFrame nor IDataView can convert the element type of a vector. Developers can convert a column of type String to Single, but can't convert String[] to Single[].

I'm trying to use vectors of variable length as features. Here is the data view with the features:

Label  |  Feature1  |  Feature2
--------------------------------
1      |  5#15#25   |  25#75
2      |  10        |  100#65#0
3      |  115#90    |  5#80

Using pipeline estimator TokenizeIntoWords, I can split selected features into a vector of words.

var pipeline = Context
  .Transforms
  .Conversion
  .MapValueToKey("Output", "Label")
  .Append(Context.Transforms.Text.TokenizeIntoWords("Tokens", "Feature1", new[] { '#' }));

After this, how can I specify that Tokens is a vector of numbers, not strings? Also, how do I normalize this vector, considering that this is a row-based vector and standard ML transformers, like NormalizeMinMax, apply to column-based data?
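
Outside the pipeline, the conversion itself is straightforward. A minimal sketch in plain C# (no ML.NET calls; the `FeatureParsing` class and its method names are made up for illustration) that splits a `#`-delimited feature into a `float[]` and min-max scales it within the row:

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class FeatureParsing
{
    // Splits a delimited string such as "5#15#25" and parses each token as a float.
    public static float[] ParseVector(string raw, char separator = '#') =>
        raw.Split(new[] { separator }, StringSplitOptions.RemoveEmptyEntries)
           .Select(token => float.Parse(token, CultureInfo.InvariantCulture))
           .ToArray();

    // Scales one row's vector to [0, 1]; a constant vector maps to all zeros.
    public static float[] NormalizeRow(float[] values)
    {
        if (values.Length == 0) return values;
        float min = values.Min();
        float range = values.Max() - min;
        return values.Select(v => range == 0f ? 0f : (v - min) / range).ToArray();
    }
}
```

Applying something like this before the data reaches the pipeline sidesteps the vector type conversion question rather than answering it.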

artemiusgreat commented 2 years ago

The implementation below is rough (not thread-safe, not GC-friendly, etc.), but it shows what would be nice to have available in DataFrame. Besides support for vector types, it would be nice to have more flexible custom mapping: a Converter that doesn't force you to specify input and output types.

using Microsoft.ML;
using Microsoft.ML.Data;
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

namespace DemoSpace
{
  public class View : IDataView
  {
    protected IDictionary<string, Func<object, object>> _cache = new Dictionary<string, Func<object, object>>();

    /// <summary>
    /// Accessors
    /// </summary>
    public virtual IList Items { get; set; }
    public virtual DataViewSchema Schema { get; set; }
    public virtual IDictionary<string, DataViewType> Columns { get; set; }
    public virtual Func<IEnumerator, DataViewSchema.Column, dynamic> Converter { get; set; }

    /// <summary>
    /// View implementation
    /// </summary>
    public virtual bool CanShuffle => false;
    public virtual long? GetRowCount() => Items.Count;
    public virtual DataViewRowCursor GetRowCursor(IEnumerable<DataViewSchema.Column> columns, Random seed = null) => new ViewCursor(this);
    public virtual DataViewRowCursor[] GetRowCursorSet(IEnumerable<DataViewSchema.Column> columns, int o, Random seed = null) => new[] { GetRowCursor(columns, seed) };

    /// <summary>
    /// Constructor
    /// </summary>
    /// <param name="columns"></param>
    /// <param name="items"></param>
    public View(IDictionary<string, DataViewType> columns, IList items)
    {
      var schemaBuilder = new DataViewSchema.Builder();

      foreach (var column in columns)
      {
        schemaBuilder.AddColumn(column.Key, column.Value);
      }

      Items = items;
      Columns = columns;
      Schema = schemaBuilder.ToSchema();
    }

    /// <summary>
    /// Converter
    /// </summary>
    /// <param name="enumerator"></param>
    /// <param name="column"></param>
    /// <returns></returns>
    public virtual dynamic ConverterImplementation(IEnumerator enumerator, DataViewSchema.Column column)
    {
      var name = column.Name;

      if (_cache.ContainsKey(name) is false)
      {
        _cache[name] = enumerator.Current.GetType().GetProperty(name).GetValue;
      }

      var value = _cache[name](enumerator.Current);

      if (value is string)
      {
        return $"{ value }".AsMemory();
      }

      if (value is IList)
      {
        var items = value as IList;
        var itemType = items.GetType().GetElementType();

        switch (true)
        {
          case true when Equals(itemType, typeof(int)): return new VBuffer<int>(items.Count, (items as IList<int>).ToArray());
          case true when Equals(itemType, typeof(float)): return new VBuffer<float>(items.Count, (items as IList<float>).ToArray());
          case true when Equals(itemType, typeof(double)): return new VBuffer<double>(items.Count, (items as IList<double>).ToArray() );
          case true when Equals(itemType, typeof(string)): return new VBuffer<ReadOnlyMemory<char>>(items.Count, (items as IList<string>).Select(o => $"{o}".AsMemory()).ToArray());
        }
      }

      return value;
    }

    /// <summary>
    /// Cursor
    /// </summary>
    public class ViewCursor : DataViewRowCursor
    {
      protected int _index = -1;
      protected View _view = null;
      protected IEnumerator _enumerator = null;

      /// <summary>
      /// Cursor implementation
      /// </summary>
      public override long Batch => 0;
      public override long Position => _index;
      public override DataViewSchema Schema => _view.Schema;
      public override bool IsColumnActive(DataViewSchema.Column column) => true;
      public override ValueGetter<DataViewRowId> GetIdGetter() => (ref DataViewRowId id) => id = new DataViewRowId();

      public ViewCursor(View view)
      {
        _view = view;
        _enumerator = _view.Items.GetEnumerator();
      }

      public override ValueGetter<TValue> GetGetter<TValue>(DataViewSchema.Column column)
      {
        return (ref TValue value) => value = _view.Converter(_enumerator, column);
      }

      protected override void Dispose(bool disposing) => base.Dispose(disposing);

      public override bool MoveNext()
      {
        if (_enumerator.MoveNext())
        {
          _index++;

          return true;
        }

        Dispose();

        return false;
      }
    }
  }
}

Usage example

var columns = new Dictionary<string, DataViewType>
{
  ["Id"] = NumberDataViewType.Int32,
  ["Name"] = TextDataViewType.Instance,
  ["Ints"] = new VectorDataViewType(NumberDataViewType.Int32),
  ["Doubles"] = new VectorDataViewType(NumberDataViewType.Double),
  ["Strings"] = new VectorDataViewType(TextDataViewType.Instance),
};

var items = new List<object>
{
  new { Id = 1, Name = "A", Doubles = new double[] { 10, 200 }, Ints = new int[] { 1, 2, 3 }, Strings = new string[] { "" } },
  new { Id = 2, Name = "B", Doubles = new double[] { 5, 20, 15 }, Ints = new int[] { 1 }, Strings = new string[] { "" } },
  new { Id = 3, Name = "C", Doubles = new double[] {}, Ints = new int[] { }, Strings = new string[] { "" } }
};

var view = new View(columns, items);

view.Converter = view.ConverterImplementation;  // Use predefined converter
view.Converter = (enumerator, column) =>        // Or define a custom one
{
  // Read the property that matches the requested column from the current item
  var value = enumerator.Current.GetType().GetProperty(column.Name).GetValue(enumerator.Current);

  if (value is double)
  {
    return (double)value;
  }

  return value;
};

var x = view.Preview();
var x1 = view.GetColumn<string>("Name");

chriss2401 commented 1 year ago

Hi @luisquintanilla, thank you for this overview. Would you consider adding my issue as well? https://github.com/dotnet/machinelearning/issues/5652

Edit: And also maybe this one https://github.com/dotnet/machinelearning/issues/5656

MgSam commented 1 year ago

The latest ML.NET blogpost briefly talks about some investment in DataFrame. Are you guys planning on having one or more people work full time on this project (as it so desperately requires)?

luisquintanilla commented 1 year ago

Hi @MgSam,

Thanks for pointing out the mention of DataFrame 🙂

Unfortunately at this time I don't have additional details to share on who or how many people will be working on the areas outlined in the blog post:

What I can say is that we will continue making improvements to the APIs over the next few months across those themes. We've made some progress over the past year by adding DateTime support and soon VBuffer support as well.

Although this list looks conservative and doesn't reflect the work that's still needed on these APIs, we want to make sure we continue to make progress while still setting accurate expectations about what we expect to deliver.

That being said, we are happy to work with the community to help speed up this process in helping the DataFrame be the best it can be.

With that in mind, would you be interested in contributing to the DataFrame and if so, how can we help you be successful in making those contributions?

ingted commented 1 year ago

Without a comparable dataframe, ML.NET loses at the Pandas phase... People give up and embrace Python before they have even started their machine learning experience in ML.NET...

ingted commented 1 year ago

How is there still no way to insert the dataframe into a database (I know I could do row-based operations myself), only saveCSV... OMG

TheJanzap commented 1 year ago

I'm not sure if this is covered by https://github.com/dotnet/machinelearning/issues/6499 , but one of my annoyances when working with DataFrames is accessing values in a DataFrameRow. These are always returned as Object instead of the type specified by the DataFrameColumn, so I need to cast them back to their original type on every access.

It would be great if this could be improved, as it would lead to less cluttered code!
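
For comparison, one workaround is to go through the typed column rather than the row. A minimal sketch, assuming an Int32 column named `Age` (the column name and the `TypedAccess` helper are hypothetical):

```csharp
using Microsoft.Data.Analysis;

public static class TypedAccess
{
    // Row access returns object, so every read needs a cast (unboxing an int here).
    public static int AgeViaRow(DataFrame df, long rowIndex) =>
        (int)df.Rows[rowIndex]["Age"];

    // Casting the column once gives typed (nullable) values on every access.
    public static int? AgeViaColumn(DataFrame df, long rowIndex)
    {
        var ages = (Int32DataFrameColumn)df["Age"];
        return ages[rowIndex];
    }
}
```

This moves the cast from every cell to once per column, but it does not help when iterating over rows.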

asmirnov82 commented 1 year ago

With that in mind, would you be interested in contributing to the DataFrame and if so, how can we help you be successful in making those contributions?

Hello @luisquintanilla, I am interested in contributing to the DataFrame and have already opened some PRs. Could you please review them?

#6698 #6681 #6678 #6677 #6676 #6675

luisquintanilla commented 1 year ago

With that in mind, would you be interested in contributing to the DataFrame and if so, how can we help you be successful in making those contributions?

Hello @luisquintanilla, I am interested in contributing to the DataFrame and have already opened some PRs. Could you please review them?

#6698 #6681 #6678 #6677 #6676 #6675

@asmirnov82 Thanks so much for those contributions! Apologies for the delay. We'd been busy getting our latest release published 🙂. We'll take a look at those PRs.

asmirnov82 commented 10 months ago

Hello, @luisquintanilla

More than a year has passed since this roadmap was initially posted, and there are only two months left before the planned release in mid-November. It seems practically impossible to me that the remaining tasks in this list will be completed in time.

That's why I am wondering if you have any news or insights about future plans for ML.NET and the DataFrame?

What are the current priorities for these libraries? Do you have any thoughts on how development of ML.NET and DataFrame will proceed in 2024?

Several years ago DataFrame was developed with the aim of becoming part of the .NET Core framework; then it was moved into the ML.NET repository, and currently it is not developed very actively.

Microsoft seems to have switched all efforts to prompt engineering and ChatGPT-like stuff. In your opinion, is it just a temporary switch or a permanent trend?

I still see importance in improving the DataFrame library and high demand for something similar to Pandas, but written in C# by .NET programmers. There are 1.2K daily downloads of the Microsoft.Data.Analysis NuGet package, despite the fact that the previous version is very buggy, has limited functionality, and is in a prerelease state. I am sure that its usage would increase dramatically if development were actively ongoing.

I have practically finished fixing the most critical bugs found in previous versions of the DataFrame (such as incorrect handling of NULL values, the 2 GB limitation, issues with CSV reading, and so on) and am now going to concentrate my efforts on improving DataFrame performance.

I created an epic where I listed all the existing performance-related issues and added several new ones:

#6824

Could you please take a look and provide your thoughts? I would like to have any feedback before I start development.

Regarding your question: “With that in mind, would you be interested in contributing to the DataFrame and if so, how can we help you be successful in making those contributions?”

At this point, the main showstopper for me is the time required for my PRs to be reviewed and merged into main.

I try to create small PRs, so that each change is more or less obvious and easy to review (with this in mind, I split the performance work into a list of subtasks). The disadvantage of this approach is that I quite often depend on previously finished tasks to continue development. I need each change to be merged into main to avoid future merge conflicts and continue working on the task (or I have to switch to a completely different task, like switching from DataFrame arithmetic to loading a DataFrame from a CSV file). If my PRs are not merged for a week or two, I just have to stop all development and wait.

Would it be possible to have more people dedicated to reviewing PRs? The current situation with PRs prevents the community from being actively involved in product development.

Another issue is flaky unit tests. Some unit tests fail from time to time even on the main branch without any changes, so the build has to be rerun 2-3 or even more times to get everything green.

ingted commented 10 months ago

Couldn't agree MORE!!! We need a good DF to work in the data science field of the .NET ecosystem.

MikaelUmaN commented 10 months ago

Agree 100000000%

mungojam commented 10 months ago

I try to create small PRs, so that each change is more or less obvious and easy to review (with this in mind, I split the performance work into a list of subtasks). The disadvantage of this approach is that I quite often depend on previously finished tasks to continue development. I need each change to be merged into main to avoid future merge conflicts and continue working on the task (or I have to switch to a completely different task, like switching from DataFrame arithmetic to loading a DataFrame from a CSV file). If my PRs are not merged for a week or two, I just have to stop all development and wait.

Well done and thanks for all your efforts. One thing that might help with this PR issue is to branch off from your unmerged earlier work and create draft PRs that target those earlier branches. GitHub automatically updates the target branch to main once the earlier ones are merged, so you don't get blocked; you only need to resolve merge conflicts once and then flow them through the dependent branches.

pmcgeebcit commented 9 months ago

I agree with the comments from asmirnov82, ingted and MikaelUmaN. More focus on the DataFrame is essential. It is such a fundamental piece that enables data models. In my opinion DataFrame development should be a top priority if Microsoft is serious about enabling machine learning. I like a lot of what Microsoft has done with their toolset and think DataFrame improvement would be an enormous boost.

IntegerMan commented 7 months ago

I'd like an easier way of getting values for a given column in a DataFrame into an array of values. This is a need related to providing a sequence of values to various charting libraries, particularly in Polyglot Notebooks.

It's possible, but it's not pretty (nor is the chart that rendered, but that's my fault in this example): image

I'd love .ToList() and .ToArray() methods that could more easily pull data out of a column, or just more fluent ways of doing this than are present today.

I think I can work around this for now, and for a book project I'm taking on, by writing a small extension method, but I'd like there to be something built in.

This is related to the issue @jonsequitur linked above in the interactive repo.

asmirnov82 commented 7 months ago

Hi @IntegerMan, there is an easier way to pull data out of a column. As each column implements IEnumerable<T?>, you can use

var dataX = ((SingleDataFrameColumn)df["Credit Amount"] ).ToArray();

However, there are several issues with this approach. The first is the use of the Single data type (not Double) in your example: SingleDataFrameColumn stores float values, which you have to explicitly convert to double to pass them to ScottPlot. The second is that DataFrame can store nulls, so the actual type of dataX is float?[]. In fact, using Select(x => (double)x).ToArray() may fail with System.InvalidOperationException "Nullable object must have a value." if the column contains any null value.

So my suggestion is:

1) Use DoubleDataFrameColumn instead of SingleDataFrameColumn
2) Write your own small extension method for IEnumerable<T?>

public static T[] ConvertToArray<T>(this IEnumerable<T?> enumerable)
    where T : struct
{
    return enumerable.Select(x => x ?? default).ToArray();
}
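
A quick check of how nulls behave, using plain nullable doubles as a stand-in for a DataFrame column. The `NullableDemo` class name is made up, and the method body repeats the extension above so that the snippet compiles on its own:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NullableDemo
{
    // Same logic as ConvertToArray above: nulls are replaced by default(T).
    public static T[] ConvertToArray<T>(this IEnumerable<T?> enumerable)
        where T : struct
    {
        return enumerable.Select(x => x ?? default(T)).ToArray();
    }

    // A direct cast throws InvalidOperationException on the first null value.
    public static bool DirectCastThrows(IEnumerable<double?> values)
    {
        try
        {
            values.Select(x => (double)x).ToArray();
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;
        }
    }
}
```

Keep in mind that nulls become 0 with this approach, which shifts points on a chart; filtering them out instead is an option when both axes can be filtered consistently.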

After that you will be able to simplify your code to:

var dataX = ((DoubleDataFrameColumn)df["Credit Amount"] ).ConvertToArray();
var dataY = ((DoubleDataFrameColumn)df["Age"] ).ConvertToArray();

var plt = new ScottPlot.Plot(300, 400);
plt.AddScatter(dataX, dataY);

lucvalentino commented 3 months ago

I could not find a way to apply lambdas to DF or to results of GroupBy (like the Apply function in Pandas). In particular, after a GroupBy the only way to apply lambdas to the groups and columns is to loop through the groups and add the new transformed rows into a new DF. Something like this:

        var dataFrameGroupBy = dataFrame.GroupBy<int>("Dimension1");

        var cols = new List<DataFrameColumn>
        {
            new Int32DataFrameColumn("Identifier"),
            new DoubleDataFrameColumn("Value1"),
            new DoubleDataFrameColumn("Value2"),
            new Int32DataFrameColumn("Dimension1"),
            new Int32DataFrameColumn("Dimension2")
        };

        var df = new DataFrame(cols);

        foreach (var group in dataFrameGroupBy.Groupings)
        {
            int id = 1;
            var value1= group.Max<DataFrameRow, double>(each => (double)each["Value1"]);
            var value2= group.Min<DataFrameRow, double>(each => (double)each["Value2"]);
            var key = group.Key;
            var dim2 = group.Max<DataFrameRow, int>(each => (int)each["Dimension2"]);

            df.Append(new KeyValuePair<string, object>[]
            {
                new("Identifier", id), new("Value1", value1), new("Value2", value2),
                new("Dimension1", key), new("Dimension2", dim2)
            }, inPlace: true);
        }

Is there a better way to achieve this? Is there any plan to implement an Apply method?

asmirnov82 commented 3 months ago

Hi @lucvalentino. DataFrame columns have two methods that take a lambda and apply it to the column content: ApplyElementwise and Apply. However, because the lambda is generic, these methods are defined on the classes that inherit from the base DataFrameColumn class.

For example for PrimitiveDataFrameColumn<T>:

void ApplyElementwise(Func<T?, long, T?> func)
PrimitiveDataFrameColumn<TResult> Apply<TResult>(Func<T?, TResult?> func) where TResult : unmanaged

For ArrowStringDataFrameColumn:

Apply(Func<string, string> func)

For some reason StringDataFrameColumn and VBufferDataFrameColumn don't support such functionality.

ApplyElementwise applies the lambda in place, while Apply returns a new column as a result and allows the result to have a different underlying data type.
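
A minimal sketch of the two column-level methods, following the signatures listed above (the `ApplyDemo` helper and the doubling and truncation logic are made up for illustration):

```csharp
using Microsoft.Data.Analysis;

public static class ApplyDemo
{
    public static PrimitiveDataFrameColumn<int> DoubleThenTruncate(DoubleDataFrameColumn column)
    {
        // In place: each value is doubled; rowIndex is available but unused here.
        column.ApplyElementwise((double? value, long rowIndex) => value * 2);

        // New column with a different element type; nulls stay null.
        return column.Apply<int>(value => value.HasValue ? (int)value.Value : (int?)null);
    }
}
```

After the call, the input column holds the doubled values and the returned column holds their truncated integer parts.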

As for grouping, unfortunately the current implementation of the GroupBy class (the result of the DataFrame.GroupBy method) doesn't seem to allow different aggregation functions for different columns. I think the simplest way to achieve the expected result is to use code like this:

var dataFrameGroupBy = dataFrame.GroupBy<int>("Dimension1");
var max = dataFrameGroupBy.Max("Value1", "Dimension2");
var min = dataFrameGroupBy.Min("Value2");
// Join on the group key: each aggregated frame holds "Dimension1" plus the listed columns
var res = max.Merge<int>(min, "Dimension1", "Dimension1");

This implementation can be very CPU- and memory-intensive for large DataFrames, so I absolutely agree with you that the current GroupBy API requires a redesign.

lucvalentino commented 3 months ago

Thanks @asmirnov82 for your prompt reply and clarifications!

I saw those Apply and ApplyElementwise methods at column level, and also realized that they were not available for StringDataFrameColumn. But I thought there was a way to apply lambdas directly to DFs similarly to Pandas or PySpark.

It would be nice to make the Apply and ApplyElementwise available to all column types.

And yes, the GroupBy would need some redesign; with the current design, the user gets somewhat stuck after that operation.

But hey, thanks to all developers contributing to this project!