dotnet / corefxlab

This repo is for experimentation and exploring new ideas that may or may not make it into the main corefx repo.
MIT License

Column name indexing removed in .4? #2934

Closed JamesAlexander42 closed 4 years ago

JamesAlexander42 commented 4 years ago

Why was indexing into a DataFrame removed in the latest version? Looking at the commit history, I can see the block was deleted. This makes for a very awkward user experience now.

pgovind commented 4 years ago

Hello! Which API are you talking about? I still see this: https://github.com/dotnet/corefxlab/blob/f44a099c2a3bb8b0feedc92cdf9f66aba793a82c/src/Microsoft.Data.Analysis/DataFrame.cs#L72

JamesAlexander42 commented 4 years ago

I'm referring to the change I mentioned above.

pgovind commented 4 years ago

Ah I see. It still exists! We just moved that to the DataFrameColumnCollection class. See this for an example: https://github.com/dotnet/corefxlab/blob/f44a099c2a3bb8b0feedc92cdf9f66aba793a82c/tests/Microsoft.Data.Analysis.Tests/DataFrameTests.cs#L648

rhysparry commented 4 years ago

I think @Hamaze was specifically asking why the API was changed that way. Maybe you can point to the API review?

I know that I'd rather write:

df["Int3"] = df["Int1"] * 2 + df["Int2"];

As opposed to:

df.Columns["Int3"] = df.Columns["Int1"] * 2 + df.Columns["Int2"];

JamesAlexander42 commented 4 years ago

Yeah, the former is more pandas-esque and comfortable IMO.

MikaelUmaN commented 4 years ago

Agreed.

I suspect the reason is for the row filter to work.

But you are more often interested in column selection than row selection so it's better to have the penalty there instead.

df.Rows[2 .. 12]

df["Int3"] = df["Int1"] * 2 + df["Int2"];

pgovind commented 4 years ago

Just out of curiosity, are you using DataFrame in a notebook? Reason I ask is that we've worked on a cool extension for DataFrame in notebooks that'll let you write df.Int1 * 2 + df.Int2. To be specific, with the new extension you can now refer to a column as a field of a DataFrame object. With intellisense enabled in notebooks, this will be very discoverable.

JamesAlexander42 commented 4 years ago

I'm not using it in a notebook context for this exercise. I'm using it in an ASP.NET app.

MikaelUmaN commented 4 years ago

Haven't tested the new notebook support yet but will do.

I would say that even though notebooks are very useful, I much prefer the experience to be the same when doing normal software and when doing notebooks.

Usually I prototype in notebooks and then structure and copy stuff to some kind of software that is more production-like. So I would avoid using any extensions in notebook except for ones that are interactive such as plotting.

zyzhu commented 4 years ago

I concur with @MikaelUmaN.

For instance, I would expect code that runs in an F# kernel notebook to run directly in FSI under Visual Studio. I would also expect it to compile as part of a bigger production system that mixes C# and F#. That's how I explore my problems in IfSharp notebooks and put them in production all the time.

However, if syntax involving dataframe relies on an extra notebook extension that only works in notebook, the beauty of production-ready scripts is no longer feasible.

cc @cartermp @dsyme to chime in.

pgovind commented 4 years ago

Just tagging @eerhardt for visibility here. This is great feedback! We're busy helping out with .NET 5 stuff this week, but I'll revisit this next week. There's enough support here to consider bringing back the column name indexer on DataFrame.

eerhardt commented 4 years ago

"There's enough support here to consider bringing back the column name indexer on DataFrame."

I agree. Personally I like the ease of use of df["Int1"] as well, so I'm glad I'm not alone.

It should be pretty easy to add the API back as a wrapper over the .Columns[string] indexer, and a test or two. Anyone want to make a PR for that?
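
For anyone picking this up, here is a minimal sketch of the shape such a wrapper could take. ToyColumn, ToyColumnCollection and ToyDataFrame are stand-in types invented for illustration, not the real Microsoft.Data.Analysis classes; the only point is that the indexer on the frame forwards to the name-based indexer on its Columns collection.

```csharp
using System;
using System.Collections.Generic;

// Stand-in types for illustration only; these are NOT the Microsoft.Data.Analysis classes.
public sealed class ToyColumn
{
    public string Name { get; }
    public int[] Values { get; }

    public ToyColumn(string name, int[] values)
    {
        Name = name;
        Values = values;
    }
}

public sealed class ToyColumnCollection
{
    private readonly Dictionary<string, ToyColumn> _byName = new Dictionary<string, ToyColumn>();

    // The name-based indexer that already lives on the collection.
    public ToyColumn this[string columnName]
    {
        get
        {
            if (!_byName.TryGetValue(columnName, out ToyColumn column))
                throw new ArgumentException($"Column '{columnName}' does not exist.", nameof(columnName));
            return column;
        }
        set => _byName[columnName] = value;
    }
}

public sealed class ToyDataFrame
{
    public ToyColumnCollection Columns { get; } = new ToyColumnCollection();

    // The convenience wrapper: forwards straight to Columns[string].
    public ToyColumn this[string columnName]
    {
        get => Columns[columnName];
        set => Columns[columnName] = value;
    }
}
```

With these stand-ins, df["Int1"] and df.Columns["Int1"] return the same column object, so both spellings can coexist.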

dsyme commented 4 years ago

"Just out of curiosity, are you using DataFrame in a notebook? Reason I ask is that we've worked on a cool extension for DataFrame in notebooks that'll let you write df.Int1 * 2 + df.Int2. To be specific, with the new extension you can now refer to a column as a field of a DataFrame object. With intellisense enabled in notebooks, this will be very discoverable."

"However, if syntax involving dataframe relies on an extra notebook extension that only works in notebook, the beauty of production-ready scripts is no longer feasible."

Yes, we need to be very careful about promoting non-standard extensions to the programming model for C# or F# which are deployed only through select channels. Notebook programming should ideally not be using variations of these programming languages, though these things are subtle.

This is a tricky area because there is a notable tendency to use the incremental dynamicity of notebook programming.

@pgovind What APIs are you using to craft this language variation? Please discuss this with @MadsTorgersen, @jaredpar and myself. We can't have random variations on C# and F# floating around that fragment the overall programming experience.

pgovind commented 4 years ago

So, just to be clear, the extension I'm talking about here is only a prototype to explore the dotnet-interactive extensions APIs. There are no immediate plans to productize it right now, and we definitely don't want to create fragmentation. It lives here: https://github.com/dotnet/interactive/blob/main/src/Microsoft.DotNet.Interactive.ExtensionLab/DataFrameTypeGeneratorExtension.cs

"What APIs are you using to craft this language variation?"

It's not really a variation. It's a prototype right now (and not part of the type itself). Basically, given a DataFrame object, it looks at the types of the columns and spits out code to create a new SomeNameDataFrame type with the column names as properties. This code is then compiled on demand and the dotnet-interactive shell then exposes this type for use in the notebook.
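
To make the "spits out code" step concrete, here is a rough sketch of the idea and not the extension's actual implementation. TypedFrameSourceGenerator and its Generate method are hypothetical names, the sketch assumes column names are already valid C# identifiers, and it stops at producing source text rather than compiling it.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not taken from the dotnet/interactive prototype.
public static class TypedFrameSourceGenerator
{
    // Builds C# source for a class that exposes one array-typed property per column.
    public static string Generate(string typeName, IEnumerable<KeyValuePair<string, Type>> columns)
    {
        var sb = new StringBuilder();
        sb.AppendLine($"public class {typeName}");
        sb.AppendLine("{");
        foreach (var column in columns)
        {
            // Assumption: column.Key is a valid C# identifier (no spaces, no keywords).
            sb.AppendLine($"    public {column.Value.FullName}[] {column.Key} {{ get; set; }}");
        }
        sb.AppendLine("}");
        return sb.ToString();
    }
}
```

Given the columns Int1 and Int2 of type int, the generated text declares a class with Int1 and Int2 properties, which is what would let notebook code say df.Int1 once that text is compiled.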

dsyme commented 4 years ago

@pgovind The problem is that this sort of "generating API from dynamic data" is a completely new thing in the .NET universe (the closest thing is F# type providers, and then source generators, though those are normally part of the static toolchain).

It doesn't really fit any part of the existing C#/F#/.NET programming model and can never really be incorporated into project-based programming, for example. It can only be done in notebook-like environments that assume a complete compiler toolchain at each stage of execution, even in production scripts.

It's a powerful thing to be sure but we have to be aware of the direction this is going. I understand why you're thinking of doing this but yes, fragmentation of the programming experience is an intrinsic part of this direction, as tempting as it is.

An approach that does fit within existing norms is to drive the code generation off some kind of static schema (declared or acquired).