Open aslotte opened 5 years ago
cc @pgovind - thoughts?
> This may be possible to do by creating a generic version of a DataFrame
What generic would you imagine here?
```csharp
public class Customer
{
    public string FirstName;
    public string LastName;
}

DataFrame<Customer> df = DataFrame<Customer>.ReadCsv(customerDataPath);
df.FirstName // this gets me the FirstName column?
```
Is something like that what you are envisioning?
@eerhardt - yep, exactly, something along those lines. The DataFrame class can certainly be a copy of Pandas' implementation in .NET, but I think it would be really cool if we could make it a bit more .NET friendly :)
Given that a large crowd of the people using ML.NET and DataFrame come from a .NET background, it would be neat if we could leverage generics, LINQ, and other C# syntax to work with the DataFrame. I can't say whether that's possible from a performance standpoint, though. I'd be more than happy to help out as time permits.
Interesting. There's a good chance the DataFrame API eventually converges to something like this. This is the second time this has come up; the first time is here. The DataFrame type sits at the intersection of data scientists and engineers, so there's always been this constant question of how strongly typed the API should be. The reasons I haven't made the APIs very strongly typed yet are:
a) I was targeting a natural DataFrame + Jupyter experience in .NET, and we didn't have IntelliSense working in Jupyter yet. I assume that will happen in the future, and at that point there'd be a stronger case for this IMO.
b) I wanted to be conservative with the API initially. For example, the datasets I've seen for most ML tasks tend to have many columns. In a notebook, defining a schema such as Customer, but with say 15 columns, didn't seem natural to me. My impression was that users might be turned off by all the code they'd have to type (especially without IntelliSense/code completion) just to read in a CSV file. Without code completion, I figured a weakly typed API was the way to go, since it made code shorter and the intent clearer.
Having said all that, there are places where I really wish I had more type information: the Merge/Join APIs, or indexing (like you mention here), for example. It'd also help with AppendRow and, I'm sure, in other places. I'll keep this in mind as we go along. A generic DataFrame derived from a base DataFrame would also fit in nicely with .NET for Spark scenarios.
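For example, appending a row today means passing untyped values, while a generic DataFrame could let the compiler check the row. A rough sketch, reusing the Customer type from above (the generic API is hypothetical, and Append here stands in for whatever shape AppendRow ends up with):

```csharp
using System.Collections.Generic;
using Microsoft.Data.Analysis;

DataFrame df = DataFrame.ReadCsv("customers.csv");

// Today: row values are boxed objects, validated only at runtime.
df.Append(new List<object> { "Ada", "Lovelace" });

// Hypothetical generic API: column names and types checked at compile time.
DataFrame<Customer> typed = DataFrame<Customer>.ReadCsv("customers.csv");
typed.Append(new Customer { FirstName = "Ada", LastName = "Lovelace" });
```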
Thoughts? Feedback? We definitely don't want the API to feel alien to .NET developers!
Thank you for the detailed answer, @pgovind! That all makes a lot of sense, and I can certainly see why the DataFrame is built the way it is today.
With that said, it would be awesome to have the option for both, and if that's not on the roadmap yet, that's okay :) I was thinking about it last night, and it may be possible to achieve without asking the user to specify the entire schema (which I agree can be a bit ugly).
We should be able to infer the schema from the first row (if such a row exists) and, in theory, create a new type based on that and the inferred data types using reflection: https://docs.microsoft.com/en-us/dotnet/api/system.reflection.emit.typebuilder.createtype?view=netframework-4.8. I haven't tried that myself, so I'm not sure it would actually work.
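To sketch the reflection-emit idea (untested; the column list here is a stand-in for whatever gets inferred from the first row):

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

public static class InferredSchemaBuilder
{
    // Builds a type at runtime from inferred column names/types.
    // Public fields keep the sketch short; real code would likely emit properties.
    public static Type CreateRowType(string typeName, (string Name, Type Type)[] columns)
    {
        var assembly = AssemblyBuilder.DefineDynamicAssembly(
            new AssemblyName("InferredSchemas"), AssemblyBuilderAccess.Run);
        var module = assembly.DefineDynamicModule("Main");
        var typeBuilder = module.DefineType(typeName, TypeAttributes.Public);

        foreach (var (name, type) in columns)
        {
            typeBuilder.DefineField(name, type, FieldAttributes.Public);
        }

        return typeBuilder.CreateType();
    }
}

// e.g. InferredSchemaBuilder.CreateRowType("Customer",
//          new[] { ("FirstName", typeof(string)), ("LastName", typeof(string)) });
```

The catch is that a type emitted at runtime has no compile-time members to dot into, so on its own this only helps dynamic scenarios.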
Thank you for all your contributions to this framework! When it comes to being ".NET friendly", the biggest ask I'd have is to enable LINQ queries on the columns or rows, to filter, select, and project data.
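For concreteness, here's roughly the kind of query I'm imagining, assuming df.Rows enumerates DataFrameRow; the column positions and types are made up for a hypothetical housing.csv, and I haven't checked how this would perform against the column-oriented storage:

```csharp
using System.Linq;
using Microsoft.Data.Analysis;

DataFrame df = DataFrame.ReadCsv("housing.csv");

// Filter rows on one column, then project a couple of columns out.
var expensive = df.Rows
    .Where(row => (float)row[0] > 200_000f)
    .Select(row => new { Value = (float)row[0], Age = (float)row[1] })
    .ToList();
```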
> We should be able to infer the schema from the first row (if such a row exists) and, in theory, create a new type based on that and the inferred data types using reflection
@aslotte In the .NET notebook we wouldn't need to use reflection. We're already compiling code submissions, so we could, for example, introduce a magic command that generates and compiles this on demand, after which the generated type would be available for the duration of the notebook session. It might look something like this:
```
%%compile-dataframe --data c:\housing.csv --type MyData

MyData df = MyData.ReadCsv(@"c:\housing.csv");
IntColumn populationCol = df.population;
```
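Purely as an illustration, the code the magic command generates and compiles behind the scenes might be a thin wrapper along these lines (IntColumn above is shorthand; in the current API the column type would be something like PrimitiveDataFrameColumn<int>):

```csharp
using Microsoft.Data.Analysis;

// Hypothetical generated code: one typed member per column inferred
// from housing.csv's header and first data row.
public class MyData
{
    private readonly DataFrame _df;
    private MyData(DataFrame df) => _df = df;

    public static MyData ReadCsv(string path) => new MyData(DataFrame.ReadCsv(path));

    public PrimitiveDataFrameColumn<int> population =>
        (PrimitiveDataFrameColumn<int>)_df["population"];
}
```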
That's a great idea, @jonsequitur!
@jonsequitur: I somehow missed this comment. How would I go about implementing your suggestion? I'd like to prototype it. Maybe there'd be a method to generate MyData in the DataFrame library, and the magic command would call this method (from where, though)?
@pgovind This would be a good fit for the extensibility story for dotnet-interactive. Since we don't have it documented yet, a quick chat might be the best way to get you started.
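To give a rough shape of what such an extension could look like (the extensibility APIs are still in preview and have shifted between builds, so treat the interface and signatures here as approximate):

```csharp
using System.CommandLine;
using System.Threading.Tasks;
using Microsoft.DotNet.Interactive;

public class CompileDataFrameExtension : IKernelExtension
{
    public Task OnLoadAsync(Kernel kernel)
    {
        // Registers the magic command. The handler (omitted) would read the
        // CSV header, generate the typed wrapper, and submit it back to the
        // kernel so the type exists for the rest of the session.
        var directive = new Command("#!compile-dataframe")
        {
            new Option<string>("--data"),
            new Option<string>("--type")
        };
        kernel.AddDirective(directive);
        return Task.CompletedTask;
    }
}
```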
FYI, you can try this out by installing the Microsoft.DotNet.Interactive.ExtensionLab package in a .NET Interactive notebook:
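For example, in a C# cell (the *-* wildcard pulls the latest preview build):

```csharp
#r "nuget: Microsoft.DotNet.Interactive.ExtensionLab, *-*"
```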
@jonsequitur - any thoughts on contributing that directly to the Microsoft.Data.Analysis package? That way users wouldn't need to install both.
https://github.com/dotnet/corefxlab/tree/master/src/Microsoft.Data.Analysis.Interactive
When using the DataFrame object, the current way to retrieve a column is with the df["ColumnName"] indexer. It would be very neat to instead be able to access a column as a property, e.g. df.ColumnName. This may be possible to do by creating a generic version of a DataFrame, e.g.

```csharp
var df = new DataFrame().ReadCsv(filePath);
```

I understand that the ReadCsv method is currently static, so I'm not sure whether this breaks a paradigm.
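To make the ask concrete, a sketch contrasting the two (the generic type is the proposal, not something that exists today):

```csharp
using Microsoft.Data.Analysis;

DataFrame df = DataFrame.ReadCsv("customers.csv");

// Today: columns are looked up by name and come back untyped.
DataFrameColumn firstNames = df["FirstName"];

// Proposed: a generic DataFrame exposes each column as a typed property.
DataFrame<Customer> typed = DataFrame<Customer>.ReadCsv("customers.csv");
var typedFirstNames = typed.FirstName;
```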