dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Question: Exploration through F# + DataFrame types #3692

Closed ykafia closed 2 years ago

ykafia commented 5 years ago

Will ML.NET have an API for data exploration?

By data exploration I mean statistics, selections, and filters over DataFrame-like objects.

glebuk commented 5 years ago

@Jaygem, ML.NET is a general machine learning framework, similar to scikit-learn, not a data manipulation library like pandas. Its inputs are a file, SQL, an IDataView, or an IEnumerable<T>, and it is designed from the ground up to work on streaming data. As a result, it's not really designed to be an equivalent of pandas in Python. There are other ways to do this in C#, such as LINQ, and perhaps some third-party libraries such as Deedle (I have not used the latter, so I have no opinion on it).
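As a minimal sketch of the streaming-oriented workflow described above (not from this thread; it assumes the Microsoft.ML NuGet package, and the row type and values are invented for illustration):

```csharp
// Hypothetical example: in-memory rows -> IDataView via MLContext, then a
// streaming filter. This is exploration in the ML.NET style, not a pandas-like
// in-memory DataFrame.
using System;
using System.Collections.Generic;
using Microsoft.ML;

public class HousingRow
{
    public float Size { get; set; }
    public float Price { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var rows = new List<HousingRow>
        {
            new HousingRow { Size = 1100f, Price = 250_000f },
            new HousingRow { Size = 1900f, Price = 410_000f },
            new HousingRow { Size = 2500f, Price = 540_000f },
        };

        var mlContext = new MLContext();

        // IEnumerable<T> is one of the input paths mentioned above.
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        // Row filtering exists, but it is lazy and streaming, not index-based.
        IDataView large = mlContext.Data.FilterRowsByColumn(data, "Size", lowerBound: 1500);

        foreach (var row in mlContext.Data.CreateEnumerable<HousingRow>(large, reuseRowObject: false))
            Console.WriteLine($"{row.Size} -> {row.Price}");
    }
}
```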

eerhardt commented 5 years ago

@Jaygem - Check out the discussion at https://github.com/dotnet/corefx/issues/26845. We are working on a prototype of a .NET DataFrame type in corefxlab:

https://github.com/dotnet/corefxlab/tree/master/src/Microsoft.Data and some of the PRs happening in that space:

https://github.com/dotnet/corefxlab/pull/2656 https://github.com/dotnet/corefxlab/pull/2660

We would really appreciate any feedback, contributions, etc. in this space. If you'd like to check it out, please let us know whether you find it useful. cc @pgovind

pgovind commented 5 years ago

Yup, we're looking for feedback in this space. Please feel free to comment on the original issue or on the PRs. If you work with pandas/DataFrames/data science every day, your input on how the DataFrame type is shaping up would be especially helpful.

veikkoeeva commented 5 years ago

My apologies if this is the wrong thread; I'm not sure whether extending the CoreFX discussion is appropriate for this, but maybe it adds some perspective here. The Arrow format mentioned in that long thread does indeed try to work around problems in data representation. The larger problem in the scientific community, as far as I understand, is object storage: new data keeps arriving, and streaming it effectively is hard when the formats are, well, what they are.

So in that sense, array and time-series data, and streaming of data, are the way to go. I would also like to draw a bit of attention to Zarr, e.g. at https://medium.com/pangeo/continuously-extending-zarr-datasets-c54fbad3967d:

The Pangeo Project has been exploring the analysis of climate data in the cloud. Our preferred format for storing data in the cloud is Zarr, due to its favorable interaction with object storage. Our first Zarr cloud datasets were static, but many real operational datasets need to be continuously updated, for example, extended in time. In this post, we will show how we can play with Zarr to append to an existing archive as new data becomes available.

The problem with live data

Earth observation data which originates from e.g. satellite-based remote sensing is produced continuously, usually with a latency that depends on the amount of processing that is required to generate something useful for the end user. When storing this kind of data, we obviously don’t want to create a new archive from scratch each time new data is produced, but instead append the new data to the same archive. If this is big data, we might not even want to stage the whole dataset on our local hard drive before uploading it to the cloud, but rather directly stream it there. The nice thing about Zarr is that the simplicity of its store file structure allows us to hack around and address this kind of issue. Recent improvements to Xarray will also ease this process.

As an example, the new ESA datahub works around this a bit: although the files are roughly 100 MiB chunks of netCDF (with the data organized as HDF5 inside them, I think), it offers an OData API that allows slicing into those files to retrieve specific dimensions over given time ranges. The dimensions are vectors of values in binary, and usually some other vector is needed to make sense of the data (e.g. points in time, coordinates).

It looks to me like some people are running into a similar kind of problem when using Orleans: new data is generated, it needs to be stored/appended, and hot data has to be separated from cold data (though occasionally one fetches the cold data too). On top of that there is processing that handles streamed data, and considerations about using AI as well.

Also interesting might be https://medium.com/pangeo/step-by-step-guide-to-building-a-big-data-portal-e262af1c2977.

allisterb commented 5 years ago

I've been working on a DataFrame library for F# using the DLR: https://notebooks.azure.com/allisterb/projects/sylvester/html/Sylvester.DataFrame.ipynb

luisquintanilla commented 2 years ago

Thanks, everyone, for the discussion and feedback. Since the last comments on this thread, we've introduced the Microsoft.Data.Analysis library, which brought DataFrames to .NET.

We plan to keep improving the library and are tracking feedback and progress in issue #6144.

For samples on using the library, check out the sample notebook.
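As a quick illustration (a minimal sketch, not taken from the sample notebook; it assumes the Microsoft.Data.Analysis package is referenced, and the column names and values are invented), the library covers the statistics, selections, and filters the original question asked about:

```csharp
// Hypothetical example of DataFrame-style exploration with Microsoft.Data.Analysis.
using System;
using Microsoft.Data.Analysis;

class Program
{
    static void Main()
    {
        var size = new PrimitiveDataFrameColumn<float>("Size", new[] { 1100f, 1900f, 2500f });
        var price = new PrimitiveDataFrameColumn<float>("Price", new[] { 250_000f, 410_000f, 540_000f });
        var df = new DataFrame(size, price);

        // Statistics: per-column aggregates.
        Console.WriteLine($"Mean price: {price.Mean()}, max size: {size.Max()}");

        // Selection + filter: a boolean mask picks rows where Size > 1500.
        PrimitiveDataFrameColumn<bool> mask = size.ElementwiseGreaterThan(1500f);
        DataFrame large = df.Filter(mask);

        foreach (DataFrameRow row in large.Rows)
            Console.WriteLine($"Size={row[0]}, Price={row[1]}");
    }
}
```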

Closing this issue.