dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
9.02k stars 1.88k forks source link

Sample for data from relational database? #1130

Closed thracian2015 closed 5 years ago

thracian2015 commented 6 years ago

Do we have a sample that demonstrates how to use ML.NET with data from a relational database, such as SQL Server? I don't need a complete sample but just how to pass such a dataset, instead of using TextLoader?

Also, from what I can see, the samples use LearningPipeline which I've read is deprecated. Do we have a sample that demonstrates the new API?

jwood803 commented 6 years ago

The README was just updated to include an example of the new API and to the cookbook that goes over it in some more detail.

Hope that helps!

thracian2015 commented 6 years ago

Thank you! This helps but do we have a sample demonstrating how to use data from a relational database? Is SqlDataReader good enough since it implements IEnumerable or is there a more efficient way?

jwood803 commented 6 years ago

If you want it for the learning pipeline, then the CollectionDataSource can take in an IEnumerable to be passed into the pipeline. I'm unsure if this is an option in the new API, however.

CESARDELATORRE commented 6 years ago

With the new API (Pigsty/Static/Typed) you do it like the following:

var env = new LocalEnvironment(); // Load data from a database like using Entity Framework or any other way IEnumerable myData = GetDataFromDB();

// Turn the data into the ML.NET IDataView. // Can use CreateStreamingDataView for IEnumerable or CreateDataView for IList var trainData = env.CreateStreamingDataView(myData);

More info here: https://github.com/dotnet/machinelearning/blob/master/docs/code/MlNetCookBook.md#what-if-my-training-data-is-not-in-a-text-file

thracian2015 commented 6 years ago

Thanks Cesar but apparently is more than that, which is why I asked for a sample. SqlDataReader reader = command.ExecuteReader(); DataTable dt = new DataTable(); dt.Load(reader); var trainData = env.CreateStreamingDataView(dt.AsEnumerable()); Exception "Could not determine an IDataView type for member RowState Parameter name: rawType'"

Do I need to cast each row to a custom class?

thracian2015 commented 6 years ago

I got it to work by using a custom list but now I have another issue. Basically, what I'm trying to predict is the customer probability to purchase a product. My features are demographic information, such as YearlyIncome, Education, Occupation. Note that some fields are integers or floats. Upon training the classification model, I get a 'Schema mismatch' error.

            trainData = env.CreateDataView(customers);
            var classification = new BinaryClassificationContext(env);
            var dynamicLearningPipeline = new CategoricalEstimator(env, "ProductBuyer")
                .Append(new ConcatEstimator(env, "Features", "MaritalStatus", "Gender", "YearlyIncome", "TotalChildren", "Education", "Occupation", "HouseOwnerFlag", "NumberCarsOwned", "CommuteDistance", "Age"))
                .Append(new FastTreeBinaryClassificationTrainer(env, "ProductBuyer", "Features", numTrees: 20));

var dynamicModel = dynamicLearningPipeline.Fit(trainData); ***Exception 'Schema mismatch for input column 'YearlyIncome': expected Text, got R4

Zruty0 commented 6 years ago

@thracian2015 , ML.NET learners expect features to come in the form of a float vector. Anything that's not a float vector will have to be explicitly converted, this process is known as 'feature extraction'.

Please take a look at our new cookbook for samples of how you can do this.

Zruty0 commented 6 years ago

In another note, using a custom list was the right choice at the time. We planned to have a SQL reader for a long time, but it hasn't yet materialized.

thracian2015 commented 6 years ago

Thanks Peter but I'm still lost. Your cookbook shows how to vectorize features using the TextLoader reader extensions. I'm not using TextLoader but DataReader (SqlDataReader to be precise). Do I need to decorate my custom class fields somehow to convert them to float vectors?

If it's easier, can I send the code sample to someone to take a look at?

veikkoeeva commented 6 years ago

@thracian2015 I think the part IEnumerable myData = GetDataFromDB(); is relevant in that you can take the result from DB as a IEnumerable<T>. Using SqlDataReader is a bit tedious unless you have used it before, so perhaps Entity Framework or Dapper would be a better option. If you want to use ADO.NET, an example could be https://blogs.msdn.microsoft.com/adonet/2012/07/15/using-sqldatareaders-new-async-methods-in-net-4-5-part-2-examples/ (see also part 1) -- or https://docs.microsoft.com/en-us/dotnet/framework/data/adonet/retrieving-data-using-a-datareader.

A bit tangential, but if you have a lot of data, you can pull data out from the DB record-by-record (chunk of records by chunk of records to be more precise) and have efficiency gains.

<edit: A fuller example looks like https://github.com/dotnet/orleans/blob/master/src/AdoNet/Shared/Storage/RelationalStorage.cs#L214 and some sugaring https://github.com/dotnet/orleans/blob/master/src/AdoNet/Shared/Storage/RelationalStorageExtensions.cs#L186. Notably IAsyncEnumerable isn't there, but maybe it shows how it could be added. It's not much code.

thracian2015 commented 6 years ago

@veikkoeeva I'm past loading data. That part works now with a custom list. What I have a problem with now is training the data. See my reply to @Zruty0

Zruty0 commented 6 years ago

I think in your dataset both MaritalStatus and Gender are strings, so you should have passed them through the CategoricalEstimator (together with all other categorical features).

The Concat should be concatenating floats, not columns of mixed types.

Vijay27anand commented 5 years ago

I got it to work by using a custom list but now I have another issue. Basically, what I'm trying to predict is the customer probability to purchase a product. My features are demographic information, such as YearlyIncome, Education, Occupation. Note that some fields are integers or floats. Upon training the classification model, I get a 'Schema mismatch' error.

            trainData = env.CreateDataView(customers);
            var classification = new BinaryClassificationContext(env);
            var dynamicLearningPipeline = new CategoricalEstimator(env, "ProductBuyer")
                .Append(new ConcatEstimator(env, "Features", "MaritalStatus", "Gender", "YearlyIncome", "TotalChildren", "Education", "Occupation", "HouseOwnerFlag", "NumberCarsOwned", "CommuteDistance", "Age"))
                .Append(new FastTreeBinaryClassificationTrainer(env, "ProductBuyer", "Features", numTrees: 20));

var dynamicModel = dynamicLearningPipeline.Fit(trainData); ***Exception 'Schema mismatch for input column 'YearlyIncome': expected Text, got R4

@thracian2015 can you show me how you used the Datatable to custom list to make it work with the CreateDataView

thracian2015 commented 5 years ago

Here is the code. Please post back if you get it to work:

        List<Customer> customers;
        IDataView trainData; 
        SqlConnection connection = new SqlConnection("Server=.;Initial Catalog = AdventureWorksDW2012;Integrated Security=true;");

        var env = new LocalEnvironment();

        // Turn the data into the ML.NET IDataView.
        // Can use CreateStreamingDataView for IEnumerable or CreateDataView for IList
        // https://stackoverflow.com/questions/1464883/how-can-i-easily-convert-datareader-to-listt
        //https://stackoverflow.com/questions/6939054/best-method-to-use-idatareader-as-ienumerablet

        //DataTable dt = new DataTable();
        //dt.Load(reader);

        //reader.Close();
        //DataRow[] rows = dt.Select();

        using (connection)
        {
            SqlCommand command = new SqlCommand("SELECT TOP 80 PERCENT * FROM vTargetMail;", connection);
            connection.Open();
            SqlDataReader reader = command.ExecuteReader();
            customers = Customer.FromReader(reader);

            reader.Close();

            // Turn the data into the ML.NET data view.
            // We can use CreateDataView or CreateStreamingDataView, depending on whether 'churnData' is an IList, or merely an IEnumerable.
            trainData = env.CreateDataView(customers);

            var classification = new BinaryClassificationContext(env);
            var dynamicLearningPipeline = new CategoricalEstimator(env, "BikeBuyer")
                .Append(new ConcatEstimator(env, "Features", "MaritalStatus", "Gender", "YearlyIncome", "TotalChildren", "Education", "Occupation", "HouseOwnerFlag", "NumberCarsOwned", "CommuteDistance", "Age"))
                //.Append(new ConcatEstimator(env, "Features", "MaritalStatus", "Gender", "Education", "Occupation", "HouseOwnerFlag", "CommuteDistance"))
                .Append(new FastTreeBinaryClassificationTrainer(env, "BikeBuyer", "Features", numTrees: 20));

            var dynamicModel = dynamicLearningPipeline.Fit(trainData);

        }
Vijay27anand commented 5 years ago

Here is the code. Please post back if you get it to work:

        List<Customer> customers;
        IDataView trainData; 
        SqlConnection connection = new SqlConnection("Server=.;Initial Catalog = AdventureWorksDW2012;Integrated Security=true;");

        var env = new LocalEnvironment();

        // Turn the data into the ML.NET IDataView.
        // Can use CreateStreamingDataView for IEnumerable or CreateDataView for IList
        // https://stackoverflow.com/questions/1464883/how-can-i-easily-convert-datareader-to-listt
        //https://stackoverflow.com/questions/6939054/best-method-to-use-idatareader-as-ienumerablet

        //DataTable dt = new DataTable();
        //dt.Load(reader);

        //reader.Close();
        //DataRow[] rows = dt.Select();

        using (connection)
        {
            SqlCommand command = new SqlCommand("SELECT TOP 80 PERCENT * FROM vTargetMail;", connection);
            connection.Open();
            SqlDataReader reader = command.ExecuteReader();
            customers = Customer.FromReader(reader);

            reader.Close();

            // Turn the data into the ML.NET data view.
            // We can use CreateDataView or CreateStreamingDataView, depending on whether 'churnData' is an IList, or merely an IEnumerable.
            trainData = env.CreateDataView(customers);

            var classification = new BinaryClassificationContext(env);
            var dynamicLearningPipeline = new CategoricalEstimator(env, "BikeBuyer")
                .Append(new ConcatEstimator(env, "Features", "MaritalStatus", "Gender", "YearlyIncome", "TotalChildren", "Education", "Occupation", "HouseOwnerFlag", "NumberCarsOwned", "CommuteDistance", "Age"))
                //.Append(new ConcatEstimator(env, "Features", "MaritalStatus", "Gender", "Education", "Occupation", "HouseOwnerFlag", "CommuteDistance"))
                .Append(new FastTreeBinaryClassificationTrainer(env, "BikeBuyer", "Features", numTrees: 20));

            var dynamicModel = dynamicLearningPipeline.Fit(trainData);

        }

@thracian2015 thanks for the example, i my case i don't have a well defined class like customer at design time. I have a query that is dynamic and I know the properties only know at runtime. Is there a way to load a dataset (rows) and corresponding schema (columns with types) to get the IDataview, similar to what is done with CSV file. As as work around i convert the dataset, columns to a CSV and then load. Is there a direct way for my use case @Zruty0, @CESARDELATORRE

Vijay27anand commented 5 years ago

@Zruty0 , @CESARDELATORRE i don't have a well defined class like customer at design time. I have a query that is dynamic and I know the properties only know at runtime, similar to get from a datatable. Is there a way to load a dataset (rows) and corresponding schema (columns with types) to get the IDataview, similar to what is done with CSV file. As as work around i convert the dataset, columns to a CSV and then load. Is there a direct way for my use case.

rogancarr commented 5 years ago

@Vijay27anand You should be able to do this with

public IDataView LoadFromEnumerable<TRow>(IEnumerable<TRow> data, SchemaDefinition schemaDefinition = null)

e.g.

var data = mlContext.Data.LoadFromEnumerable(myEnumerable, mySchemaDefinition);
Vijay27anand commented 5 years ago

@rogancarr thanks for the information, is this be part of your upcoming release 0.11, as i don't see in the 0.10 library.

rogancarr commented 5 years ago

@Vijay27anand It is in 0.10, but it is called ReadFromEnumerable. We have renamed it to LoadFromEnumerable for the v1.0 release as part of our naming-normalization work. You should be good to go without updating your nugets.

Vijay27anand commented 5 years ago

@rogancarr it worked for me, i have one more scenario to load from Datatable, but i am not able to load from datatable, i get exception.

var trainData1 = mlContext1.Data.ReadFromEnumerable(datatable.AsEnumerable(), columns);

System.ArgumentOutOfRangeException: 'No field or property with name 'CustomerID' found in type 'System.Data.DataRow' Parameter name: userSchemaDefinition'

CESARDELATORRE commented 5 years ago

Closing this issue since it is stale and we already have a database sample.

Check our latest Database sample here and provide feedback since this is an area (training from data in databases) we want to evolve after ML.NET 1.1:

https://github.com/dotnet/machinelearning-samples/tree/master/samples/csharp/getting-started/DatabaseIntegration