Closed thracian2015 closed 5 years ago
Thank you! This helps but do we have a sample demonstrating how to use data from a relational database? Is SqlDataReader good enough since it implements IEnumerable or is there a more efficient way?
If you want it for the learning pipeline, then the CollectionDataSource
can take in an IEnumerable
to be passed into the pipeline. I'm unsure if this is an option in the new API, however.
With the new API (Pigsty/Static/Typed) you do it like the following:
var env = new LocalEnvironment();
// Load data from a database like using Entity Framework or any other way
IEnumerable
// Turn the data into the ML.NET IDataView. // Can use CreateStreamingDataView for IEnumerable or CreateDataView for IList var trainData = env.CreateStreamingDataView(myData);
More info here: https://github.com/dotnet/machinelearning/blob/master/docs/code/MlNetCookBook.md#what-if-my-training-data-is-not-in-a-text-file
Thanks Cesar but apparently is more than that, which is why I asked for a sample. SqlDataReader reader = command.ExecuteReader(); DataTable dt = new DataTable(); dt.Load(reader); var trainData = env.CreateStreamingDataView(dt.AsEnumerable()); Exception "Could not determine an IDataView type for member RowState Parameter name: rawType'"
Do I need to cast each row to a custom class?
I got it to work by using a custom list but now I have another issue. Basically, what I'm trying to predict is the customer probability to purchase a product. My features are demographic information, such as YearlyIncome, Education, Occupation. Note that some fields are integers or floats. Upon training the classification model, I get a 'Schema mismatch' error.
trainData = env.CreateDataView(customers);
var classification = new BinaryClassificationContext(env);
var dynamicLearningPipeline = new CategoricalEstimator(env, "ProductBuyer")
.Append(new ConcatEstimator(env, "Features", "MaritalStatus", "Gender", "YearlyIncome", "TotalChildren", "Education", "Occupation", "HouseOwnerFlag", "NumberCarsOwned", "CommuteDistance", "Age"))
.Append(new FastTreeBinaryClassificationTrainer(env, "ProductBuyer", "Features", numTrees: 20));
var dynamicModel = dynamicLearningPipeline.Fit(trainData); ***Exception 'Schema mismatch for input column 'YearlyIncome': expected Text, got R4
@thracian2015 , ML.NET learners expect features to come in the form of a float vector. Anything that's not a float vector will have to be explicitly converted, this process is known as 'feature extraction'.
Please take a look at our new cookbook for samples of how you can do this.
In another note, using a custom list was the right choice at the time. We planned to have a SQL reader for a long time, but it hasn't yet materialized.
Thanks Peter but I'm still lost. Your cookbook shows how to vectorize features using the TextLoader reader extensions. I'm not using TextLoader but DataReader (SqlDataReader to be precise). Do I need to decorate my custom class fields somehow to convert them to float vectors?
If it's easier, can I send the code sample to someone to take a look at?
@thracian2015 I think the part IEnumerable myData = GetDataFromDB();
is relevant in that you can take the result from DB as a IEnumerable<T>
. Using SqlDataReader
is a bit tedious unless you have used it before, so perhaps Entity Framework or Dapper would be a better option. If you want to use ADO.NET, an example could be https://blogs.msdn.microsoft.com/adonet/2012/07/15/using-sqldatareaders-new-async-methods-in-net-4-5-part-2-examples/ (see also part 1) -- or https://docs.microsoft.com/en-us/dotnet/framework/data/adonet/retrieving-data-using-a-datareader.
A bit tangential, but if you have a lot of data, you can pull data out from the DB record-by-record (chunk of records by chunk of records to be more precise) and have efficiency gains.
<edit: A fuller example looks like https://github.com/dotnet/orleans/blob/master/src/AdoNet/Shared/Storage/RelationalStorage.cs#L214 and some sugaring https://github.com/dotnet/orleans/blob/master/src/AdoNet/Shared/Storage/RelationalStorageExtensions.cs#L186. Notably IAsyncEnumerable
isn't there, but maybe it shows how it could be added. It's not much code.
@veikkoeeva I'm past loading data. That part works now with a custom list. What I have a problem with now is training the data. See my reply to @Zruty0
I think in your dataset both MaritalStatus
and Gender
are strings, so you should have passed them through the CategoricalEstimator
(together with all other categorical features).
The Concat
should be concatenating floats, not columns of mixed types.
I got it to work by using a custom list but now I have another issue. Basically, what I'm trying to predict is the customer probability to purchase a product. My features are demographic information, such as YearlyIncome, Education, Occupation. Note that some fields are integers or floats. Upon training the classification model, I get a 'Schema mismatch' error.
trainData = env.CreateDataView(customers); var classification = new BinaryClassificationContext(env); var dynamicLearningPipeline = new CategoricalEstimator(env, "ProductBuyer") .Append(new ConcatEstimator(env, "Features", "MaritalStatus", "Gender", "YearlyIncome", "TotalChildren", "Education", "Occupation", "HouseOwnerFlag", "NumberCarsOwned", "CommuteDistance", "Age")) .Append(new FastTreeBinaryClassificationTrainer(env, "ProductBuyer", "Features", numTrees: 20));
var dynamicModel = dynamicLearningPipeline.Fit(trainData); ***Exception 'Schema mismatch for input column 'YearlyIncome': expected Text, got R4
@thracian2015 can you show me how you used the Datatable to custom list to make it work with the CreateDataView
Here is the code. Please post back if you get it to work:
List<Customer> customers;
IDataView trainData;
SqlConnection connection = new SqlConnection("Server=.;Initial Catalog = AdventureWorksDW2012;Integrated Security=true;");
var env = new LocalEnvironment();
// Turn the data into the ML.NET IDataView.
// Can use CreateStreamingDataView for IEnumerable or CreateDataView for IList
// https://stackoverflow.com/questions/1464883/how-can-i-easily-convert-datareader-to-listt
//https://stackoverflow.com/questions/6939054/best-method-to-use-idatareader-as-ienumerablet
//DataTable dt = new DataTable();
//dt.Load(reader);
//reader.Close();
//DataRow[] rows = dt.Select();
using (connection)
{
SqlCommand command = new SqlCommand("SELECT TOP 80 PERCENT * FROM vTargetMail;", connection);
connection.Open();
SqlDataReader reader = command.ExecuteReader();
customers = Customer.FromReader(reader);
reader.Close();
// Turn the data into the ML.NET data view.
// We can use CreateDataView or CreateStreamingDataView, depending on whether 'churnData' is an IList, or merely an IEnumerable.
trainData = env.CreateDataView(customers);
var classification = new BinaryClassificationContext(env);
var dynamicLearningPipeline = new CategoricalEstimator(env, "BikeBuyer")
.Append(new ConcatEstimator(env, "Features", "MaritalStatus", "Gender", "YearlyIncome", "TotalChildren", "Education", "Occupation", "HouseOwnerFlag", "NumberCarsOwned", "CommuteDistance", "Age"))
//.Append(new ConcatEstimator(env, "Features", "MaritalStatus", "Gender", "Education", "Occupation", "HouseOwnerFlag", "CommuteDistance"))
.Append(new FastTreeBinaryClassificationTrainer(env, "BikeBuyer", "Features", numTrees: 20));
var dynamicModel = dynamicLearningPipeline.Fit(trainData);
}
Here is the code. Please post back if you get it to work:
List<Customer> customers; IDataView trainData; SqlConnection connection = new SqlConnection("Server=.;Initial Catalog = AdventureWorksDW2012;Integrated Security=true;"); var env = new LocalEnvironment(); // Turn the data into the ML.NET IDataView. // Can use CreateStreamingDataView for IEnumerable or CreateDataView for IList // https://stackoverflow.com/questions/1464883/how-can-i-easily-convert-datareader-to-listt //https://stackoverflow.com/questions/6939054/best-method-to-use-idatareader-as-ienumerablet //DataTable dt = new DataTable(); //dt.Load(reader); //reader.Close(); //DataRow[] rows = dt.Select(); using (connection) { SqlCommand command = new SqlCommand("SELECT TOP 80 PERCENT * FROM vTargetMail;", connection); connection.Open(); SqlDataReader reader = command.ExecuteReader(); customers = Customer.FromReader(reader); reader.Close(); // Turn the data into the ML.NET data view. // We can use CreateDataView or CreateStreamingDataView, depending on whether 'churnData' is an IList, or merely an IEnumerable. trainData = env.CreateDataView(customers); var classification = new BinaryClassificationContext(env); var dynamicLearningPipeline = new CategoricalEstimator(env, "BikeBuyer") .Append(new ConcatEstimator(env, "Features", "MaritalStatus", "Gender", "YearlyIncome", "TotalChildren", "Education", "Occupation", "HouseOwnerFlag", "NumberCarsOwned", "CommuteDistance", "Age")) //.Append(new ConcatEstimator(env, "Features", "MaritalStatus", "Gender", "Education", "Occupation", "HouseOwnerFlag", "CommuteDistance")) .Append(new FastTreeBinaryClassificationTrainer(env, "BikeBuyer", "Features", numTrees: 20)); var dynamicModel = dynamicLearningPipeline.Fit(trainData); }
@thracian2015 thanks for the example, i my case i don't have a well defined class like customer at design time. I have a query that is dynamic and I know the properties only know at runtime. Is there a way to load a dataset (rows) and corresponding schema (columns with types) to get the IDataview, similar to what is done with CSV file. As as work around i convert the dataset, columns to a CSV and then load. Is there a direct way for my use case @Zruty0, @CESARDELATORRE
@Zruty0 , @CESARDELATORRE i don't have a well defined class like customer at design time. I have a query that is dynamic and I know the properties only know at runtime, similar to get from a datatable. Is there a way to load a dataset (rows) and corresponding schema (columns with types) to get the IDataview, similar to what is done with CSV file. As as work around i convert the dataset, columns to a CSV and then load. Is there a direct way for my use case.
@Vijay27anand You should be able to do this with
public IDataView LoadFromEnumerable<TRow>(IEnumerable<TRow> data, SchemaDefinition schemaDefinition = null)
e.g.
var data = mlContext.Data.LoadFromEnumerable(myEnumerable, mySchemaDefinition);
@rogancarr thanks for the information, is this be part of your upcoming release 0.11, as i don't see in the 0.10 library.
@Vijay27anand It is in 0.10, but it is called ReadFromEnumerable
. We have renamed it to LoadFromEnumerable
for the v1.0 release as part of our naming-normalization work. You should be good to go without updating your nugets.
@rogancarr it worked for me, i have one more scenario to load from Datatable, but i am not able to load from datatable, i get exception.
var trainData1 = mlContext1.Data.ReadFromEnumerable(datatable.AsEnumerable(), columns);
System.ArgumentOutOfRangeException: 'No field or property with name 'CustomerID' found in type 'System.Data.DataRow' Parameter name: userSchemaDefinition'
Closing this issue since it is stale and we already have a database sample.
Check our latest Database sample here and provide feedback since this is an area (training from data in databases) we want to evolve after ML.NET 1.1:
Do we have a sample that demonstrates how to use ML.NET with data from a relational database, such as SQL Server? I don't need a complete sample but just how to pass such a dataset, instead of using TextLoader?
Also, from what I can see, the samples use LearningPipeline which I've read is deprecated. Do we have a sample that demonstrates the new API?