dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
9.04k stars 1.89k forks source link

Concatenate 2 columns as a label #282

Closed PirateNick closed 5 years ago

PirateNick commented 6 years ago

I've been wondering if it was possible to concatenate 2 columns of the datatype string into the label column.

What I tried was: pipeline.Add(new ColumnConcatenator("Label", "string1", "string2"));

But that just spits out a V2(text, 2). And a Label must be of type R4-R8.

The reason I need this is because I only have 2 input variables (day and time) and I want to use regression to determine which is the best. I also stumbled upon LabelToFloat which I thought would convert this to a suitable type, but it didnt.

Anyone know how to solve this? Thanks!

Ivanidzo4ka commented 6 years ago

Regression problem predicts some continues value, in our cases we prefer to present this values as single or double precision float values. In your particular case I would probably suggest to convert your date and time strings value into standard DateTime and use DateTime.Ticks as Label column, or amount of minutes starting from certain date, or hours, it depends how precise you want your prediction.

This is in case if you have your data locally, if you use TextLoader you can read your data as DateTime and use Transforms.ColumnTypeConverter to cast it into ticks.

TomFinley commented 6 years ago

Hi @NickPeelman , could you spare a few more words on what you expected to happen when you concatenated the two columns? In particular, a focus on how you would have preferred a machine learning algorithm to interpret this label would be appreciated. The reason I ask is, from most machine learning algorithm's perspective, a vector with two strings is not meaningful input as a label. We may then be able to suggest a specific transform chain, to transform this into a meaningful label. And even failing that we'd be able to tell one area where we are failing to address your scenario.

PirateNick commented 6 years ago

I'm not entirely sure if ML is the right approach (or maybe overkill). But I have a certain amount of data with days and hours. Based on those values I want to predict the best date for the current time.

For example: the dataset contains a few more monday's than any other day, the algorithm should return monday, and the same for the time. I want to be able to predict the most used or active time.

That's why I thought the hour and day should be merged as a label, since that's the key that distinguishes each row of data.

I could generate a third column in my dataset like 'monday, 12:50, monday12:50' and use the last one as a label, but that just seems dodgy.

GalOshri commented 6 years ago

Could you please expand a bit on "predict the best date for the current time"?

Perhaps one way to think about it is "what is the input to the predictive function you want to build, and what is the expected output?". It sounds like you are trying to predict the day and time (label) that some event happens, but it is not clear what is the input.

PirateNick commented 6 years ago

I continued finding a solution, and when providing more data I noticed the algorithm worked much better.

My input data now looks like:

Age,Sex,Day,Hour

For each time a user clicks on an 'event', his age, sex, the current day and hour gets logged. From all those logged entries, I want to predict the best time for the current day for a given age and sex.

So, when I gathered a lot of entries I want ot use that as a base model, so I can try to predict when a user of a certian age and sex is more likely to 'click' on a certain event:

        public class TimeSuggestionData
    {
        [Column(ordinal: "0")]
        public float age;
        [Column(ordinal: "1")]
        public float sex;
        [Column(ordinal: "2")]
        public float day;
        [Column(ordinal: "3", name: "Label")]
        public float time;
    }
        public class TimeSuggestionPrediction
    {
        [ColumnName("Score")]
        public float time;
    }
        var pipeline = new LearningPipeline();
        pipeline.Add(new TextLoader<TimeSuggestionData>(modelPath, separator: ","));
        pipeline.Add(new CategoricalOneHotVectorizer("age", "sex", "day"));
        pipeline.Add(new ColumnConcatenator("Features", "age", "sex", "day"));
        pipeline.Add(new FastTreeRegressor());
        model = pipeline.Train<TimeSuggestionData, TimeSuggestionPrediction>();

When I generated some random entries (about 10.000) I kind of get the right amount of predictions. (All merely the same, cause random isn't that random I suppose).

But when I tried to add only 10 ish rows with very specific data, 10 entries like:

Age,Sex,Day,Hour 30,0,monday,12 30,0,monday,12 30,0,monday,12 30,0,monday,12 30,0,monday,12 30,0,monday,12 30,0,monday,12

and do a prediction on the trained model with the exact same parameters, I get a prediction back with time: 0. I thought it woudl return 12, since it's the only (and best) value.

EDIT: sex is a float because I use 0 as male and 1 as female

GalOshri commented 6 years ago

I expect the issue with only having 10 examples in the training data is that there aren't enough examples to train the model. I just tried training on your example data (<10 rows) and see the following warning:

Warning: 100 of the boosting iterations failed to grow a tree. This is commonly because the minimum documents in leaf hyperparameter was set too high for this dataset.

Do you see the same warning? This could explain why the model is not providing meaningful predictions.

sfilipi commented 5 years ago

Closing, since the initial question is resolved, and training on 10 data points is not a realistic scenario. Please feel free to open another issue if you are not reaching the right accuracy.