RubixML / ML

A high-level machine learning and deep learning library for the PHP language.
https://rubixml.com
MIT License

Old obsolete info #28

Closed realrecordzLab closed 4 years ago

realrecordzLab commented 5 years ago

As an exercise to get started with machine learning in PHP using this library, I want to create a simple project that uses past lotto numbers to predict whether a randomly generated number sequence contains some winning numbers, and the probability that the sequence will be drawn. For now my dataset consists of last year's winning number series, but I plan to expand it to cover the last three years of winning draws. My question is how to apply the library's features: in particular, which feature of the library is the best fit, and how do I train the model correctly? The dataset is unlabeled, because I only have the winning number series and no label that classifies them. I'm reading the documentation and have read some of the tutorials, but some help on how to get started with this problem would be appreciated.

andrewdalpino commented 5 years ago

Hi @realrecordzLab thanks for the great question, it's an interesting problem

Let's take a step back for a moment to consider the 5 types of ML that are offered in Rubix

The first class of learners to consider are the supervised ones - which consist of Classifiers and Regressors

Supervised learners work by learning a mapping of the input signal to an output prediction using the information found in a separate training signal (i.e. labels).

The second class of learners are the unsupervised ones - which consist of Clusterers, Anomaly Detectors, and Embedders (or manifold learners).

Unsupervised learners seek to learn something about the underlying distribution of the data without a training signal (or labels).

Now that the basics are out of the way, we need to formulate the problem as a machine learning one

Firstly, I have to point out that you are making an assumption that the probability of winning can be determined by the underlying structure (distribution) of numbers. That may be a correct assumption, or it may not be (The lottery is supposed to be random).

If that is not a correct assumption, the best answer that you can get from an estimator is "How likely was this number series to have won in the past?" In other words, we cannot generalize to the future.

If you are ok with making that assumption then here is how you would build a probabilistic classifier in Rubix ...

  1. Generate a set of synthetic lottery number series that have NOT won in the past and label them
  2. Label the lottery numbers that won as such
  3. Combine the synthetic data with the real data to form a Labeled training set
  4. Pick a Probabilistic classifier to experiment with such as Gaussian Naive Bayes or Logistic Regression
  5. Train the learner
  6. Use Cross Validation to determine how well the model generalizes
  7. Iterate (back to step 4)
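Steps 1 to 3 can be sketched in plain PHP before any library code is involved. This is only a minimal illustration under stated assumptions: the game format (8 distinct numbers from 1 to 90), the helper names `randomSeries` and `buildTrainingSet`, and the two example draws are all hypothetical and not taken from the library.

```php
<?php
// Assumed game format: 8 distinct numbers drawn from 1..90.
const DRAW_SIZE = 8;
const MAX_NUMBER = 90;

/** Build a sorted series of distinct random numbers (hypothetical helper). */
function randomSeries(int $size, int $max): array
{
    $numbers = range(1, $max);
    shuffle($numbers);
    $series = array_slice($numbers, 0, $size);
    // Sorted to match the ordering of the example winners below;
    // use whatever ordering your real data follows.
    sort($series);
    return $series;
}

/** Steps 1-3: generate series that never won, label both groups, combine. */
function buildTrainingSet(array $wonSamples, int $lostCount): array
{
    // Index past winners by a canonical key so they can be excluded.
    $seen = [];
    foreach ($wonSamples as $series) {
        sort($series); // sorts a copy, the input array is untouched
        $seen[implode('-', $series)] = true;
    }

    $lostSamples = [];
    while (count($lostSamples) < $lostCount) {
        $series = randomSeries(DRAW_SIZE, MAX_NUMBER);
        $key = implode('-', $series);
        if (!isset($seen[$key])) {
            $seen[$key] = true; // also prevents duplicate synthetic series
            $lostSamples[] = $series;
        }
    }

    // Samples stay a 2-D array, labels stay a flat array of strings.
    $samples = array_merge($wonSamples, $lostSamples);
    $labels = array_merge(
        array_fill(0, count($wonSamples), 'won'),
        array_fill(0, count($lostSamples), 'lost')
    );
    return [$samples, $labels];
}

// Hypothetical past draws, for illustration only.
$wonSamples = [
    [3, 11, 24, 37, 45, 58, 62, 90],
    [1, 9, 17, 33, 41, 56, 70, 88],
];

[$samples, $labels] = buildTrainingSet($wonSamples, 4);
printf("%d samples, %d labels\n", count($samples), count($labels));
```

The resulting `$samples` and `$labels` arrays have the shape a labeled dataset expects, so steps 4 to 7 can start from them.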

I'll be happy to answer any further questions

realrecordzLab commented 5 years ago

@andrewdalpino thanks for the reply. I'm making some modifications to the code by following your suggestions.

Generate a set of synthetic lottery number series that have NOT won in the past and label them

I'm using the PHP mt_rand function to generate a series of eight numbers in the range 1 to 90. I don't know what label to assign; any suggestion would be useful. As you suggested, I'm now assigning the "won" label to the number series that were drawn in the past.

Combine the synthetic data with the real data to form a Labeled training set

What do you mean by combine? Do I need to put the generated number sequences inside the dataset?

andrewdalpino commented 5 years ago

I'm using the PHP mt_rand function to generate a series of eight numbers in the range 1 to 90. I don't know what label to assign; any suggestion would be useful.

What about 'lost'?

The idea is that we need to show examples of both winning and losing lottery numbers so that the learner can form a distinction (if there truly is one)

Combine the synthetic data with the real data to form a Labeled training set

Essentially what you're doing here is mixing the 'won' and 'lost' samples together in a single dataset object to pass to the learner

Check out the docs on Dataset Objects

Here's an example

use Rubix\ML\Datasets\Labeled;

// Import the won samples, and generate the lost samples

$wonLabels = array_fill(0, count($wonSamples), 'won');

$won = new Labeled($wonSamples, $wonLabels);

$lostLabels = array_fill(0, count($lostSamples), 'lost');

$lost = new Labeled($lostSamples, $lostLabels);

$dataset = $won->append($lost); // Combine them

[$training, $testing] = $dataset->randomize()->stratifiedSplit(0.8);

Remember that samples are a multidimensional array where each sample is an array of (integer) lottery numbers, and the labels are a flat array of string labels

realrecordzLab commented 5 years ago

I'm implementing the code with your suggestions. My only problem is with the labels: I get the following error, and a suggestion for fixing it would be appreciated.

Uncaught InvalidArgumentException: Label must be a string or numeric type, array found

Thanks for the suggestions. It would be nice if a tutorial on this topic were added to the library documentation. I also think another interesting topic that could be tackled with this library is predicting football matches. Preparing the dataset is a bit complex, but it would cover many aspects of how this useful library can be used.

andrewdalpino commented 5 years ago

It sounds like you are passing an array of arrays as the labels instead of an array of strings

array_fill(0, count($wonSamples), 'won');

... should return a flat array of strings
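One quick way to check this without the library is to scan the labels for anything that is not a string or a number. The sketch below is plain PHP; the helper name `findInvalidLabels` and the sample values are hypothetical, and the check only mirrors what the error message says rather than the library's own validation code.

```php
<?php
/** Return the offsets of labels that are not strings or numbers. */
function findInvalidLabels(array $labels): array
{
    $invalid = [];
    foreach ($labels as $offset => $label) {
        if (!is_string($label) && !is_int($label) && !is_float($label)) {
            $invalid[] = $offset;
        }
    }
    return $invalid;
}

$wonSamples = [[3, 11, 24], [1, 9, 17]]; // hypothetical samples

// Flat array of strings: one label per sample.
$good = array_fill(0, count($wonSamples), 'won');

// A possible mistake: each label wrapped in its own array.
$bad = array_fill(0, count($wonSamples), ['won']);

var_dump(findInvalidLabels($good)); // empty array
var_dump(findInvalidLabels($bad));  // offsets 0 and 1
```

If the helper reports any offsets, the labels array is nested somewhere and needs to be flattened before it is handed to the dataset constructor.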

If you post your code, perhaps we can locate the issue

The documentation that you are suggesting can be found in the current documentation under Labeled datasets

If you think this documentation is not clear, I'm open to suggestions or PRs to make it more clear

Predicting the outcome of a football game sounds like an interesting problem as well

realrecordzLab commented 5 years ago

@andrewdalpino

$winSamples = [];
$lostSamples = [];
$winLabels = [];
$lostLabels = [];

//load the CSV document from a file path
$csv = Reader::createFromPath('test.csv', 'r');
$csv->setDelimiter(';');
$csv->setHeaderOffset(0);

foreach( $csv->getRecords() as $i => $records){
  unset($records['concorso']);
  foreach (array_chunk($records, 8) as $record) {
    $winSamples[] = $record;
    for( $n = 0; $n < 8; $n++ ){
      $lostSamples[$i][] = mt_rand(1, 90);
    }
  }
}

$winLabels = array_fill(0, count($winSamples), 'win');

$win = new Labeled($winSamples, $winLabels);

$lostLabels = array_fill(0, count($lostSamples), 'lost');

$lost = new Labeled($lostSamples, $lostLabels);

$dataset = $win->append($lost);

[$training, $testing] = $dataset->randomize()->stratifiedSplit(0.8);

var_dump($winLabels);

//$dataset = Unlabeled::build($samples);
// Labeled
$estimator = new KDNeighborsRegressor(5, new Minkowski(4.0), true, 30);
//$estimator = new SoftmaxClassifier(256, new Momentum(0.001), 1e-4, 300, 1e-4, new CrossEntropy());
//$estimator = new SVC(1.0, new Linear(), true, 1e-3, 100.);

#$estimator = new DBSCAN(4.0, 5, new Diagonal(), 20);
$estimator->train($dataset);

var_dump($estimator->trained());

This is my code. I ran into several problems and errors, maybe because I need to become more familiar with the estimators. I need to take an input sequence and then output a prediction based on the training. This is what I will implement in the code.

Fatal error: Uncaught InvalidArgumentException: Estimator is not compatible with the data types given. I've also found a solution for this error. It was caused by the CSV reader importing all contents as strings. I solved it by converting the integer values from strings to numbers.
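The string-to-number conversion described here can be sketched in plain PHP as follows. The helper name `castNumericStrings` and the sample rows are hypothetical; the sketch assumes each CSV record has already been read into an array of strings.

```php
<?php
/** Cast every numeric string in a 2-D sample array to an int or float. */
function castNumericStrings(array $samples): array
{
    return array_map(
        static fn (array $row): array => array_map(
            // Adding 0 turns "3" into int 3 and "2.5" into float 2.5.
            static fn ($value) => is_string($value) && is_numeric($value)
                ? $value + 0
                : $value,
            array_values($row) // drop CSV header keys, keep positions only
        ),
        $samples
    );
}

// Rows as a CSV reader would return them: every field is a string.
$raw = [['3', '11', '24'], ['1', '9', '17']];

$samples = castNumericStrings($raw);

var_dump($samples[0][0]); // int(3)
```

Non-numeric strings are left untouched, so categorical columns would survive the conversion.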

Predicting the outcome of a football game sounds like an interesting problem as well

Yes, this is why I'm focusing on this ML exercise. Football data is a bit different to analyze, so preparing the dataset features is an important step before starting to train the model.

andrewdalpino commented 5 years ago

Hi @realrecordzLab

It looks like you are on the right track, keep experimenting and things will become clearer

Yes, League CSV treats every field as a string

The Numeric String Converter was designed for this circumstance

Ex.

use Rubix\ML\Transformers\NumericStringConverter;

$dataset->apply(new NumericStringConverter());

Keep us up to date on how you are using the library and as always I'll be happy to answer any further questions about the library

realrecordzLab commented 5 years ago

Yes, this is the solution I've applied. I've also used Params::ints() to generate the random numbers I need to test the script (the unlabeled input sequence) and the lost samples (I saved them in a CSV named random_lotto_numbers.csv using the League CSV writer). Now I have cleaner code that uses the helper features available in the library.

@andrewdalpino I will be happy to share the code as soon as I've finished it. I have just a couple of questions about two features I'm going to implement. The first is about estimator training. I've read in the docs that it is possible to train an estimator with new data if needed. In my case, I want to pass the contents of a new CSV file to the estimator; if it's compatible with the Online interface, I can use the partial() method. How exactly does this feature work?

The second question is about saving the model. As I understand it, every time I refresh the page to test the script for new predictions, the estimator is retrained (I'm not sure about this). I've read about persisters; if I'm not wrong, they are used to save the trained estimator and its model, right? An example would be appreciated.

andrewdalpino commented 5 years ago

Hey @realrecordzLab I'm going to answer your questions in reverse order because the second answer will make more sense

The second question is about saving the model. As I understand it, every time I refresh the page to test the script for new predictions, the estimator is retrained (I'm not sure about this). I've read about persisters; if I'm not wrong, they are used to save the trained estimator and its model, right? An example would be appreciated.

Correct, Persisters will handle serializing and storing the estimator - I would recommend wrapping the estimator in a Persistent Model meta-estimator for a nice save() and load() API

Here is a tutorial using model persistence https://github.com/RubixML/Credit

The first is about estimator training. I've read in the docs that it is possible to train an estimator with new data if needed. In my case, I want to pass the contents of a new CSV file to the estimator; if it's compatible with the Online interface, I can use the partial() method. How exactly does this feature work?

You read correctly! It works just like the train() method, except that it takes off where the last training round left off - here is a trivial example

$folds = $dataset->fold(3);

$estimator->train($folds[0]);

$estimator->partial($folds[1]);

$estimator->partial($folds[2]);

Here are the docs for the Online interface
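To make the difference between train() and partial() concrete, here is a toy illustration in plain PHP. It is not how any Rubix estimator is implemented: the `RunningMean` class is a hypothetical stand-in that "learns" a single running mean, chosen only because its incremental update is easy to follow.

```php
<?php
/** Toy online learner: keeps a running mean of a single numeric feature. */
final class RunningMean
{
    private int $count = 0;
    private float $mean = 0.0;

    /** Start from scratch, discarding anything learned before. */
    public function train(array $values): void
    {
        $this->count = 0;
        $this->mean = 0.0;
        $this->partial($values);
    }

    /** Continue from the current state using only the new values. */
    public function partial(array $values): void
    {
        foreach ($values as $value) {
            $this->count++;
            // Incremental mean update: no need to revisit older values.
            $this->mean += ($value - $this->mean) / $this->count;
        }
    }

    public function mean(): float
    {
        return $this->mean;
    }
}

$learner = new RunningMean();

$learner->train([2, 4]);   // mean of the first batch is 3.0
$learner->partial([6, 8]); // mean of all four values is 5.0

echo $learner->mean(), PHP_EOL;
```

Calling train() a second time would throw away the first batch, whereas partial() folds the new batch into what was already learned.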

realrecordzLab commented 5 years ago

You read correctly! It works just like the train() method, except that it takes off where the last training round left off

@andrewdalpino perfect, this will let me feed the model more and more user-submitted data. I had tried the Online interface, but I wasn't sure whether the trained model kept learning from the newly added data. The only difference is in my implementation: I'm not folding the dataset, but I assume this isn't a problem?

Correct, Persisters will handle serializing and storing the estimator - I would recommend wrapping the estimator in a Persistent Model meta-estimator for a nice save() and load() API

Great, I will try to implement this feature. I want the user to get a prediction after entering a number series, which means the estimator needs to be already trained and will then learn from the new data.

andrewdalpino commented 5 years ago

@realrecordzLab

The only difference is in my implementation: I'm not folding the dataset, but I assume this isn't a problem?

I just used a folded dataset as a trivial example, in reality, those will be datasets collected at different times in the future

Do you plan to serve the model? If so, the Rubix Server library runs it as a daemon in a separate process with its own network stack

This is the preferred way of serving the model as it doesn't require loading into memory for every request which can be expensive if you have a big model

realrecordzLab commented 5 years ago

the Rubix Server library runs it as a daemon in a separate process with its own network stack This is the preferred way of serving the model as it doesn't require loading into memory for every request which can be expensive if you have a big model

I will give it a try. I don't know if it will work, because for now the script runs on a shared host, but I think this won't be a problem.

henrique-borba commented 5 years ago

Hi,

I'm not as good as Andrew with ML algorithms, but I couldn't stop myself from commenting on this issue.

I don't think this is a good training exercise for you @realrecordzLab. The problem lies in the lotto mechanism itself: every number sequence in a traditional lottery is random, so the outcome is inherently unpredictable.

That said, I guess your accuracy, loss and predictions won't make sense after all. It's like feeding an ML algorithm a random feature matrix with a random label vector.

If you are practicing, try some predictable problems and avoid random features at all costs. A single random feature can also f*** up your "predictable problem", because if it doesn't make sense even to our brains, it won't make sense to ML.

Correct me if I'm wrong @andrewdalpino

Best Regards
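The "random feature matrix with a random label vector" point can be checked with a small experiment. The sketch below is plain PHP and does not use the library: it swaps in a hand-written 1-nearest-neighbor classifier (the helpers `randomDataset` and `nearestNeighbor` are hypothetical names) and trains it on random 8-number series with coin-flip labels, then measures accuracy on a second random set.

```php
<?php
mt_srand(42); // fixed seed so the run is repeatable

/** Generate $n samples of $dim random integers with coin-flip labels. */
function randomDataset(int $n, int $dim): array
{
    $samples = [];
    $labels = [];
    for ($i = 0; $i < $n; $i++) {
        $row = [];
        for ($j = 0; $j < $dim; $j++) {
            $row[] = mt_rand(1, 90);
        }
        $samples[] = $row;
        $labels[] = mt_rand(0, 1) === 1 ? 'won' : 'lost';
    }
    return [$samples, $labels];
}

/** Predict the label of the closest training sample (squared Euclidean distance). */
function nearestNeighbor(array $trainX, array $trainY, array $query): string
{
    $best = PHP_INT_MAX;
    $label = $trainY[0];
    foreach ($trainX as $i => $row) {
        $distance = 0;
        foreach ($row as $j => $value) {
            $distance += ($value - $query[$j]) ** 2;
        }
        if ($distance < $best) {
            $best = $distance;
            $label = $trainY[$i];
        }
    }
    return $label;
}

[$trainX, $trainY] = randomDataset(500, 8);
[$testX, $testY] = randomDataset(500, 8);

$correct = 0;
foreach ($testX as $i => $query) {
    if (nearestNeighbor($trainX, $trainY, $query) === $testY[$i]) {
        $correct++;
    }
}

$accuracy = $correct / count($testX);
printf("Accuracy on random data: %.2f\n", $accuracy);
```

With two equally likely labels, the accuracy should hover around 0.5, which is what guessing would achieve. Nothing in the features carries information about the labels, so there is nothing for the model to learn.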

realrecordzLab commented 5 years ago

This is just an exercise to get comfortable with the library. I know that ML works better with real problems, and I appreciate your comment. If you have suggestions for what a good exercise would be, please let me know.