RubixML / ML

A high-level machine learning and deep learning library for the PHP language.
https://rubixml.com
MIT License

How should we do something like More Like This (MLT) in Rubix ML? #75

Open marclaporte opened 4 years ago

marclaporte commented 4 years ago

MoreLikeThis is something we use massively. I am confident there is a way to replicate this with Rubix ML (and even have a learning system), but I am wondering what approach you would suggest.

Related links:

Thanks!

andrewdalpino commented 4 years ago

Great question @marclaporte!

I am not an expert, but I read the document you linked and it looks like they're performing some type of full-text search using TF-IDF (Term Frequency Inverse Document Frequency) scores. Here is the algorithm that they are using.

https://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

It looks like they use document distance as the method of choosing related documents. If that is the case, then selecting the document that is 'most like this' is just a matter of finding the document's nearest neighbor. You can accomplish this in Rubix ML with the Word Count Vectorizer, the TF-IDF Transformer, and the KDNeighbors classifier with k set to 1. The label of each sample document in this case will be the identifier of a unique entry in a database, such as another document or a question/answer pair.

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Transformers\TfIdfTransformer;
use Rubix\ML\Classifiers\KDNeighbors;

// Import training samples and labels

$dataset = Labeled::build($samples, $labels)
    ->apply(new WordCountVectorizer())
    ->apply(new TfIdfTransformer());

$estimator = new KDNeighbors(1);

$estimator->train($dataset);

$mlt = $estimator->predictSample($sample);

Here $sample is the TF-IDF encoded document that you want to find the MLT of, and $mlt is the label (or index) of the MLT document. Since you'll be using a spatial tree for the queries, the search will be very fast - O(n log n) instead of O(n^2) as with the brute force method.
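
To build $sample for a new query document, one option is to keep references to the fitted transformers (call them $vectorizer and $tfIdf here) and run the query through them so it shares the training vocabulary and IDF weights. A minimal sketch, assuming apply() reuses transformers that have already been fitted rather than refitting them:

use Rubix\ML\Datasets\Unlabeled;

// Encode the query document with the transformers fitted on the training set
$query = new Unlabeled([['the text of the query document']]);

$query->apply($vectorizer)
    ->apply($tfIdf);

$sample = $query->sample(0);

$mlt = $estimator->predictSample($sample);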

Some other approaches that you can take with Rubix ML include other Learners such as Naive Bayes or a Multilayer Perceptron. Your input will still be the TF-IDF weighted word counts but these models may work better (or worse) for your particular problem. Here is an example project that classifies the sentiment of a document using an MLP.

https://github.com/RubixML/Sentiment

Thanks again for the great question!

alaindesilets commented 4 years ago

In Marc's case, a typical scenario involves finding the 10 to 50 most similar documents among a possibly large set (say, 10K documents). So we would have to use k=10 to 50, not 1.

How well does the k-Nearest Neighbour algorithm scale? How long would it take to identify the k=10 nearest neighbours among a set of say, 5K documents, each consisting of say, 2 pages?

From what I understand, ES's MLT algorithm proceeds in 3 phases:

This allows it to scale to very large sets of documents, because it does not need to evaluate the document similarity with each and every document in the dataset.

andrewdalpino commented 4 years ago

Hey @alaindesilets, yes, you'd increase k to match the number of similar documents you want retrieved, and instead of the predict() API you'd use proba() to return an array of probabilities. Sort that array and then take the top k to get the k MLT documents. This assumes every sample is labeled uniquely.
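
As a rough sketch of that idea, assuming an estimator trained as in the earlier example but with a larger k (say 10) and a transformed query $sample:

use Rubix\ML\Datasets\Unlabeled;

// Predicted probabilities over the unique document labels for the query
$probabilities = $estimator->proba(Unlabeled::quick([$sample]))[0];

// Sort by score and keep the top k (here k = 10) most similar documents
arsort($probabilities);

$mlt = array_slice($probabilities, 0, 10, true);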

Scaling KNN depends on the underlying implementation. In Rubix we have Estimators for both brute-force and spatial-tree-accelerated implementations. There are tradeoffs with each, but the tree versions are much faster. The search time complexity of the tree-based Estimator is O(n log n), and since it uses summary nodes (Ball or Box nodes) we are able to pre-prune branches before entering them. Note the difference in the benchmarks below.

(screenshot: benchmark comparing the brute-force and tree-based implementations)

The test conditions are 2,500 training and 10,000 test samples of 4 dimensions. I haven't done a benchmark using TF-IDF scores, but I'd encourage you to do some on a prototype model if performance and scalability are a concern. That said, due to the lazy-learning nature of KNN, it's not the most scalable algorithm in the library. I go a little deeper into the pros and cons of each Estimator here.
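
For reference, the two flavors look something like this (a quick sketch with k set to 10):

use Rubix\ML\Classifiers\KNearestNeighbors;
use Rubix\ML\Classifiers\KDNeighbors;

// Brute force: scans every training sample on each query
$brute = new KNearestNeighbors(10);

// Spatial tree accelerated: prunes whole branches using their summary nodes
$accelerated = new KDNeighbors(10);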

It sounds like we're on the same page with the Elasticsearch algorithm. One way you can narrow the input feature space down in Rubix is through the Word Count Vectorizer, particularly the max vocabulary and min document frequency parameters. The example below will use as the vocabulary only the top 1000 words/tokens (ranked by term frequency) that appear in at least 3 different documents. You can change the tokenizer from Word to N-gram or Skip-Gram as well. Just pointing out that this is not exactly the same as the Elasticsearch algorithm.

use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Other\Tokenizers\NGram;

$dataset->apply(new WordCountVectorizer(1000, 3, new NGram(1, 2)));

Another way to reduce the size of the input feature space would be to perform dimensionality reduction or feature selection after computing the TF-IDF values. I go a little into this here.
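
For example, a Gaussian Random Projector could be appended after the TF-IDF step. Just a sketch - the target dimensionality is something you'd tune for your data:

use Rubix\ML\Transformers\GaussianRandomProjector;

// Project the TF-IDF vectors down to 500 dimensions
$dataset->apply(new GaussianRandomProjector(500));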

I'd say if you wanted to replicate the Elasticsearch algorithm using ML then just use Elasticsearch but if you wanted to take control of the algorithm and make changes to it over time then you can design something in Rubix.

andrewdalpino commented 4 years ago

We added maximum document frequency to Word Count Vectorizer in the last commit (https://github.com/RubixML/RubixML/commit/52033972b1193671b6e4c1a4d005918c5a825e27) so now you can narrow the vocabulary by removing frequent tokens similarly to Elasticsearch.

Example

use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Other\Tokenizers\NGram;

$dataset->apply(new WordCountVectorizer(1000, 3, 500, new NGram(1, 2)));
marclaporte commented 4 years ago

We discussed this at the recent community meeting, and Victor and Andrew should meet up about this in the coming weeks.

andrewdalpino commented 4 years ago

Great @marclaporte I already began doing some experiments on my end

Victor and I will take it from there when ready

kroky commented 4 years ago

In fact, I experimented with this a bit as well but have some questions. @andrewdalpino, I'll meet you on Telegram!

andrewdalpino commented 4 years ago

Here is what I came up with - it uses the positive reviews from the Sentiment dataset. You should be able to drop this code right into the project (e.g. mlt.php). It's not BM25F, but it's kinda like it and it can be improved over time. For example, we may be able to achieve even better accuracy by switching from TF-IDF weighted word counts to word or document embeddings (Word2Vec, Doc2Vec, FastText). This essentially brings an unsupervised form of deep learning into the picture to embed each document into a dense continuous vector - then we use the same spatial tree nearest-neighbors search to query efficiently.

<?php

include __DIR__ . '/vendor/autoload.php';

use Rubix\ML\Other\Loggers\Screen;
use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Transformers\HTMLStripper;
use Rubix\ML\Transformers\TextNormalizer;
use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Transformers\TfIdfTransformer;
use Rubix\ML\Graph\Trees\BallTree;
use Rubix\ML\Kernels\Distance\Cosine;

ini_set('memory_limit', '-1');

$logger = new Screen();

$logger->info('Loading data into memory');

$samples = $labels = [];

foreach (glob("train/positive/*.txt") as $file) {
    $samples[] = [file_get_contents($file)];
    $labels[] =  basename($file);
}

$dataset = new Labeled($samples, $labels);

$dataset = $dataset->apply(new HTMLStripper())
    ->apply(new TextNormalizer())
    ->apply(new WordCountVectorizer(5000, 3, 5000))
    ->apply(new TfIdfTransformer());

$testing = $dataset->randomize()->take(1);

$tree = new BallTree(30, new Cosine());

$logger->info('Growing spatial tree');

$start = microtime(true);

$tree->grow($dataset);

$minutes = (microtime(true) - $start) / 60.0;

$logger->info("Done in $minutes minutes");

$start = microtime(true);

[$samples, $labels, $distances] = $tree->nearest($testing->sample(0), 5);

$seconds = microtime(true) - $start;

$logger->info("Search took $seconds seconds");

echo PHP_EOL;

echo 'Input Document:' . PHP_EOL;

echo file_get_contents("train/positive/{$testing->label(0)}") . PHP_EOL;

echo PHP_EOL;

echo 'Most like this:' . PHP_EOL;

echo file_get_contents("train/positive/$labels[0]") . PHP_EOL;

echo PHP_EOL;

echo 'Top 5 Files:' . PHP_EOL;

print_r($labels);

echo PHP_EOL;

echo 'Top 5 Document Distances:' . PHP_EOL;

print_r($distances);
(screenshot of the script's output)

kroky commented 4 years ago

What if I want to train an existing model and then just query against it at runtime? I suppose BallTree cannot be persisted with the existing persisters and I'd need to use KDNeighbors? In fact, this works at a pretty similar speed, so I'm planning to use it unless there is an option to persist a tree that is already grown:

<?php

require('vendor/autoload.php');

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Extractors\CSV;
use Rubix\ML\Kernels\Distance\Cosine;
use Rubix\ML\Persisters\Filesystem;
use Rubix\ML\Pipeline;
use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Transformers\TfIdfTransformer;
use Rubix\ML\Transformers\StopWordFilter;
use Rubix\ML\Transformers\TextNormalizer;
use Rubix\ML\Classifiers\KDNeighbors;
use Rubix\ML\Graph\Trees\BallTree;

$stopwords = ['i','me','my','myself','we','our','ours','ourselves','you','your','yours','yourself','yourselves','he','him','his','himself','she','her','hers','herself','it','its','itself','they','them','their','theirs','themselves','what','which','who','whom','this','that','these','those','am','is','are','was','were','be','been','being','have','has','had','having','do','does','did','doing','a','an','the','and','but','if','or','because','as','until','while','of','at','by','for','with','about','against','between','into','through','during','before','after','above','below','to','from','up','down','in','out','on','off','over','under','again','further','then','once','here','there','when','where','why','how','all','any','both','each','few','more','most','other','some','such','no','nor','not','only','own','same','so','than','too','very','s','t','can','will','just','don','should','now'];
$nearest = 20;

$start = microtime(true);

$persister = new Filesystem('trained.model');
if (is_file('trained.model')) {
  $estimator = $persister->load();
} else {

  $dataset = Labeled::fromIterator(new CSV('tracker_46.csv'));
  $mdf = (int) ($dataset->numRows() / 2); // use half the corpus size as the max document frequency

  $estimator = new Pipeline([
    new TextNormalizer(),
    new StopWordFilter($stopwords),
    new WordCountVectorizer(10000, 1, $mdf),
    new TfIdfTransformer(),
  ], new KDNeighbors($nearest, true, new BallTree($nearest, new Cosine())));

  $estimator->train($dataset);

  $persister->save($estimator);
}

$mlt = $estimator->probaSample(["problem resolution and decision trees"]);
arsort($mlt);
$mlt = array_slice($mlt, 0, $nearest, true);
print_r($mlt);

echo (memory_get_peak_usage()/1024/1024)." MB\n";
$time = microtime(true) - $start;
echo $time." s\n";
andrewdalpino commented 4 years ago

What if I want to train an existing model and then just query against it at runtime? I suppose BallTree cannot be persisted with the existing persisters and I'd need to use KDNeighbors?

Any object can be persisted (e.g. file_put_contents('tree.data', serialize($object));), but since Ball Tree is a lower-level abstraction it doesn't work directly with Persister objects. Feel free to use the higher-level API; I only implemented MLT this way to try to get the best performance.
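
For example, a minimal sketch of persisting a grown Ball Tree with plain serialization, where $sample is a query document transformed the same way as the training set:

// Grow the tree once and save it to disk
$tree->grow($dataset);

file_put_contents('tree.data', serialize($tree));

// Later, restore the grown tree without having to re-grow it
$tree = unserialize(file_get_contents('tree.data'));

[$samples, $labels, $distances] = $tree->nearest($sample, 5);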

andrewdalpino commented 4 years ago

Just a heads up: I've added a BM25 transformer to the Extras package that you can try out as well. This should be an improvement over TF-IDF for document retrieval. So far I've been getting great results. You can just replace the TF-IDF transformer with BM25 and voilà!

https://github.com/RubixML/Extras/blob/master/docs/transformers/bm25-transformer.md
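
In the mlt.php example above, the swap looks something like this (1.2 and 0.75 are just the stock BM25 parameters I've been using):

use Rubix\ML\Transformers\BM25Transformer;

// Swap the TF-IDF step in the pipeline for BM25
$dataset->apply(new BM25Transformer(1.2, 0.75)); // was ->apply(new TfIdfTransformer())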

(screenshots of the BM25 results)

andrewdalpino commented 4 years ago

So @kroky, with the new BM25 Transformer we can replicate two of Lucene's search strategies. The first is their BM25 method, which we replicate by using the new BM25 Transformer in place of TF-IDF in the examples above.

https://lucene.apache.org/core/7_0_0/core/org/apache/lucene/search/similarities/BM25Similarity.html

The second is Lucene's TF-IDF strategy. Since their version of TF-IDF is just a special case of BM25 with $alpha (k1) set to 0, we can use the BM25 Transformer to replicate their TF-IDF similarity strategy with document length normalization controlled by the $beta parameter. You'll notice they use Cosine as a similarity metric as well (Cosine distance == 1 - Cosine similarity). The only things we are missing are query boosting and document boosting.

https://lucene.apache.org/core/7_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
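
For example, something like this would replicate their TF-IDF similarity (a sketch, with $alpha first and $beta second):

use Rubix\ML\Transformers\BM25Transformer;

// Alpha (k1) of 0 turns off term frequency saturation, leaving a Lucene-style
// TF-IDF weighting with length normalization still controlled by beta
$dataset->apply(new BM25Transformer(0.0, 0.75));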

And to be complete, you can also use the TF-IDF Transformer like before; however, it does not have document length normalization, so it is simpler than Lucene's TF-IDF strategy. In addition, the TF-IDF Transformer does not dampen the effect of high term counts by taking the square root of the term frequency (the new BM25 Transformer, however, does use a non-linear TF dampening function controlled by the $alpha parameter).

andrewdalpino commented 4 years ago

Also @kroky, just a heads up so you don't spend hours scratching your head like I did ... things get a little weird with Cosine distance and zero vectors (norm = 0). Cosine is undefined for zero vectors (a or b or both having 0 term counts from the vocabulary, i.e. a blank document), but we currently return 1 in this scenario, which is correct in terms of the equation but breaks down under two circumstances. The first is that the distance from a zero vector to a non-zero vector will return 1 when really it should be 2 (the furthest distance possible), and the second is that the distance between two zero vectors is computed as 1 when it really should be 0 (two blank documents can be interpreted as being the same).

Unfortunately, the only way I could think of to detect this condition is if we had the vector norms (if the norm is 0 then the vector contains all zeros), and so far I haven't been able to make this jibe with the new optimization you implemented. However, the good news is that it does work with the implementation that I outlined here, since it computes the exact norms.

This wouldn't seem like such a big issue but unfortunately it causes Ball Tree to enter an infinite loop during construction :/ Can you think of another way around this using your implementation?

You can test the bug by inserting a zero vector as one of the samples and then training/growing the tree. You'll know there's a problem when the universe expands into nothingness before the algorithm completes.

Adding the following to the Cosine distance kernel's compute() method, right before the final return statement, fixes the issue.

        if ($ssA === 0.0 and $ssB === 0.0) {
            return 0.0;
        }

        if ($ssA === 0.0 or $ssB === 0.0) {
            return 2.0;
        }
kroky commented 4 years ago

Yes, I noticed the infinite loop in Ball::split when one of the dataset samples was empty. It kept finding the same centroid. I thought a solution would be to remove empty samples, but you are right that we could still have zero vectors with 0 term counts from the vocabulary. How about this solution?

public function compute(array $a, array $b) : float
    {
        $sigma = $ssA = $ssB = 0.0;
        $zeroA = $zeroB = false;

        foreach ($a as $i => $valueA) {
            $valueB = $b[$i];
            $sigma += $valueA * $valueB;
            $zeroA = $zeroA || $valueA;
            $zeroB = $zeroB || $valueB;
        }

        $zeroA = !$zeroA;
        $zeroB = !$zeroB;

        if ($zeroA && $zeroB) {
            return 0.0;
        }

        if ($zeroA || $zeroB) {
            return 2.0;
        }

        if (abs($sigma) > EPSILON) {
            foreach ($a as $i => $valueA) {
                $ssA += $valueA ** 2;
                $ssB += $b[$i] ** 2;
            }
        }

        return 1.0 - ($sigma / (sqrt($ssA * $ssB) ?: EPSILON));
    }

I am running benchmarks with composer run benchmark -- --filter=Cosine but it keeps showing the same times, so I'm not sure I'm measuring this correctly. Can you try? Boolean comparisons should be quick. We could also try checking for non-zeros with if statements, but I think that would be slower.

kroky commented 4 years ago

BM25 is great! I read more about its superiority to TF-IDF and it seems like a very nice replacement. Our use case is not so much about document length, since most of our documents have similar lengths, but a way to discard extremely popular words and rank the documents containing more of the search string words higher is awesome. I also found it is faster than TF-IDF, not sure why - considerably faster for a small corpus and slightly faster for bigger ones.

andrewdalpino commented 4 years ago

I also found it is faster than TF-IDF, not sure why - considerably faster for a small corpus and slightly faster for bigger ones.

It could be that the learner can construct a better Ball Tree (more efficient at clustering similar documents) with the new feature representation ¯\_(ツ)_/¯

Our use case is not so much about document length, since most of our documents have similar lengths, but a way to discard extremely popular words and rank the documents containing more of the search string words higher is awesome.

BM25 dampens the effect of high term counts but it doesn't discard them - that is the job of Word Count Vectorizer (by not including the word in the vocabulary in the first place). Any word with a document frequency greater than the max document frequency (I think we'll change this setting to a proportion of the training set in 1.0.0), for example, will not be included in the vocabulary.

andrewdalpino commented 4 years ago

Hey @kroky there was an issue with the benchmark but it's fixed now. I guess it wasn't calling the setUp() method to instantiate the kernel.

(screenshot: Cosine kernel benchmark comparing the three implementations)

Green is your original optimization, red is the new one with compensation for zero vectors, and blue is the one I outlined here with zero-vector compensation. I'm thinking we go with blue unless we can beat it or come close with more elegant code.

andrewdalpino commented 4 years ago

Also, you may find this useful. I experimented with adding dimensionality reduction to the features and went from 10,000 to 500 features with hardly any loss in "relevancy". It does take longer to train (due to the O(n^3) matmul), but it may be worth it as the search time was reduced to less than 1 second. You may even get good results with a lot less than 500 dimensions depending on your dataset. In my experiment I added Gaussian Random Projector to the pipeline. This effectively projects the BM25 vectors onto a spherically random hyperplane, yet most of the information needed to identify the documents remains encoded in the new, lower-dimensional vector space. It's pretty crazy; you can look up the Johnson–Lindenstrauss lemma for more info as to why this works. Apparently this technique is called random indexing in the NLP world. Here's a paper I found that explains the concept.

use Rubix\ML\Transformers\HTMLStripper;
use Rubix\ML\Transformers\TextNormalizer;
use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Transformers\BM25Transformer;
use Rubix\ML\Transformers\GaussianRandomProjector;

$dataset = $dataset->apply(new HTMLStripper())
    ->apply(new TextNormalizer())
    ->apply(new WordCountVectorizer(10000, 3, 500))
    ->apply(new BM25Transformer(1.2, 0.75))
    ->apply(new GaussianRandomProjector(500));

This is the result

(screenshot of the resulting output)

Note that after random projection there is no longer a column for the word 'Stargate', or for any word from the vocabulary for that matter. However, the 500 dimensions still encode the entire vocabulary in the new vector space.

kroky commented 4 years ago

Very useful, thanks, I will take it into consideration.

Regarding the optimization - I agree we should go with the fastest possible method, as the speed of a single computation in this case affects overall performance a lot.

andrewdalpino commented 4 years ago

Ok @kroky, the fix will be out in 0.1.5 then (I went with the blue one) ... I really liked your implementation though (clever and elegant); I'm bummed it didn't work out due to this undefined behavior.

Btw I'd also recommend trying Sparse Random Projector (they use it in the random indexing paper). Hint: in 0.2.0 it will be getting a feature that allows you to control the sparsity of the projections. For my dataset of 10,000ish movie reviews, 500 dimensions seems to be sufficient to capture most of the information contained in the high-dimensional (10,000) BM25 weighted word count vectors.

andrewdalpino commented 4 years ago

A couple more things @kroky - just added to the Extras repo is the new Token Hashing Vectorizer, which works well for low-memory-footprint applications. It doesn't build a vocabulary like Word Count Vectorizer. Instead, it employs a hashing trick to bucket token counts into their place in a fixed-length feature vector. It does suffer from collisions though, so beware of that, especially in lower-dimensional vector spaces (I'd guess < 10,000). It doesn't have any control over very rare or stop words either (no max/min document frequency).

use Rubix\ML\Transformers\HTMLStripper;
use Rubix\ML\Transformers\TextNormalizer;
use Rubix\ML\Transformers\TokenHashingVectorizer;
use Rubix\ML\Transformers\BM25Transformer;

$dataset = $dataset->apply(new HTMLStripper())
    ->apply(new TextNormalizer())
    // ->apply(new WordCountVectorizer(10000, 3, 500))
    ->apply(new TokenHashingVectorizer(10000))
    ->apply(new BM25Transformer(1.2, 0.75));

I did a drop-in replacement of Word Count Vectorizer in the Sentiment demo and there was about a 0.05 drop in the F1 score. My guess is that this is due to noise introduced by colliding tokens. Note that, unlike Python, we do not have a sparse array implementation like SciPy's, so we can't just define an insanely large vector to avoid collisions - it would actually need the memory and compute.

Also, in 0.2.0, the TF-IDF Transformer will have variable Laplace (additive) smoothing. In previous versions it used add-one smoothing, but now it is variable with add-one as the default. This parameter essentially borrows probability mass from the fitted vocabulary to give to words/tokens it has never seen before, effectively reducing the impact of high inverse document frequencies. Setting smoothing to less than 1.0 (the default), however, will actually give some probability mass back to the fitted vocabulary.

https://docs.rubixml.com/en/0.2.0/transformers/tf-idf-transformer.html#parameters
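
For example, something like this (0.4 is just an illustrative value):

use Rubix\ML\Transformers\TfIdfTransformer;

// Smoothing below the add-one default gives some probability mass
// back to the fitted vocabulary
$dataset->apply(new TfIdfTransformer(0.4));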

Finally, it is possible to pre-normalize the samples using L1 or L2 normalizers. This would allow us to forego normalization in the Cosine kernel. What we could do is implement a 'Dot' distance kernel that only computes the dot product without normalization. I'd imagine this would shave a bit more time off both search and tree building.
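
Something along these lines is what I have in mind. Just a sketch of the core computation, assuming the samples were already L2 normalized so the dot product equals the cosine similarity; wiring it in would also mean implementing the library's Distance interface:

class Dot
{
    public function compute(array $a, array $b) : float
    {
        $sigma = 0.0;

        foreach ($a as $i => $valueA) {
            $sigma += $valueA * $b[$i];
        }

        // With unit-length vectors the dot product is the cosine similarity,
        // so 1 - dot preserves the ordering of the Cosine kernel
        return 1.0 - $sigma;
    }
}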

Let me know your thoughts!

kroky commented 3 years ago

Andrew, thanks for all that info and the new releases! I was a bit away from the topic but I'm getting back up to full speed. I am very near the point where Rubix ML will be actually usable inside Tiki, and our first model will be the MLT example we've discussed. Your latest suggestions sound very good and we can play with them a bit more once the integration is in place, as it will allow non-developers to build, train, and execute models. Will share more details as I have them.

kroky commented 3 years ago

I just wanted to bring up the infinite loop problem, as it occurred once again with v0.2.0 and the cosine kernel updates in place (that is, returning proper results for zero vectors). I think we might have an additional edge case in the BallTree::split method which we need to take care of. I have to debug it more, though. My actual dataset is a very small set (94 samples) of mostly testing data (the vector dimension is just 42), so the vectors contain a lot of zeros, but still, I believe the infinite loop should be handled more elegantly. Will debug more and get back to you.

kroky commented 3 years ago

Andrew, I submitted PR #118 regarding the infinite loop problem...

andrewdalpino commented 3 years ago

Just saw this now @kroky - thanks for the PR. We also found similar bugs in a few other trees thanks to your excellent debugging skills.

marclaporte commented 3 years ago

For the record, Rubix ML is now in Tiki, with MLT: https://doc.tiki.org/Machine-Learning

Congrats @andrewdalpino and @kroky !

Thank you @alaindesilets for your insight.

andrewdalpino commented 3 years ago

Since 0.4.0 we can now apply Latent Semantic Indexing to the document matrix.

use Rubix\ML\Transformers\HTMLStripper;
use Rubix\ML\Transformers\TextNormalizer;
use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Transformers\BM25Transformer;
use Rubix\ML\Transformers\TruncatedSVD;

$dataset = $dataset->apply(new HTMLStripper())
    ->apply(new TextNormalizer())
    ->apply(new WordCountVectorizer(10000, 3, 500))
    ->apply(new BM25Transformer(1.2, 0.75))
    ->apply(new TruncatedSVD(100));

A dimensionality reduction factor of 100 (10,000 features down to 100) gives about the same order of speedup.

CC: @kroky @marclaporte

marclaporte commented 3 years ago

@andrewdalpino: This is fantastic! This is something we massively use, and performance is an important factor.

@kroky: Can you upgrade Tiki to use this new version?

Thanks!

kroky commented 3 years ago

Hey @andrewdalpino!

Thanks for the update. I tried to use it but it fails with this error from the Tensor extension: Matrix element must be an integer or floating point number, unknown type given.

The data is just a sample:

/home/vicy/work/tiki-git/vendor_bundled/vendor/rubix/ml/src/Transformers/TruncatedSVD.php:118:
array (size=1)
  0 => 
    array (size=43)
      0 => float 0.12831317492024
      1 => int 0
      2 => int 0
      3 => int 0
      4 => int 0
      5 => int 0
      6 => int 0
      7 => int 0
      8 => int 0
      9 => int 0
      10 => int 0
      11 => int 0
      12 => int 0
      13 => int 0
      14 => int 0
      15 => int 0
      16 => int 0
      17 => int 0
      18 => int 0
      19 => int 0
      20 => int 0
      21 => int 0
      22 => int 0
      23 => int 0
      24 => int 0
      25 => int 0
      26 => int 0
      27 => int 0
      28 => int 0
      29 => int 0
      30 => int 0
      31 => int 0
      32 => int 0
      33 => int 0
      34 => int 0
      35 => int 0
      36 => int 0
      37 => int 0
      38 => int 0
      39 => int 0
      40 => int 0
      41 => int 0
      42 => int 0

Not sure what is wrong with this data. If I run the same data through the rubix/tensor PHP package it works fine. However, it seems the PECL extension is not handling it correctly. Can you please check?

Also, I noticed an issue with the rubix/ml PHP package - MissingExtension is misspelled in the filename as MIssingExtension.php, so a fatal error is thrown instead of a catchable MissingExtension error when the PECL extension is not loaded. I didn't have time to submit a PR but thought it was worth mentioning.

lalit-powercozmo commented 9 months ago

Hello @andrewdalpino & @kroky, I am working on natural language search for an e-commerce site (PHP & MySQL), similar to Algolia, and I am using Rubix ML for it, but I encounter this error when training on the dataset:

Fatal error: Uncaught TypeError: Unsupported operand types: string - float in C:\laragon\www\machine-learning\vendor\rubix\ml\src\Kernels\Distance\Euclidean.php:47 Stack trace: #0 C:\laragon\www\machine-learning\vendor\rubix\ml\src\Classifiers\KNearestNeighbors.php(304): Rubix\ML\Kernels\Distance\Euclidean->compute(Array, Array)

Here is my code:

require 'vendor/autoload.php';

// Import the necessary classes
use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Extractors\SQLTable;
use Rubix\ML\PersistentModel;
use Rubix\ML\Persisters\Filesystem;
use Rubix\ML\Pipeline;
use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Classifiers\KNearestNeighbors;
use Rubix\ML\Extractors\ColumnPicker;
use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Serializers\RBX;
use Rubix\ML\Transformers\NumericStringConverter;
use Rubix\ML\Transformers\OneHotEncoder;
use Rubix\ML\Transformers\ZScaleStandardizer;

$host = 'localhost';
$db = 'powercozmo';
$user = 'root';
$password = '';

$dsn = "mysql:host=$host;dbname=$db;charset=UTF8";

try {
    $connection = new PDO($dsn, $user, $password);

    // Set the error mode to exception
    $connection->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    // Echo a success message
    echo "Connected to the $db database successfully!";
} catch (PDOException $e) {
    // Echo an error message
    echo "Connection failed: " . $e->getMessage();
}

$query = $connection->prepare('SELECT id, name FROM products ');
$query->execute();
$extractor = $query->fetchAll(PDO::FETCH_NUM);

$dataset = Labeled::fromIterator($extractor)
    ->apply(new NumericStringConverter())
    ->apply(new OneHotEncoder())
    ->apply(new ZScaleStandardizer());

// Create a Word2Vec transformer with 100 dimensions and a window size of 2
$transformer = new WordCountVectorizer(100, 2);

// Create a KNN classifier with 3 neighbors
$estimator = new KNearestNeighbors(3);

// Create a pipeline with the transformer and the estimator
$pipeline = new Pipeline([$transformer], $estimator);

// Train the pipeline on the dataset
$pipeline->train($dataset);

// echo ""; print_r($dataset); die();

// Save the trained pipeline to a file
$persistor = new Filesystem('product-search.rbx');
$model = new PersistentModel($pipeline, $persistor);
$model->save();

// Load the trained pipeline from the file
$model = PersistentModel::load($persistor);

// Get the user input from the request
$query = 'Looking for Water Turbine';

// Predict the label for the user query
$prediction = $model->predictSample([$query]);

// Display the result
echo $prediction . "\n";

kroky commented 9 months ago

I suppose the data you feed to the iterator, or that comes from the extractor, is not all numeric but contains strings, and that's why the vector distance kernel throws the error - please check your data.

lalit-powercozmo commented 9 months ago

ok thanks will check