RubixML / ML

A high-level machine learning and deep learning library for the PHP language.
https://rubixml.com
MIT License

Stemming handlers (start with porter?) #60

Closed. raijyan closed this issue 4 years ago.

raijyan commented 4 years ago

Would be great if we could have stemmers added to the "Other" section to reduce the dimensionality of NLP features, cutting down some wasted memory/processing time from things like plurals and generating stronger links for the TfIdf transformer. Stemming is usually applied after basic normalisation and stop word removal.

I'd imagine something like it becoming a 4th option of the WordCountVectorizer, though for processing it'd make sense for it to kick in during the tokenize method, e.g. in NGram before it stitches the split word tokens back together.

Examples that'd be easy to drop in can be found at https://tartarus.org/martin/PorterStemmer/php.txt and https://github.com/angeloskath/php-nlp-tools/blob/master/src/NlpTools/Stemmers/PorterStemmer.php

^ tartarus.org/martin being the home of the author of the Porter algorithm.
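
For a sense of what dropping one of those in looks like, here's a minimal sketch using the php-nlp-tools class linked above (relying on its stem() method as in the linked source; the example outputs are classic Porter behaviour):

use NlpTools\Stemmers\PorterStemmer;

$stemmer = new PorterStemmer();

echo $stemmer->stem('connections'); // connect
echo $stemmer->stem('running');     // run
echo $stemmer->stem('plurals');     // plural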

If more adventurous, there's a bunch of multi-language examples at https://github.com/wamania/php-stemmer (could be added as a composer dependency?)
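
Consuming that package would be along these lines (factory and language names as used in its README; other Snowball languages such as 'french' or 'german' are created the same way):

use Wamania\Snowball\StemmerFactory;

$stemmer = StemmerFactory::create('english');

echo $stemmer->stem('generously'); // generous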

andrewdalpino commented 4 years ago

@raijyan I think this is a great idea and I've considered it before myself

One of my concerns was with non-English use cases. I like the idea of a stemming tokenizer for the reason you've mentioned but also because it wouldn't require another argument to Word Count Vectorizer.

https://github.com/wamania/php-stemmer seems like it can be integrated into a tokenizer quite easily. We could have a single Stemmer tokenizer that wraps one of the other tokenizers (NGram, SkipGram, Word, etc.) and stems their output, or if that is not possible we could implement a stemming version of each tokenizer.
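
As a rough sketch of that wrapper idea (a hypothetical class, not in the library; it assumes only the Tokenizer interface's tokenize(string) : array method and any stemmer object exposing stem(string) : string):

use Rubix\ML\Other\Tokenizers\Tokenizer;

// Hypothetical wrapper: delegate to any base tokenizer, then stem
// each token it produces.
class StemmingTokenizer implements Tokenizer
{
    protected $base;

    protected $stemmer;

    public function __construct(Tokenizer $base, $stemmer)
    {
        $this->base = $base;
        $this->stemmer = $stemmer;
    }

    public function tokenize(string $string) : array
    {
        return array_map([$this->stemmer, 'stem'], $this->base->tokenize($string));
    }
}

Used as, say, new WordCountVectorizer(10000, 3, new StemmingTokenizer(new NGram(1, 1), $stemmer)), it would add stemming without another argument to Word Count Vectorizer.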

I am considering a 'Rubix ML Extras' repository and package that would include experimental features such as obscure transformers, neural network activation functions, and perhaps stemmers. The hope is that we have enough hardcore users who will install and experiment with these features before (and if) we include them in the main package.

We are currently in a 'feature freeze' until our first stable release (we just put out our first release candidate this week), which means we do not plan to add additional functionality until after then: only optimizations, bugfixes, and mayyyyyyybe a small feature. However, we are free to develop an 'Extras' package in the meantime.

I'd love to hear your thoughts

Do you or someone you know have proficiency with stemmers?

Thanks for the great recommendation and information!

simplechris commented 4 years ago

Just lurkin' around, but yeah, I've ported all of the stemmers/tokenizers etc. (including PorterStemmer) from Lucene. I agree that it probably belongs in an 'Extras' or other external package if you want tighter integration with Rubix.

raijyan commented 4 years ago

Tested adding Wamania\Snowball to my dependencies. On my product descriptions dataset (4,000 products) it went from:

use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Transformers\TfIdfTransformer;
use Rubix\ML\Other\Tokenizers\NGram;

$vectorizer = new WordCountVectorizer(50000, 3, new NGram(1, 1));
$dataset->apply($vectorizer);
$dataset->apply(new TfIdfTransformer());

echo 'Memory: ' . memory_get_usage() / 1024 / 1024 . 'M' . PHP_EOL;
echo 'Tokens: ' . count($vectorizer->vocabularies()[0]) . PHP_EOL;

print_r(array_slice($vectorizer->vocabularies()[0], 0, 40));

Memory: 1448.11M
Tokens: 3575
Array
(
    [0] => style
    [1] => size
    [2] => this
...

to:

use Wamania\Snowball\StemmerFactory;

// 4th argument is my local hack that passes a stemmer through to tokenize()
$vectorizer = new WordCountVectorizer(50000, 3, new NGram(1, 1), StemmerFactory::create('english'));
$dataset->apply($vectorizer);
$dataset->apply(new TfIdfTransformer());

echo 'Memory: ' . memory_get_usage() / 1024 / 1024 . 'M' . PHP_EOL;
echo 'Tokens: ' . count($vectorizer->vocabularies()[0]) . PHP_EOL;

print_r(array_slice($vectorizer->vocabularies()[0], 0, 40));

Memory: 853.78M
Tokens: 2680
Array
(
    [0] => style
    [1] => size
    [2] => this
...

So there are some savings to be made, at least for my use cases. Should cut a few hours off my training times on a Jaccard-based model.

An Extras setup would be cool if you're pushing for a feature freeze. Would probably look at putting in a lemmatizer and a locality normaliser too then (darn variants of English).

Loving the library so far though. Working through moving my existing production NLP over to it, then time for some experiments >:)

raijyan commented 4 years ago

> We could have a single Stemmer tokenizer that wraps one of the other tokenizers (NGram, SkipGram, Word, etc.) and stems their output, or if that is not possible we could implement a stemming version of each tokenizer.

Yeah, a wrapper might help. Currently I've just tacked it in as:

// Inside my patched tokenizer: stem each word before the n-grams
// are stitched back together.
public function tokenize(string $string, $stemmer = null) : array
...
    $nGram = $stemmer ? $stemmer->stem($word) : $word;
...

Not quite as clean as I'd like, but it's done the trick for getting it up and running. Getting some nice results using NGram over my old php-nlp/php-ai combo with single-word tokens.

A slight change to the structure would be cool if it allowed fuller use of the multi-dictionary setup you've made. You could then set the token configuration per defined dictionary from the column picker. E.g. being able to configure that my tags/attributes are single-word tokens but my titles/descriptions are NGram(1, 3) when it iterates over them and builds the dictionaries used for vectors would offer a further performance improvement for... lazy... datasets.
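
Something like this, to sketch the idea (a purely hypothetical signature, the vectorizer accepts nothing like it today): a tokenizer per column offset instead of a single one.

use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Other\Tokenizers\Word;
use Rubix\ML\Other\Tokenizers\NGram;

// Imagined per-column tokenizer map, keyed by column offset.
$vectorizer = new WordCountVectorizer(50000, 3, [
    0 => new Word(),      // tags/attributes: single-word tokens
    1 => new NGram(1, 3), // titles/descriptions: 1- to 3-grams
]);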

andrewdalpino commented 4 years ago

@raijyan @simplechris

We went ahead and created an Extras package that can be installed (composer require rubix/extras) right now as dev-master.

Included is the Word Stemmer which can be used alone, or as the base tokenizer for either N-Gram or Skip Gram. Example below ...

use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Other\Tokenizers\NGram;
use Rubix\ML\Other\Tokenizers\WordStemmer;

$transformer = new WordCountVectorizer(10000, 3, new NGram(1, 2, new WordStemmer('english')));

The changes to N-Gram and Skip Gram have not been released yet but you can install the latest dev-master to preview the features.

In addition, we've added the Delta TF-IDF Transformer, a supervised TF-IDF transformer that boosts term frequencies by how unique they are to a particular class, not just the entire corpus.
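
For reference, a minimal sketch of dropping it into the pipeline above, assuming the Extras transformer lives under the usual Rubix\ML\Transformers namespace; being supervised, it needs a Labeled dataset to fit against:

use Rubix\ML\Transformers\DeltaTfIdfTransformer;

// Supervised: apply to a Labeled dataset so per-class term counts are available.
$dataset->apply(new DeltaTfIdfTransformer());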

Preliminary tests using the Sentiment example and the new Word Stemmer as the base tokenizer for N-Gram show no noticeable improvement in accuracy or training speed; however, your mileage may vary. Let me know how it works for you.

With that, we now have a standard way to introduce experimental features in Rubix ML. Feel free to suggest features or contribute to the development of the project if you are so willing.