RubixML / ML

A high-level machine learning and deep learning library for the PHP language.
https://rubixml.com
MIT License

Memory & compute optimizations for sparse arrays #189

Open aiaf opened 3 years ago

aiaf commented 3 years ago

First of all, thank you for Rubix ML. It's an amazing piece of work!

I'm analyzing a bunch of short articles (avg. token count: 8 per sample) for similarity/identicalness using TF-IDF, the L1/L2 normalizers, then Dot Product. I have around 12,000 pre-processed samples (min document frequency = 2, plus domain stopwords) that produce 27,000 vocabulary items, so I end up with a huge, very sparse samples array (12,000 samples × 27,000 features).

The problem starts with WordCountVectorizer, where peak memory usage balloons to 12GB because of the dense template ($template = array_fill(0, count($vocabulary), 0)), and then to 24GB after L1Normalizer->transform. My code is as follows:

use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Transformers\TfIdfTransformer;
use Rubix\ML\Transformers\L1Normalizer;
use Rubix\ML\Transformers\L2Normalizer;

// Format a byte count as a human-readable string.
function formatBytes($bytes, $precision = 2) {
    $units = array('B', 'KB', 'MB', 'GB', 'TB');

    $bytes = max($bytes, 0);
    $pow = floor(($bytes ? log($bytes) : 0) / log(1024));
    $pow = min($pow, count($units) - 1);

    $bytes /= pow(1024, $pow);

    return round($bytes, $precision) . ' ' . $units[$pow];
}

function bench($label, $start = false, $end = false) {
    if ( $start ) {
        $end = $end ?: microtime(TRUE);
    }

    echo "\e[0;36;40m$label: \e[0;33;40m", formatBytes(
        memory_get_peak_usage(false)
    ), "\e[0m", ( $start ? " \e[0;32;40m" . round( ($end-$start), 4 ) . "s \e[0m" : '' ), PHP_EOL;
}

$start0 = microtime(TRUE);
$rawSamples = file('samples.txt');
$dataset = new Unlabeled( $rawSamples );
$numSamples = $dataset->numSamples();
bench('Loaded ' . number_format($numSamples) . ' samples', $start0);

$start = microtime(TRUE);
$vectorizer = new WordCountVectorizer(200_000,
    0/$numSamples, // I've already dropped samples in pre-processing
    1.0);

$dataset->apply($vectorizer);
$dataset->numFeatures = count($vectorizer->vocabularies()[0]); // this part is not in the stock Rubix ML install
bench('WordCountVectorizer (numFeatures: ' . number_format($dataset->numFeatures)  .')', $start);

$start = microtime(TRUE);
$dataset->apply(new L1Normalizer());
bench('L1Normalizer', $start);

$start = microtime(TRUE);
$dataset->apply(new TfIdfTransformer(1.0, false));
bench('TfIdfTransformer', $start);

$start = microtime(TRUE);
$dataset->apply(new L2Normalizer());
bench('L2Normalizer', $start);

bench('Total runtime', $start0);

Results without optimization, on a 2.4 GHz 8-core Intel Core i9 MacBook Pro with 32GB RAM (showing peak memory usage and runtime):

[Screenshot: benchmark output before the optimization]

(Note: Dataset is modified with a new property $numFeatures that stores the number of features independently of the internal representation of $samples[0].)

Now for the optimization. Modify this line: https://github.com/RubixML/ML/blob/b0979df19a4d78b8beaed120dc630675c48978af/src/Transformers/WordCountVectorizer.php#L230

to $template = []; and re-run the code. Result:

[Screenshot: benchmark output after the optimization]

Memory usage is down from 24GB to 20MB, and the total runtime is in mere milliseconds:

  1. WordCountVectorizer->transform runs 283x faster
  2. L1Normalizer->transform runs 8,285x faster
  3. TfIdfTransformer->transform runs 1,636x faster
  4. L2Normalizer->transform runs 1,833x faster
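
To make the gain concrete outside of the library, here is a standalone sketch (made-up offsets and counts, not Rubix ML code) contrasting the dense template with a sparse associative-array representation of the same sample:

// One short document against a 27,000-word vocabulary.
$vocabularySize = 27000;
$tokenCounts = [5 => 2, 312 => 1, 8812 => 1]; // offsets of the few words that actually occur

// Dense: every sample carries a zero for every word in the vocabulary.
$dense = array_fill(0, $vocabularySize, 0);
foreach ($tokenCounts as $offset => $count) {
    $dense[$offset] = $count;
}

// Sparse: keep only the non-zero counts, keyed by their vocabulary offsets.
// PHP associative arrays support arbitrary integer keys natively.
$sparse = $tokenCounts;

echo count($dense), ' vs ', count($sparse), ' stored values per sample', PHP_EOL; // 27000 vs 3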

With other optimizations, Cosine Distance or Dot Product also runs in a couple of minutes instead of hours, and I can actually run this code in production without worrying about costs or resources.
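
The pairwise comparison benefits for the same reason: a dot product over sparse vectors only needs to touch the offsets that are present in both samples. A minimal sketch, assuming each sample is stored as an [offset => value] associative array (sparseDotProduct is a hypothetical helper, not part of Rubix ML):

// Dot product of two sparse vectors stored as [offset => value] arrays.
// Runtime scales with the number of non-zero entries, not the vocabulary size.
function sparseDotProduct(array $a, array $b): float
{
    if (count($b) < count($a)) {
        [$a, $b] = [$b, $a]; // iterate over the smaller vector
    }

    $product = 0.0;

    foreach ($a as $offset => $value) {
        if (isset($b[$offset])) {
            $product += $value * $b[$offset];
        }
    }

    return $product;
}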

My question is, will there be a way forward for efficient sparse arrays? Or at least a baked-in option that doesn't depend on external libraries, but that utilizes the very nature of PHP's associative arrays with their arbitrary integer keys?

andrewdalpino commented 3 years ago

Nice work @aiaf! I'll spend some time looking at your code as soon as I have some free time. I've considered implementing sparse arrays in the context of Tensor (https://github.com/Scien-ide/Tensor); the reason I haven't yet is that I don't have enough free time. Your solution, albeit narrower in scope, seems like it would be simpler and more realistic for the resources we have available.

Have a great day :)

aiaf commented 3 years ago

Oh absolutely, @andrewdalpino. I'm lucky that 0s are meaningless to the problem & transformers I'm using, but I guess things start to get hairy in e.g. Stats::mean(). Rubix ML is huge and the ramifications are wide-reaching. I'm relatively new to ML, so I guess some feasibility questions are in order:

  1. What percentage of problems in ML produce sparse matrices?
  2. Do we need a new data structure, e.g. Sample, that stores information about the data independently of the internal representation, and how would that affect performance/memory vs. packed PHP arrays?
  3. Can we predict the degree of sparsity in our future samples before applying transforms and storing results in memory?
  4. Which classes would need refactoring if this new data structure is to be introduced?

In the meantime I'm going to see if I can get https://github.com/RubixML/Sentiment to run faster/use less memory with the changes I've done.

Thanks.

andrewdalpino commented 3 years ago

Well done @aiaf, fingers crossed we can spin off some optimizations from your research.

What percentage of problems in ML produce sparse matrices?

You'll find sparsity a lot in bag-of-words and one-hot feature representations. I'm sure there are other representations that are prone to sparsity as well. I don't know the proportion relative to other problems, though ... many NLP tasks are starting to use dense embeddings nowadays.

Do we need a new data structure e.g. Sample that stores information about the data independently from the internal representation, and how will that affect performance/memory vs packed PHP arrays?

I think you're on the right track here. We need some type of data structure (ex. SparseArray) that exposes an API like a normal array but with sparsity handled under the hood. Normally, such a structure would store the non-zero values of the array and their offsets.
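
A minimal sketch of what such a structure could look like in plain PHP 8 (hypothetical, not part of Rubix ML or Tensor): an array-like object that stores only the non-zero values keyed by their offsets and reads missing offsets as zero.

final class SparseArray implements ArrayAccess, Countable, IteratorAggregate
{
    /** @var array<int, int|float> Non-zero values keyed by offset. */
    private array $values = [];

    public function __construct(private int $size)
    {
    }

    public function offsetExists(mixed $offset): bool
    {
        return $offset >= 0 && $offset < $this->size;
    }

    public function offsetGet(mixed $offset): int|float
    {
        return $this->values[$offset] ?? 0; // missing offsets read as zero
    }

    public function offsetSet(mixed $offset, mixed $value): void
    {
        if ($value == 0) {
            unset($this->values[$offset]); // keep the storage sparse
        } else {
            $this->values[$offset] = $value;
        }
    }

    public function offsetUnset(mixed $offset): void
    {
        unset($this->values[$offset]);
    }

    public function count(): int
    {
        return $this->size; // logical length, not the number of stored entries
    }

    public function getIterator(): Traversable
    {
        yield from $this->values; // iterate over non-zero entries only
    }
}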

Can we predict the degree of sparsity in our future samples before applying transforms and storing results in memory?

Sparsity, in the context of bag-of-words, occurs when you have a large vocabulary and documents that only use a small subset of those words. I'm not sure how you would predict the degree of sparsity though.

Which classes would need refactoring if this new data structure is to be introduced?

We can maybe start small by using the data structure internally in some modules that tend to deal with sparsity (such as token counting). Having said that, it could potentially reach every aspect of the library. Looking at Scikit-learn and the SciPy sparse array implementation, it appears that a great number of modules in Scikit-learn can use sparse arrays under the hood.

If this is something that interests you, I'd encourage you to dive further. We also have a Telegram channel that you may be interested in joining: https://t.me/RubixML

Here's some more info

https://machinelearningmastery.com/sparse-matrices-for-machine-learning/

aiaf commented 3 years ago

Thanks for your answers and your time.

Just spent a couple of hours trying my hand at a SparseArray data structure that inherits directly from \SplFixedArray. It works well enough for my problem above, but the hole goes way deeper than my bandwidth allows at the moment: array type hints need to be dropped from all over the code or replaced by PHPDoc or PHP 8 union types (SparseArray | array), and array_map/array_* functions no longer work and must be refactored into foreach() and so on, or the code gets littered with instanceof checks. Perhaps introducing a general Rubix\ML\Array type (that SparseArray inherits from) is a better fit; that way array_map/merge/unique/walk/sum/etc. can be refactored into methods, and the code would work everywhere regardless of how our "arrays" are implemented.
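
A tiny hypothetical example of the kind of refactor that implies: array_map() only accepts real arrays, so every call site has to branch or fall back to a manual loop once a SparseArray (whatever sparse container is ultimately chosen) can appear in place of an array.

// Hypothetical call site after introducing the union type.
function scale(SparseArray|array $values, float $factor): array
{
    if (is_array($values)) {
        return array_map(fn ($v) => $v * $factor, $values);
    }

    // SparseArray (or any Traversable): loop manually instead.
    $out = [];

    foreach ($values as $offset => $value) {
        $out[$offset] = $value * $factor;
    }

    return $out;
}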

Will keep you posted.