0xdevalias opened this issue 3 years ago
:wave: Hi,
I stumbled onto this thread from https://github.com/jekyll/classifier-reborn/issues/193.
A few notes that you might find helpful:
- You're not noticing any difference in build times with the `--lsi` option because your site (as it is today in this repo) doesn't use related posts, so the `--lsi` option does nothing. To use LSI, you need to call `site.related_posts` somewhere in a Liquid template. For example, you might add something like the following to `_layouts/post.html`:

  ```liquid
  {% for post in site.related_posts limit:3 %}
    <p>{{ post.title }}</p>
  {% endfor %}
  ```
- When you call `site.related_posts`, if you don't pass the `--lsi` option, it's just recent posts.
- If you are using `site.related_posts` and you pass the `--lsi` option, you'll see `Populating LSI...` in your `jekyll build --lsi` output. The build will be slow unless you have the gsl gem and the native GSL library installed. I haven't experimented with nmatrix or narray at all, but simply using the gsl gem results in a ~500x speed increase for my use.

Hope that helps. I appreciated some of your comments on some of the libraries, so I thought I'd share some notes with you!
Originally posted by @mkasberg in https://github.com/0xdevalias/devalias.net/issues/83#issuecomment-846468147
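As a rough illustration of the gsl route described above, a Jekyll site's `Gemfile` might look something like the sketch below. (The gem selection and comments are illustrative; the `gsl` gem wraps the native GSL library, which must be installed separately.)

```ruby
# Gemfile (sketch): opting into the GSL-accelerated LSI path.
# The gsl gem needs the native GSL library installed first
# (e.g. `apt-get install libgsl-dev` on Ubuntu).
source "https://rubygems.org"

gem "jekyll"
gem "classifier-reborn" # powers `site.related_posts` when built with --lsi
gem "gsl"               # optional: large LSI speedup per the note above
```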
`classifier-reborn` has supported an alternative to `gsl` since `v2.3.0`, which might be a good option to switch to here:
The referenced issue links from the `Gemfile`:
- https://github.com/0xdevalias/devalias.net/issues/83
- https://github.com/jekyll/classifier-reborn/issues/192
- https://github.com/jekyll/classifier-reborn/pull/198
- https://github.com/jekyll/classifier-reborn/releases/tag/v2.3.0
Support Numo Gem for performing SVD
- https://jekyll.github.io/classifier-reborn/#dependencies
It is recommended to install either Numo or GSL to speed up LSI classification by at least 10x.
Numo is a set of Numerical Module gems for Ruby that provide a Ruby interface to LAPACK. If classifier detects that the required Numo gems are installed, it will make use of them to perform LSI faster.
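One quick way to sanity-check the Numo side is to test whether the gems load in your Ruby environment, since that is the precondition for classifier-reborn to use them. A minimal sketch (pure Ruby; the messages are illustrative):

```ruby
# Sketch: check whether the Numo gems are loadable. classifier-reborn
# only uses the fast Numo SVD path if these requires succeed.
numo_available =
  begin
    require "numo/narray"
    require "numo/linalg"
    true
  rescue LoadError
    false
  end

puts numo_available ? "Numo found: fast SVD path" : "Numo missing: slow pure-Ruby fallback"
```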
- Install LAPACKE
  - Ubuntu: `apt-get install liblapacke-dev`
  - macOS: `brew install lapack`
- Install OpenBLAS
  - Ubuntu: `apt-get install libopenblas-dev`
  - macOS: `brew install openblas`
- Install the Numo::NArray and Numo::Linalg gems. If you're using Bundler, add `numo-narray` and `numo-linalg` to your Gemfile. (If using Bundler on macOS, you should set the build config like `bundle config set --global build.numo-linalg --with-openblas-dir=$(brew --prefix openblas) --with-lapack-lib="$(brew --prefix lapack)/lib"`.)
  - Ubuntu: `gem install numo-narray numo-linalg`
  - macOS: `gem install numo-narray`, then `gem install numo-linalg -- --with-openblas-dir=$(brew --prefix openblas) --with-lapack-lib="$(brew --prefix lapack)/lib"`
- https://github.com/mkasberg/classifier-reborn/pull/6
- https://github.com/SciRuby/rb-gsl/issues/63
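Pulling the Bundler side of the above together, a sketch of the relevant `Gemfile` entries (gem names as given in the quoted docs; everything else is illustrative):

```ruby
# Gemfile (sketch): Numo-accelerated LSI for classifier-reborn.
# LAPACKE and OpenBLAS must be installed natively first (see the
# apt-get/brew steps above).
source "https://rubygems.org"

gem "jekyll"
gem "classifier-reborn"
gem "numo-narray"
gem "numo-linalg"
```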
Originally posted by @0xdevalias in https://github.com/0xdevalias/devalias.net/issues/20#issuecomment-2179730548
The following posts by @mkasberg are also worth reading/considering before going too deep with this:
- How I Updated jekyll/classifier-reborn for Ruby 3 (12 Jul 2022)
- Better Related Posts in Jekyll Using AI (23 Apr 2024)
Generate Jekyll related posts with AI
When the plugin is installed and configured, it will populate an `ai_related_posts` key in the post data for all posts.
The first time the plugin runs, it will fetch embeddings for all your posts. Based on some light testing, this took me 0.5 sec per post, or about 50 sec for a blog with 100 posts. All subsequent runs will be faster since embeddings will be cached.
On an example blog with ~100 posts, this plugin produces more accurate results than classifier-reborn (LSI) in about the same amount of time.
The API costs to use this plugin with OpenAI's API are minimal. I ran this plugin for all 84 posts on mikekasberg.com for $0.00 in API fees (1,277 tokens on the `text-embedding-3-small` model). (Your results may vary, but should remain inexpensive.)
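The plugin's internals aside, the core idea behind embedding-based related posts can be sketched in a few lines of plain Ruby: rank the other posts by the cosine similarity of their embedding vectors to the current post's vector. (The post titles and 3-dimensional vectors below are made up for illustration; real OpenAI embeddings have ~1536 dimensions.)

```ruby
# Sketch: picking the most "related" post via cosine similarity
# of embedding vectors. Data below is toy data, not real embeddings.

def cosine_similarity(a, b)
  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

embeddings = {
  "Intro to Jekyll"  => [0.9, 0.1, 0.0],
  "Jekyll plugins"   => [0.8, 0.2, 0.1],
  "Sourdough baking" => [0.0, 0.1, 0.9],
}

query = embeddings["Intro to Jekyll"]
related = embeddings
  .reject { |title, _| title == "Intro to Jekyll" }
  .max_by { |_, vec| cosine_similarity(query, vec) }

puts related.first # => "Jekyll plugins" (most similar vector wins)
```

Caching the fetched embeddings (as the plugin does) matters because the vectors only change when a post's text changes, while the similarity ranking is cheap to recompute on every build.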
Having ChatGPT explain the differences between using LSI and embeddings for this purpose:
Latent Semantic Indexing (LSI)
- Method: Uses Singular Value Decomposition (SVD) on term-document matrices.
- Representation: Lower-dimensional space capturing latent semantic structures.
- Applications: Information retrieval, document clustering, text summarization.
- Advantages: Handles synonymy, less computationally intensive.
- Limitations: Limited in capturing complex linguistic phenomena, performance depends on the corpus.
Embeddings (e.g., OpenAI embeddings)
- Method: Uses deep learning models like transformers.
- Representation: Dense vectors capturing semantic meaning, context, and relationships.
- Applications: Sentiment analysis, text classification, named entity recognition, question answering.
- Advantages: Captures complex linguistic phenomena, state-of-the-art performance, versatile.
- Limitations: Computationally intensive, requires significant resources, may need fine-tuning.
Summary
- LSI is simpler and effective for basic tasks but less nuanced.
- Embeddings provide richer, context-aware representations and superior performance on a wide range of tasks but require more computational power.
This should be done after https://github.com/0xdevalias/devalias.net/pull/86 is merged.
Originally posted by @0xdevalias in https://github.com/0xdevalias/devalias.net/pull/86#issuecomment-663932727