0xdevalias opened this issue 3 years ago
:wave: Hi,
I stumbled onto this thread from https://github.com/jekyll/classifier-reborn/issues/193.
A few notes that you might find helpful:
- You're not noticing any difference in build times with the `--lsi` option because your site (as it is today in this repo) doesn't use related posts, so the `--lsi` option does nothing. To use LSI, you need to call `site.related_posts` somewhere in a Liquid template. For example, you might add something like the following to `_layouts/post.html`:

  ```liquid
  {% for post in site.related_posts limit:3 %}
    <p>{{ post.title }}</p>
  {% endfor %}
  ```
- When you call `site.related_posts`, if you don't pass the `--lsi` option, it's just recent posts.
- If you are using `site.related_posts` and you pass the `--lsi` option, you'll see `Populating LSI...` in your `jekyll build --lsi` output. The build will be slow unless you have the gsl gem and the native GSL library installed. I haven't experimented with nmatrix or narray at all, but simply using the gsl gem results in a ~500x speed increase for my use.

Hope that helps. I appreciated some of your comments on some of the libraries, so I thought I'd share some notes with you!
Originally posted by @mkasberg in https://github.com/0xdevalias/devalias.net/issues/83#issuecomment-846468147
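As a rough illustration of the gsl route described above, a Jekyll site's `Gemfile` might look something like the sketch below. (The gem selection and comments are illustrative; the `gsl` gem wraps the native GSL library, which must be installed separately.)

```ruby
# Gemfile (sketch): opting into the GSL-accelerated LSI path.
# The gsl gem needs the native GSL library installed first
# (e.g. `apt-get install libgsl-dev` on Ubuntu).
source "https://rubygems.org"

gem "jekyll"
gem "classifier-reborn" # powers `site.related_posts` when built with --lsi
gem "gsl"               # optional: large LSI speedup per the note above
```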
`classifier-reborn` has supported an alternative to `gsl` since `v2.3.0`, which might be a good option to switch to here:
The referenced issue links from the `Gemfile`:
- https://github.com/0xdevalias/devalias.net/issues/83
- https://github.com/jekyll/classifier-reborn/issues/192
- https://github.com/jekyll/classifier-reborn/pull/198
- https://github.com/jekyll/classifier-reborn/releases/tag/v2.3.0
Support Numo Gem for performing SVD
- https://jekyll.github.io/classifier-reborn/#dependencies
It is recommended to install either Numo or GSL to speed up LSI classification by at least 10x.
Numo is a set of Numerical Module gems for Ruby that provide a Ruby interface to LAPACK. If classifier detects that the required Numo gems are installed, it will make use of them to perform LSI faster.
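One quick way to sanity-check the Numo side is to test whether the gems load in your Ruby environment, since that is the precondition for classifier-reborn to use them. A minimal sketch (pure Ruby; the messages are illustrative):

```ruby
# Sketch: check whether the Numo gems are loadable. classifier-reborn
# only uses the fast Numo SVD path if these requires succeed.
numo_available =
  begin
    require "numo/narray"
    require "numo/linalg"
    true
  rescue LoadError
    false
  end

puts numo_available ? "Numo found: fast SVD path" : "Numo missing: slow pure-Ruby fallback"
```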
- Install LAPACKE
  - Ubuntu: `apt-get install liblapacke-dev`
  - macOS: `brew install lapack`
- Install OpenBLAS
  - Ubuntu: `apt-get install libopenblas-dev`
  - macOS: `brew install openblas`
- Install the Numo::NArray and Numo::Linalg gems. If you're using Bundler, add `numo-narray` and `numo-linalg` to your Gemfile. (If using Bundler on macOS, you should set the build config like `bundle config set --global build.numo-linalg --with-openblas-dir=$(brew --prefix openblas) --with-lapack-lib="$(brew --prefix lapack)/lib"`.)
  - Ubuntu: `gem install numo-narray numo-linalg`
  - macOS: `gem install numo-narray`, then `gem install numo-linalg -- --with-openblas-dir=$(brew --prefix openblas) --with-lapack-lib="$(brew --prefix lapack)/lib"`
- https://github.com/mkasberg/classifier-reborn/pull/6
- https://github.com/SciRuby/rb-gsl/issues/63
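Pulling the Bundler side of the above together, a sketch of the relevant `Gemfile` entries (gem names as given in the quoted docs; everything else is illustrative):

```ruby
# Gemfile (sketch): Numo-accelerated LSI for classifier-reborn.
# LAPACKE and OpenBLAS must be installed natively first (see the
# apt-get/brew steps above).
source "https://rubygems.org"

gem "jekyll"
gem "classifier-reborn"
gem "numo-narray"
gem "numo-linalg"
```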
Originally posted by @0xdevalias in https://github.com/0xdevalias/devalias.net/issues/20#issuecomment-2179730548
The following posts by @mkasberg are also worth reading/considering before going too deep with this:
- How I Updated jekyll/classifier-reborn for Ruby 3 (12 Jul 2022)
- Better Related Posts in Jekyll Using AI (23 Apr 2024)
Generate Jekyll related posts with AI
When the plugin is installed and configured, it will populate an `ai_related_posts` key in the post data for all posts.
The first time the plugin runs, it will fetch embeddings for all your posts. Based on some light testing, this took me 0.5 sec per post, or about 50 sec for a blog with 100 posts. All subsequent runs will be faster since embeddings will be cached.
On an example blog with ~100 posts, this plugin produces more accurate results than classifier-reborn (LSI) in about the same amount of time.
The API costs to use this plugin with OpenAI's API are minimal. I ran this plugin for all 84 posts on mikekasberg.com for $0.00 in API fees (1,277 tokens on the `text-embedding-3-small` model). (Your results may vary, but should remain inexpensive.)
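The plugin's internals aside, the core idea behind embedding-based related posts can be sketched in a few lines of plain Ruby: rank the other posts by the cosine similarity of their embedding vectors to the current post's vector. (The post titles and 3-dimensional vectors below are made up for illustration; real OpenAI embeddings have ~1536 dimensions.)

```ruby
# Sketch: picking the most "related" post via cosine similarity
# of embedding vectors. Data below is toy data, not real embeddings.

def cosine_similarity(a, b)
  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

embeddings = {
  "Intro to Jekyll"  => [0.9, 0.1, 0.0],
  "Jekyll plugins"   => [0.8, 0.2, 0.1],
  "Sourdough baking" => [0.0, 0.1, 0.9],
}

query = embeddings["Intro to Jekyll"]
related = embeddings
  .reject { |title, _| title == "Intro to Jekyll" }
  .max_by { |_, vec| cosine_similarity(query, vec) }

puts related.first # => "Jekyll plugins" (most similar vector wins)
```

Caching the fetched embeddings (as the plugin does) matters because the vectors only change when a post's text changes, while the similarity ranking is cheap to recompute on every build.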
Having ChatGPT explain the differences between using LSI and embeddings for this purpose:
Latent Semantic Indexing (LSI)
- Method: Uses Singular Value Decomposition (SVD) on term-document matrices.
- Representation: Lower-dimensional space capturing latent semantic structures.
- Applications: Information retrieval, document clustering, text summarization.
- Advantages: Handles synonymy, less computationally intensive.
- Limitations: Limited in capturing complex linguistic phenomena, performance depends on the corpus.
Embeddings (e.g., OpenAI embeddings)
- Method: Uses deep learning models like transformers.
- Representation: Dense vectors capturing semantic meaning, context, and relationships.
- Applications: Sentiment analysis, text classification, named entity recognition, question answering.
- Advantages: Captures complex linguistic phenomena, state-of-the-art performance, versatile.
- Limitations: Computationally intensive, requires significant resources, may need fine-tuning.
Summary
- LSI is simpler and effective for basic tasks but less nuanced.
- Embeddings provide richer, context-aware representations and superior performance on a wide range of tasks but require more computational power.
This should be done after https://github.com/0xdevalias/devalias.net/pull/86 is merged.
Originally posted by @0xdevalias in https://github.com/0xdevalias/devalias.net/pull/86#issuecomment-663932727