0xdevalias / devalias.net

Source for devalias.net
http://www.devalias.net/

Remove gsl/nmatrix dependencies from Gemfile and build workflow #87

Open 0xdevalias opened 3 years ago

0xdevalias commented 3 years ago

This should be done after https://github.com/0xdevalias/devalias.net/pull/86 is merged.

Even with Jekyll's --lsi option removed, the site build still only seemed to take ~7 seconds.

So maybe --lsi either isn't working, or just doesn't have a big impact on our site build time.

In light of this, I'm thinking we can leave the --lsi option enabled for now, but we can probably remove the gsl/nmatrix optimisations we had added. We should probably do that in a follow-up PR, though.
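For concreteness, a hedged sketch of what the follow-up Gemfile change might look like (the exact gem lines and their placement in this repo may differ):

```ruby
# Hedged sketch of the proposed Gemfile cleanup. Deleting these lines
# keeps `--lsi` itself enabled in the build; it just drops the native
# linear-algebra speed-ups that classifier-reborn can use.

# gem "gsl"      # was: native GSL bindings used by classifier-reborn's LSI
# gem "nmatrix"  # was: alternative fast linear-algebra backend
```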

This also most likely renders https://github.com/0xdevalias/devalias.net/issues/83 irrelevant.

Originally posted by @0xdevalias in https://github.com/0xdevalias/devalias.net/pull/86#issuecomment-663932727

0xdevalias commented 1 year ago

See also:

0xdevalias commented 1 year ago

:wave: Hi,

I stumbled onto this thread from https://github.com/jekyll/classifier-reborn/issues/193.

A few notes that you might find helpful:

  • You're not noticing any difference in build times with the --lsi option because your site (as it is today in this repo) doesn't use related posts (so the --lsi option does nothing). To use LSI, you need to call site.related_posts somewhere in a Liquid template. For example, you might add something like the following to _layouts/post.html:
    {% for post in site.related_posts limit:3 %}
    <p>{{ post.title }}</p>
    {% endfor %}
  • When you call site.related_posts without the --lsi option, it simply returns the most recent posts.
  • If you are using site.related_posts and you pass the --lsi option, you'll see Populating LSI... in your jekyll build --lsi output. The build will be slow unless you have the gsl gem and the native GSL library installed. I haven't experimented with nmatrix or narray at all, but simply using the gsl gem results in a ~500x speed increase for my use case.

Hope that helps. I appreciated some of your comments on these libraries, so I thought I'd share some notes with you!

Originally posted by @mkasberg in https://github.com/0xdevalias/devalias.net/issues/83#issuecomment-846468147

0xdevalias commented 2 weeks ago

classifier-reborn has supported an alternative to gsl since v2.3.0, which might be worth switching to here:

The issue links referenced in the Gemfile:

Originally posted by @0xdevalias in https://github.com/0xdevalias/devalias.net/issues/20#issuecomment-2179730548

The following posts by @mkasberg are also worth reading/considering before going too deep with this:

Having ChatGPT explain the differences between using LSI and embeddings for this purpose:

Latent Semantic Indexing (LSI)

  • Method: Uses Singular Value Decomposition (SVD) on term-document matrices.
  • Representation: Lower-dimensional space capturing latent semantic structures.
  • Applications: Information retrieval, document clustering, text summarization.
  • Advantages: Handles synonymy, less computationally intensive.
  • Limitations: Limited in capturing complex linguistic phenomena, performance depends on the corpus.
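The SVD step described above can be sketched in plain Ruby with the stdlib matrix library: the right-singular vectors of the term-document matrix A are the eigenvectors of the Gram matrix A^T A, and related documents end up close together in the reduced k-dimensional space. The tiny matrix and term names below are invented for illustration; real LSI (e.g. classifier-reborn's) also applies weighting and works at much larger scale.

```ruby
require 'matrix'

# Tiny term-document matrix A: rows = terms, columns = posts.
# Terms: "ruby", "gem", "engine", "car"; posts 0 and 1 share vocabulary.
a = Matrix[
  [1, 1, 0], # "ruby"
  [1, 1, 0], # "gem"
  [0, 1, 1], # "engine"
  [0, 0, 1], # "car"
]

# LSI takes the SVD of A and keeps only the top k singular directions.
# The right-singular vectors of A are the eigenvectors of G = A^T A,
# so this sketch derives the reduced document coordinates from G.
g = a.transpose * a
eigen = g.eigensystem
vals  = eigen.eigenvalues
vecs  = eigen.eigenvector_matrix

# Pick the k largest eigenvalues (sorted explicitly, since the solver's
# ordering is not guaranteed) and scale each axis by its singular value.
k = 2
top = vals.each_index.sort_by { |j| -vals[j] }.first(k)
docs = (0...g.row_count).map do |i|
  Vector[*top.map { |j| vecs[i, j] * Math.sqrt(vals[j]) }]
end

cosine = ->(x, y) { x.inner_product(y) / (x.norm * y.norm) }

# Posts 0 and 1 come out strongly related; post 2 does not.
puts cosine.call(docs[0], docs[1]) # high
puts cosine.call(docs[0], docs[2]) # low
```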

Embeddings (e.g., OpenAI embeddings)

  • Method: Uses deep learning models like transformers.
  • Representation: Dense vectors capturing semantic meaning, context, and relationships.
  • Applications: Sentiment analysis, text classification, named entity recognition, question answering.
  • Advantages: Captures complex linguistic phenomena, state-of-the-art performance, versatile.
  • Limitations: Computationally intensive, requires significant resources, may need fine-tuning.
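However the dense vectors are obtained (a real setup would fetch them from an embeddings API or model; the short vectors below are invented for illustration), relating two posts then reduces to cosine similarity between their embeddings:

```ruby
# Made-up embedding vectors for three hypothetical posts. Real
# embeddings would be high-dimensional and come from a model/API.
post_a = [0.12, -0.48, 0.31, 0.80]
post_b = [0.10, -0.52, 0.35, 0.75] # similar direction to post_a
post_c = [-0.70, 0.22, -0.15, 0.05]

# Cosine similarity: dot product over the product of magnitudes.
def cosine(x, y)
  dot = x.zip(y).sum { |a, b| a * b }
  dot / (Math.sqrt(x.sum { |v| v * v }) * Math.sqrt(y.sum { |v| v * v }))
end

puts cosine(post_a, post_b) # close to 1.0: related
puts cosine(post_a, post_c) # low/negative: unrelated
```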

Summary

  • LSI is simpler and effective for basic tasks but less nuanced.
  • Embeddings provide richer, context-aware representations and superior performance on a wide range of tasks but require more computational power.