Need a more performant way to bulk generate embeddings for terms

dkotter commented 2 months ago

Describe the bug

In v2.2.0 of ClassifAI we added the ability to classify content within your own terms using OpenAI Embeddings. In order for this to work, we need embedding data to be generated for each term and for the post we are comparing those terms to.

The post embedding data is generated on the fly when the comparison is triggered, but we don't want to do that for terms as we may have hundreds or thousands. So these are generated in bulk when the feature is first set up. This has always been a known limitation: with lots of terms, this process will probably run into timeouts, memory issues, or OpenAI rate limits.

In #758, we are making some changes to how OpenAI Embeddings work but have not yet fixed this issue, so ideally it gets fixed and shipped in the same release (as these changes require all embeddings to be regenerated).

There are two issues I'm currently aware of:

1. We generate these embeddings when the settings are saved. But we only generate embeddings for taxonomies that are turned on. So the first time you save, the taxonomy settings aren't saved yet so we don't run anything. You have to save again for things to work.
2. We generate embeddings for each term that doesn't currently have an embedding saved during this process (which again, fires when the settings are saved). For sites with 1000+ terms, this will almost certainly lead to timeouts or memory issues. Sites with far fewer terms will probably run into OpenAI rate limits.
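
To illustrate one possible reading of the first issue, here is a minimal Python sketch of a save handler that reads the enabled taxonomies from stored settings before the new values are persisted. This is a hypothetical model, not ClassifAI's actual code; `stored_settings`, `generate_embeddings`, `on_save_buggy` and `on_save_fixed` are all made-up names.

```python
# Hypothetical model of the save flow; names are illustrative, not ClassifAI's.
stored_settings = {"taxonomies": []}  # what is persisted before the first save


def generate_embeddings(taxonomies):
    """Stand-in for the bulk generation step; returns what it would process."""
    return list(taxonomies)


def on_save_buggy(new_settings):
    # Reads the persisted settings, which still hold the old (empty) value.
    processed = generate_embeddings(stored_settings["taxonomies"])
    stored_settings.update(new_settings)  # persisted only afterwards
    return processed


def on_save_fixed(new_settings):
    # Reads the values being saved, so the first save already sees them.
    processed = generate_embeddings(new_settings["taxonomies"])
    stored_settings.update(new_settings)
    return processed
```

In this model the buggy handler processes nothing on the first save and only picks up the taxonomy on the second, which matches the behaviour in the steps to reproduce.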

Ideally we would introduce some sort of queue management system to address this, and make it general enough that it can be used by other features that may come in the future. There are tools out there we could look to use, like Action Scheduler or Cavalcade, but we may be fine just building a lightweight system on top of the scheduled event system in WordPress.
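
As a rough illustration of the idea (not a proposal for the actual implementation), a lightweight queue could process a bounded batch of terms per run and report whether more work remains, so the caller can schedule the next run. The sketch below is Python for brevity; `BatchQueue`, `run_once` and the batch size are assumptions, and a real version would persist the pending IDs between runs rather than hold them in memory.

```python
from collections import deque


class BatchQueue:
    """Toy in-memory queue; a real one would persist state between runs."""

    def __init__(self, batch_size=50):
        self.batch_size = batch_size
        self.pending = deque()

    def enqueue(self, term_ids):
        """Add term IDs that still need an embedding."""
        self.pending.extend(term_ids)

    def run_once(self, handler):
        """Process at most one batch; return True if more work remains."""
        count = min(self.batch_size, len(self.pending))
        batch = [self.pending.popleft() for _ in range(count)]
        if batch:
            handler(batch)  # e.g. request embeddings for this batch only
        return bool(self.pending)
```

Each scheduled event would call `run_once` and reschedule itself only while it returns `True`. That keeps memory use and API calls per run bounded by the batch size.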

Steps to Reproduce

1. Set up the Classification Feature with OpenAI Embeddings as the Provider
2. Turn on at least one taxonomy and hit save
3. Notice that no embeddings are actually generated
4. Hit save again and notice the embeddings get generated

You can also generate 1000+ terms and try running this process again, though note this will cost money since it makes API requests. I've tested locally using an embeddings model run through Ollama, and at around 1000 terms I run into memory issues.

Sidsector9 commented 1 month ago

I've investigated both Action Scheduler and Cavalcade and found that the latter requires disabling WP-Cron. For this reason I think Action Scheduler is a more reasonable candidate.

I have a branch with Action Scheduler implemented, but I'm hitting some PHP memory exhaustion errors. They are intermittent, and I suspect they come from scheduling jobs inside the for() loop. I'll fix that and push the branch this week.

dkotter commented 1 month ago

@Sidsector9 Worth noting that on a different (private) project, @iamdharmesh implemented https://github.com/deliciousbrains/wp-background-processing to solve this, so that's another tool we can look into. I know he compared that to Action Scheduler and had a few reasons why he decided to use that one, so may be worth talking to him

Sidsector9 commented 2 weeks ago

@dkotter Dharmesh and I discussed this last week and concluded that either is a good choice, as each has its pros and cons.

I decided to go ahead with Action Scheduler to align with Woo's decision to migrate all background-process-related jobs to AS. Related: https://github.com/woocommerce/woocommerce/issues/44246
