getgrav / grav-premium-issues

Official Grav Premium Issues repository to report problems or ask questions regarding the Premium products offered.
https://getgrav.org/premium

[algolia-pro] Algolia index not updating with deleted records #359

Open guilhermeeric opened 1 year ago

guilhermeeric commented 1 year ago

When I delete a record, Algolia indexing doesn't take that into account and still lists the deleted record when I search for it. Clicking "Reindex Now" or "Reset Index" doesn't seem to do anything. The only way to get rid of deleted records appearing in search right now is to clear the index directly in the Algolia UI. Is there something I am missing?

Currently on:

Grav v1.7.39.4
Admin v1.10.39
Algolia v1.0.8

Steps to reproduce:

1. Create a new document
2. See that the new document appears in search
3. Delete the created document
4. Reindex Algolia
5. See that the deleted document still appears in search results

rhukster commented 1 year ago

Are you deleting the page in the Grav admin? Does "Reset Index" (the red button) in the admin delete the entry as expected?

rhukster commented 1 year ago

Please try with the latest versions; some improvements have been made that might impact this.

thekenshow commented 1 month ago

I'm having a similar issue, although my site setup includes Gantry, so I'm indexing via CLI with:

bin/plugin algolia-pro index --url="https://mysite.com/sitemap.json"

What I want is to run a weekly indexing job that ensures no pages remain in the index after they're unpublished on mysite.com.

I have Smart Indexing disabled, IIRC again because of Gantry. My understanding is that Smart Indexing only optimizes the number of API calls, so disabling it could raise costs but would not cause unpublished results to linger in the index. Is that correct?

Example: My mysite.com site was returning a result for the string "5a241608b220bf0afc28e8a1ce0907b6", which was from a Flex object URL that was unpublished.

I ran bin/plugin algolia-pro index --url="https://mysite.com/sitemap.json", but the "5a241608b220bf0afc28e8a1ce0907b6" result persisted.

I logged into Algolia and searched the index directly, confirming that "5a241608b220bf0afc28e8a1ce0907b6" was still indexed.
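(Aside: a direct query against Algolia's REST search endpoint is another way to confirm what the index contains. The application ID, search-only API key, and index name below are placeholders, not my real values:)

curl -s "https://MY_APP_ID-dsn.algolia.net/1/indexes/MY_INDEX/query" \
  -H "X-Algolia-Application-Id: MY_APP_ID" \
  -H "X-Algolia-API-Key: MY_SEARCH_ONLY_KEY" \
  -d '{"params": "query=5a241608b220bf0afc28e8a1ce0907b6"}'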

I tried the following:

bin/plugin algolia-pro index --flush --url="https://mysite.com/sitemap.json"

But there was no change in the indexed result.

Finally, I logged into Algolia, cleared the index, and then ran:

bin/plugin algolia-pro index --url="https://mysite.com/sitemap.json"

The indexing appeared to complete successfully, but nothing was sent to Algolia. I tried two more times; still nothing showed up in Algolia.

Finally, I included --flush and the index was restored on Algolia.

Two questions about --flush:

  1. Why wouldn't it clear an existing Algolia index?
  2. Why would it be required to update a cleared index?

The main question is still: How do I use the CLI to generate a fresh index once a week with no outdated pages in the index?

rhukster commented 1 month ago

in regards to smart indexing...

algolia-pro keeps track of the 'chunks' of pages that are indexed. Every page is chopped up into chunks because Algolia has a strict limit on the size of any item it can index. If the content on a particular page/URL is very small, it might only take one chunk, but a regular-sized article will typically take several. The plugin keeps a hash of each chunk, so if it thinks a particular chunk is already indexed because it matches the existing hash exactly, it won't send that chunk to replace the current one. That's it really.
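To illustrate the idea (a conceptual sketch only, not algolia-pro's actual code; the cache directory and chunk file name are made up), the skip logic boils down to comparing each chunk's hash against the hash recorded on the previous run:

# Sketch of the smart-indexing skip check -- not the plugin's real code.
mkdir -p .index-cache
chunk_content="...text of one page chunk..."
new_hash=$(printf '%s' "$chunk_content" | sha256sum | cut -d' ' -f1)
old_hash=$(cat .index-cache/chunk-001.hash 2>/dev/null)
if [ "$new_hash" = "$old_hash" ]; then
    echo "chunk unchanged, skip the Algolia API call"
else
    echo "chunk changed, send it to Algolia and record the new hash"
    printf '%s' "$new_hash" > .index-cache/chunk-001.hash
fi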

Now, deleting pages is another issue. When you remove a page in the admin, algolia-pro knows it's a delete and sends a call to Algolia to remove all indexed chunks of that page. If you don't use the admin and simply delete a page, Algolia has no clue and continues to assume the page still exists. Even if you reindex, that page won't be removed, because reindexing only adds to the index. The way to handle that is to "flush" the index with the -f option:

➜ bin/plugin algolia-pro help index
Description:
  Algolia Pro Indexer

Usage:
  index [options]

Options:
  -f, --flush            optionally flush the existing search indexes rather than updating
  -r, --raw              Raw unformatted results
  -q, --quiet            Do not output any message
  -u, --url=URL          Optional URL of JSON sitemap (CrawlPageSearch only)
      --route=ROUTE      Optional route of a single specific page to index (GravPageSearch only)
  -x, --indexes=INDEXES  Optional comma-separated list of enabled index configurations to use
  -h, --help             Display this help message
  -V, --version          Display this application version
      --ansi             Force ANSI output
      --no-ansi          Disable ANSI output
  -n, --no-interaction   Do not ask any interactive question
      --env[=ENV]        Use environment configuration (defaults to localhost)
      --lang[=LANG]      Language to be used (defaults to en)
  -v|vv|vvv, --verbose   Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug

Help:
  The index command re-indexes the Algolia search engine

So adding -f should flush/remove the existing index before re-indexing. This will ensure a 'fresh' copy.
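For the weekly fresh index, a cron entry that runs the flush plus reindex is probably the simplest route. The schedule, Grav root path, and log file below are just an example to adapt:

# Hypothetical crontab entry: every Sunday at 03:00, flush and rebuild
# the index from the sitemap. Adjust the Grav root path and URL.
0 3 * * 0 cd /var/www/grav && bin/plugin algolia-pro index --flush --url="https://mysite.com/sitemap.json" >> logs/algolia-index.log 2>&1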

One thing to mention is that sometimes things take a little while to show up in Algolia. I think this is what you are seeing: you are indexing and thinking it's not there, but Algolia has to process everything first. There's somewhere in the Algolia dashboard that shows the state of the indexing.

At first I thought perhaps you had production_mode: false set, because that will not send anything to Algolia and does all the processing locally in Grav only. But since it did show up, I'm sure it was related to the delay.
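For reference, that setting lives in the plugin's YAML config; the path below assumes the standard Grav location for plugin config overrides:

# user/config/plugins/algolia-pro.yaml
production_mode: true   # set to false and nothing is sent to Algolia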

thekenshow commented 1 month ago

Thanks. So my understanding is:

- Smart Indexing only skips chunks whose hashes haven't changed, so disabling it affects API call volume, not whether unpublished pages linger.
- A plain reindex only adds or updates entries; pages removed outside the admin stay in the index unless I run with --flush.
- After any indexing run there can be a delay before the results show up on the Algolia side.

Is that correct?

rhukster commented 1 month ago

Yes, there's a delay after any indexing. The data is on the Algolia side, but it takes some time to actually show up in their systems, mainly because it's a highly distributed search engine and the update has to trickle through to all their nodes.