Algolia publish objectID question

bsubedi26 commented 5 years ago

Hello, is it possible to specify the objectID value in the front matter header of the markdown files being published to Algolia, instead of automatically generating a UUID as the objectID?

One reason, as stated in the Algolia documentation (https://www.algolia.com/doc/guides/indexing/structuring-your-data/#unique-identifier---objectid):

If you don’t provide an objectID, Algolia will generate one automatically. However, it will be easier to remove or update records if you have stored a unique identifier in the objectID attribute.

Another reason is that we're using jekyll-import (https://github.com/jekyll/jekyll-import) to import articles from Drupal into Jekyll, then publishing the Jekyll markdown files to Algolia using this plugin. While we work on removing the Drupal system, we need a way to keep the old articles in sync with the new system. The flow is: Drupal -> Jekyll -> Algolia. The issue is that if an article is updated in Drupal, imported into Jekyll, and then synced to Algolia, there will be duplicates of that article, since the objectID values are different.
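To illustrate the concern, here is a minimal sketch in Python. It assumes a plain dict standing in for an index; `push` and the `article-42` ID are hypothetical names for illustration, not part of the plugin or of any Algolia client.

```python
import uuid

def push(index, record, object_id=None):
    """Store a record under object_id, or under a fresh random UUID if none is given."""
    key = object_id if object_id is not None else str(uuid.uuid4())
    index[key] = record
    return key

# Hypothetical article, imported twice (before and after an edit in Drupal).
v1 = {"title": "Hello", "body": "First draft"}
v2 = {"title": "Hello", "body": "Second draft"}

random_ids = {}
push(random_ids, v1)
push(random_ids, v2)  # new random ID, so the old copy stays behind

stable_ids = {}
push(stable_ids, v1, object_id="article-42")
push(stable_ids, v2, object_id="article-42")  # same ID, so the record is overwritten

print(len(random_ids), len(stable_ids))
```

With random IDs both versions end up in the index, while a stable ID leaves only the latest version.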

What is the current behavior?

Currently, the Algolia plugin automatically generates a UUID for each record. https://github.com/algolia/jekyll-algolia/blob/a5bf8f6089d9cfb0e133b7f071f172e877ba3254/lib/jekyll/algolia/extractor.rb#L47

What is your expected behavior?

If the objectID field is specified in the front matter header, the plugin ignores that value and generates a UUID for that record. Is there a way to use the ID specified in the front matter header instead of generating a UUID when a markdown file has an objectID field?

Can it work similarly to the npm package algoliasearch? With algoliasearch, if you push data to Algolia (using the addObjects() method), it uses the objectID field in the data instead of generating a random UUID.

pixelastic commented 5 years ago

Hello and thanks for the comments,

I think there is some confusion about what the plugin does with regard to objectIDs, so let me clarify that first. The plugin does not create one record for each page; it creates several records per page (the default is one record per paragraph of text). You will have many more records in your Algolia index than you have pages (I did some estimates here).

In addition to that, the plugin goes one step further than our API clients. The objectID of each record is actually an MD5 hash of the record content. This means that whenever you fix a typo in a paragraph, the objectID for that record will change. The indexing done by the plugin takes that into account: instead of pushing all records to the Algolia index, it first does a diff between the about-to-be-pushed records and the ones already in the remote index. It deletes old records and adds new ones, but leaves untouched the records that are the same on both sides. I did that to dramatically cut down on the number of operations required.
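The hashing and diffing described above can be sketched roughly as follows. This is an illustration in Python, not the plugin's actual Ruby code: the `object_id` and `diff` helpers, the JSON serialization, and the sample records are all assumptions made for the example.

```python
import hashlib
import json

def object_id(record):
    """Hash the record's content; any change to the content yields a new ID."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()

def diff(local_records, remote_ids):
    """Return (records to add, IDs to delete) given local records and the remote IDs."""
    local = {object_id(r): r for r in local_records}
    to_add = [r for oid, r in local.items() if oid not in remote_ids]
    to_delete = sorted(set(remote_ids) - set(local))
    return to_add, to_delete

# Hypothetical page with two paragraph records; a typo is fixed in the first one.
old = [{"url": "/a", "text": "Helo world"}, {"url": "/a", "text": "Second paragraph"}]
new = [{"url": "/a", "text": "Hello world"}, {"url": "/a", "text": "Second paragraph"}]

remote_ids = {object_id(r) for r in old}
to_add, to_delete = diff(new, remote_ids)
print(len(to_add), len(to_delete))
```

Only the edited paragraph is added and only its stale version is deleted; the unchanged paragraph produces the same hash on both sides and is left alone.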

So, to get back to your initial problem, I think you will be ok with the default behavior. Whenever you import your Drupal content into your Jekyll website, it should only update a few pages, and running jekyll algolia will in turn only update the records that changed.

You shouldn't have any duplicated content, but if you do, could you post a link to a repo where I could reproduce the issue?

bsubedi26 commented 5 years ago

@pixelastic Thanks for the feedback. jekyll algolia with its default behavior is working great right now, and there isn't any duplicate content.

Closing this issue!

algolia / jekyll-algolia

Algolia publish objectID question #107
