algolia / gatsby-plugin-algolia

A plugin to push records to Algolia based on GraphQL queries
https://yarn.pm/gatsby-plugin-algolia
Apache License 2.0
177 stars · 45 forks

Reduce build operations #9

Closed Haroenv closed 4 years ago

Haroenv commented 6 years ago

A continuation of #5 and #1

What happens currently is:

  1. Gatsby builds
  2. the extraction logic runs and turns this into objects
  3. these objects are indexed into a separate index
  4. this separate index replaces the original index
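For reference, steps 3 and 4 above can be sketched roughly as follows. This is an illustrative sketch only: a stub stands in for the real `algoliasearch` client so the sequence of operations is visible without network calls, and the function names are made up, not the plugin's actual code.

```javascript
// Stub client mimicking the shape of the algoliasearch client's
// initIndex/saveObjects/moveIndex calls, logging each operation.
function makeStubClient(log) {
  return {
    initIndex: name => ({
      saveObjects: objects => log.push(`saveObjects(${name}, ${objects.length})`),
    }),
    moveIndex: (from, to) => log.push(`moveIndex(${from} -> ${to})`),
  };
}

// Hypothetical name for the current flow: push everything to a
// temporary index, then atomically swap it in place of the original.
function replaceIndexAtomically(client, indexName, objects) {
  const tmpName = `${indexName}_tmp`;
  // step 3: index all extracted objects into a separate, temporary index
  client.initIndex(tmpName).saveObjects(objects);
  // step 4: the temporary index replaces the original in one move
  client.moveIndex(tmpName, indexName);
}

const ops = [];
replaceIndexAtomically(makeStubClient(ops), 'Products', [
  { objectID: '1' },
  { objectID: '2' },
]);
```

The cost is visible here: every build re-pushes all records, even unchanged ones.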

We should preferably use atomic-algolia or algolia-indexing to turn this into a flow where no extra index needs to be created; instead, we can index only the changes since the last push, rather than everything.

cc @pixelastic (when you're back from parental leave of course)

pixelastic commented 5 years ago

Good call. I would suggest using algolia-indexing instead of atomic-algolia (it's used in production on TalkSearch and has a better test suite).

algolia-indexing currently only implements what I call "full atomic" indexing. This will make sure to use as few operations as possible (by only applying a diff of changes), but to do so in an atomic way, it requires a plan that can hold twice the number of records actually used.

I planned on implementing another mode, called "live diff", that will be similar; the only difference is that it won't be atomic (the diff is applied live on the production index). It still uses as few operations as possible, but doesn't need a large plan.

Both modes have their merits, it's all a question of trade-offs. Considering that your current implementation already requires a plan that can hold twice the number of records, I think going with a full atomic can only be an improvement and implementing the live diff can wait.

Haroenv commented 5 years ago

What I did now is already fairly close to "full atomic", I think, but it takes the generated objects as the source of truth (create a temp index and switch), so it's not exactly worth the "effort" to switch. But a live diff would be nice (since everything always has hashes in GraphQL, this would be possible to leverage). Is this something you have the bandwidth to collaborate on, @pixelastic?

pixelastic commented 5 years ago

What you have now still consumes a lot of operations (as you need to re-push all the records to a tmp index on each push). Switching to algolia-indexing would drastically reduce this usage.

I tried to make the package as easy as possible to use (there is one method to call with credentials, settings and records; everything else is automated), so as to reduce the amount of effort needed for a switch, but I'd be interested in knowing how I could make this even easier.

Or maybe we're talking about the same thing with different names. Maybe what you call live diff is what I call full atomic :)

u12206050 commented 5 years ago

Any update on this? I am using way too many operations between builds, and 99% of all the data indexed is still the same. Any info on how I could manually implement this "algolia-indexing" you are talking about? Links or docs would be helpful :) Thanks for the plugin, though.

Haroenv commented 5 years ago

There was no update because nobody commented here in months, so I worked on other things. Are you interested in contributing here? I can give some pointers where to start.

u12206050 commented 5 years ago

Yes sure, I would really want this to work so can help out :)

Haroenv commented 5 years ago
  1. Find out where long-term storage can be done (somewhere in Gatsby's cache / somewhere on the file system / in a different Algolia index)
  2. When indexing, compute a hash of each object
  3. Before indexing, compare the previously stored hash with the newly computed one (a Map of objectID: oldHash)
  4. Only index deleted / added / modified objects

u12206050 commented 5 years ago

I tried searching for algolia-indexing and mainly came to the Algolia docs, but I wouldn't know how or where to start making changes within this plugin to accomplish what you mentioned. I am running my builds on Netlify, so the only thing I can use is Netlify's cache to keep track of indexed objects.

coreyward commented 5 years ago

@Haroenv That approach will still try to index every object on every environment/machine that this website is built on. For common deployment targets like Netlify that periodically clear the build cache anyway, you're going to be making excessive calls routinely. Algolia ought to offer a way of making this easier.

@u12206050 For what it's worth, I just went with another approach that updates Algolia via an external process instead of using this plugin. In hindsight, trying to couple indexing with the build didn't make sense for a structured object search like mine anyway; if you're in a similar situation, that may be much less work.

u12206050 commented 5 years ago

Ok, so from the sounds of it, I need some external key:hash storage space that I can query before indexing objects, since Netlify's cache gets cleared. I'll see if I can first implement a fork that uses Netlify's cache, or an optional function whereby anyone can provide the hash for a given object key.

u12206050 commented 5 years ago

Am I correct in assuming that, in the plugin's current state, if I simply filter out what has changed, it only adds those objects to `${indexName}_tmp` and then overwrites the existing index once done? That would mean only the changed objects would actually be in Algolia, and everything else that didn't change would be lost.

Meaning I have to remove that piece of code and update the main index directly?

Haroenv commented 5 years ago

If you do it that way, there will be a flash of wrong or no results.

u12206050 commented 5 years ago

I've made a pull request. It now supports a generic hash version that will only update objects that have changed. It works well on Netlify as long as the cache persists; once the cache is removed, it updates everything again.

pixelastic commented 5 years ago

@Haroenv I think the algolia-indexing project would be the best place to start. It is still a beta and heavy work in progress, but it does solve a few of the issues mentioned in this thread. It uses the Algolia indexes and records themselves to do a smart diff between what is already in the index and what is about to be pushed to reduce the number of operations used.

As full disclosure, I no longer work at Algolia, but I intend to keep working on algolia-indexing when time permits, to improve it even further. The current version can be greatly improved (see the issues for an explanation).

u12206050 commented 5 years ago

Thanks, I have removed my previous pull request and made a new one that uses Algolia itself to check for updates. It compares specified fields to see whether an object should be updated, inserted, removed, or just ignored.
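Roughly, the comparison described here could look like this (an illustrative sketch, not the actual pull request code; `partitionByMatchFields` is a made-up name):

```javascript
// Compare a set of matchFields between local objects and what the index
// already holds, sorting each object into insert / update / ignore,
// and collecting removals for objects that no longer exist locally.
function partitionByMatchFields(localObjects, remoteObjects, matchFields) {
  const remoteById = new Map(remoteObjects.map(o => [o.objectID, o]));
  const toInsert = [];
  const toUpdate = [];
  const ignored = [];
  for (const obj of localObjects) {
    const remote = remoteById.get(obj.objectID);
    if (!remote) {
      toInsert.push(obj); // not in the index yet
      continue;
    }
    remoteById.delete(obj.objectID);
    // any differing match field means the object needs an update
    const changed = matchFields.some(field => obj[field] !== remote[field]);
    (changed ? toUpdate : ignored).push(obj);
  }
  // anything left in remoteById no longer exists locally
  const toRemove = [...remoteById.keys()];
  return { toInsert, toUpdate, toRemove, ignored };
}
```

Only `toInsert` and `toUpdate` would cost indexing operations; `ignored` objects are skipped entirely.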

fraserisland commented 5 years ago

This would be great to get in! I also had to move to an external process due to excess records being indexed when they were basically all the same.

danvernon commented 5 years ago

> Thanks, I have removed my previous pull request and made a new one using Algolia to check for updates. It compares specified fields to see if an object should be updated, inserted, removed or just ignored.

Did this get pushed?

Haroenv commented 5 years ago

It has not been published yet (sorry), but as far as I can tell @u12206050 has published his fork on npm: https://yarnpkg.com/en/package/gatsby-plugin-algolia-search

danvernon commented 5 years ago

@Haroenv thanks for the quick update - I followed the instructions, but I can see my operations are increasing with every build. The idea of this was that it would only need to update changed records, right?

u12206050 commented 5 years ago

@danvernon Have you tried this: gatsby-plugin-algolia-search?

danvernon commented 5 years ago

@u12206050 yes, that's what I just implemented - it's doing about 800 operations per build. I have 628 records. Here's my code.

```js
{
  resolve: `gatsby-plugin-algolia-search`,
  options: {
    appId: process.env.GATSBY_ALGOLIA_APP_ID,
    apiKey: process.env.ALGOLIA_ADMIN_KEY,
    queries,
    chunkSize: 10000, // default: 1000
    enablePartialUpdates: true, // default: false
    matchFields: ['slug', 'modified'], // Array<String> default: ['modified']
  },
}
```

```js
const productQuery = `{
  products: allShopifyProduct {
    edges {
      node {
        objectID: id
        title
        handle
        description
        images {
          originalSrc
        }
        variants {
          price
        }
      }
    }
  }
}`

const flatten = arr =>
  arr.map(({ node: { ...rest } }) => ({
    ...rest,
  }))

const settings = {
  attributesToSnippet: [`description:20`],
}

const queries = [
  {
    query: productQuery,
    transformer: ({ data }) => flatten(data.products.edges),
    indexName: `Products`,
    settings,
    matchFields: ['slug', 'modified'], // Array<String> overrides main match fields, optional
  },
]

module.exports = queries
```

u12206050 commented 5 years ago

It needs both the slug and modified fields for comparing. If you don't have those fields, change the matchFields in the options to something like updated, and then fetch the updated field from your source:


```js
const productQuery = `{
  products: allShopifyProduct {
    edges {
      node {
        objectID: id
        title
        updated
        handle
        description
        images {
          originalSrc
        }
        variants {
          price
        }
      }
    }
  }
}`
```

danvernon commented 5 years ago

@u12206050 I don't have slug, so I can just change it to matchFields: ['handle', 'updatedAt'], yeah?

u12206050 commented 5 years ago

Yeh, that looks good.

danvernon commented 5 years ago

@u12206050 hrmm, not sure this is still working as intended - it seemed to work when I pushed a build from code, but when the hook fired from changing 1 product, it seemed to use up the 800 operations again.

u12206050 commented 5 years ago

Hmm, that is strange. I can assure you it should work, though, as we have been using this for months now without fail. We check a date field modified, and if/when that value changes, then only that post gets updated. One thing it could be: if you are using the URL to check, make sure it doesn't change between development and production environments, or just remove it from the matchFields.

Haroenv commented 5 years ago

It could also be that you are modifying every object on build.

Haroenv commented 4 years ago

This has been implemented in 0.8.0 as enablePartialUpdates, thanks @u12206050 :)