ankane / searchkick

Intelligent search made easy

[Idea] Batch pre-processing for derived data #1370

Closed · TikiTDO closed this issue 2 years ago

TikiTDO commented 4 years ago

The chewy ES adapter has the concept of crutches, which is really great for avoiding N+1 queries for derived data.

The search_import scope feature of searchkick solves some of these same problems, but it's not always the best tool for the job, particularly when the derived data may necessitate loading a huge number of has_many relations. In those cases I find that it's faster and easier to write my own code to get the data using SQL.
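
For reference, the documented search_import pattern preloads associations during bulk reindexing (the :children association below is just an illustration):

class ExampleModel < ApplicationRecord
  searchkick

  # Preload associated records to avoid N+1 queries during reindexing
  scope :search_import, -> { includes(:children) }
end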

As an example, here I'm finding relations from a grandparent, through a polymorphic association (it's all using UUIDs to avoid other collisions), which is an order of magnitude faster than using includes.

::ExampleModel.where(id: some_collection).joins(%(
  LEFT JOIN "another_models" ON
    "another_models"."polymorphic_id" = "example_models"."parent_id" AND
    "another_models"."submitted_by_id" = "example_models"."user_id"
  ))
  .where(AnotherModel.arel_table[:polymorphic_id].not_eq(nil))
  .pluck(::ExampleModel.arel_table[:id], AnotherModel.arel_table[:id]).to_h

I'd like to do something similar with searchkick. To accomplish this, I propose adding an extra derived_data callback, which would live in bulk_indexer.rb.

Here's a rough example of what it would look like.

# bulk_indexer.rb

  def bulk_index(records)
    derived_data = index.derived_data(records) if index.respond_to?(:derived_data)
    Searchkick.indexer.queue(records.map { |r| RecordData.new(index, r, derived_data).index_data })
  end

  def bulk_update(records, method_name)
    derived_data = index.derived_data(records) if index.respond_to?(:derived_data)
    Searchkick.indexer.queue(records.map { |r| RecordData.new(index, r, derived_data).update_data(method_name) })
  end

This data could then be passed into the search_data method inside record_data.rb

# record_data.rb
  attr_reader :index, :record, :derived_data

  # ...

  def search_data(method_name = nil)
    partial_reindex = !method_name.nil?

    data_method = record.method(method_name || :search_data)
    # Only pass the batch-level data when search_data accepts an argument
    source = data_method.arity == 0 ? data_method.call : data_method.call(derived_data)
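
On the model side, this might pair with something like the following (a rough sketch; the hash shape returned by derived_data and the way the indexer reaches it are assumptions, not existing searchkick API):

# example_model.rb (hypothetical)
class ExampleModel < ApplicationRecord
  searchkick

  # Called once per batch by the proposed bulk_indexer hook; runs a single
  # batched query (like the pluck example above) and returns { id => value }
  def self.derived_data(records)
    records.map { |r| [r.id, r.parent_id] }.to_h # placeholder for the real batch query
  end

  # Takes an argument, so the patched record_data.rb passes the batch hash in
  def search_data(derived_data)
    {
      name: name,
      derived_value: derived_data[id]
    }
  end
end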

We can put all this together into a PR with some tests, but I wanted to first check whether this is something you'd be open to, or if you have another idea for how to accomplish this without more changes.

ankane commented 4 years ago

Hey @TikiTDO, thanks for the suggestion! Let me think about this one. In the past, I've taken an approach similar to performant conversions. The derived data idea has the benefit of speeding up bulk reindexing without the need for a cache, but reindexing individual records might still be expensive.

TikiTDO commented 4 years ago

Yeah, this is purely a bulk re-index optimization. I already have some ideas for how to use your existing conversion functionality to boost some of our slower individual indexes, but this is a slightly different scenario, mostly because we have to re-index huge numbers of records fairly frequently.

Incidentally, the underlying use case was moving from chewy to searchkick, and the derived_data approach made it super easy to migrate code we already had. That could be a nice selling point for people looking to use Rails with a newer version of ES.

We have the derived_data stuff working with a bit of monkey-patching, but I'm not a big fan of digging that deeply into a lib's innards in an app that consumes said gem, so I was hoping to get it in officially. Let me know if you want to explore this idea further.

matt-dutchie commented 4 years ago

Just wanted to add a 👍 here. We're in the process of migrating a Mongo database to Postgres, and for relations that bridge the two databases it would be awesome to be able to pass some derived data to the search_data method, since we can't just do an includes in search_import as the docs suggest.

ankane commented 2 years ago

Hey @TikiTDO, sorry for the very long delay. I don't think it's something I'd like to support right now (but you could fork/possibly use Module#prepend to add it).

TikiTDO commented 2 years ago

We have decided to opt out of the ES / OS ecosystem, since it's too much work to keep up with the drama, not to mention that the cost of running a small prod environment was quite a bit higher than with other systems. That said, we had previously worked around this issue using the code below. If it helps anyone sticking with ES, it can be a decent starting point for a feature like the one described above.

module SearchkickExtension
  # Monkey-patches Searchkick::BulkIndexer to request the indexing cache from the indexed class and pass it to each record during indexing
  module BulkIndexer
    def bulk_index(records)
      # Check if the record supports an indexing cache
      records = records.first.class.populate_indexing_cache(records) if records.first&.class&.respond_to?(:populate_indexing_cache)
      items_to_index = records.map { |record| Searchkick::RecordData.new(index, record).index_data }
      Searchkick.indexer.queue(items_to_index)
    end
  end
end
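
# Assumed wiring (not shown above): prepend the patch, e.g. in an initializer,
# so the override takes precedence over Searchkick's own bulk_index
Searchkick::BulkIndexer.prepend(SearchkickExtension::BulkIndexer)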

class User
  # Generate the cache for an entire batch of users
  def self.populate_indexing_cache(user_batch)
    expensive_computation = compute(user_batch) # Returns { user_id: { computation_result: number } }

    user_batch.map do |user|
      user.indexing_cache = expensive_computation[user.id] 
      user
    end
  end

  # Used by search_data to generate the index data
  attr_writer :indexing_cache

  # Populate the cache if it has not been populated
  def indexing_cache
    self.class.populate_indexing_cache([self]) unless @indexing_cache
    @indexing_cache
  end

  # Generate the user data
  def search_data
    {
      name: name,
      computation_result: indexing_cache[:computation_result],
    }
  end
end
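
With the patch prepended, a bulk User.reindex should flow through the overridden bulk_index, so each batch pays for the expensive computation once rather than once per record, while one-off indexing falls back to the lazy indexing_cache reader.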

A lot of this can be extracted into a common concern, so the DSL to do this can be quite clean.
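
For example, a minimal sketch of such a concern (the BatchIndexable and batch_indexing_cache names here are hypothetical, not from the code above):

module BatchIndexable
  extend ActiveSupport::Concern

  included do
    # Written by populate_indexing_cache, read back in search_data
    attr_writer :indexing_cache
  end

  class_methods do
    # Each including model implements batch_indexing_cache(batch), returning
    # { record_id => cached_data } for the whole batch in one pass
    def populate_indexing_cache(batch)
      cache = batch_indexing_cache(batch)
      batch.each { |record| record.indexing_cache = cache[record.id] }
      batch
    end
  end

  # Fall back to a batch of one when a single record is indexed
  def indexing_cache
    self.class.populate_indexing_cache([self]) unless @indexing_cache
    @indexing_cache
  end
end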

ankane commented 2 years ago

Great, thanks for sharing!