Closed TikiTDO closed 2 years ago
Hey @TikiTDO, thanks for the suggestion! Let me think about this one. In the past, I've taken an approach similar to performant conversions. The derived data idea has the benefit of speeding up bulk reindexing without the need for a cache, but reindexing individual records might still be expensive.
Yeah, this is purely a bulk re-index optimization. I already have some ideas about how to use your existing conversion functionality to speed up some of our slower individual indexes, but this is a slightly different scenario, mostly because we have to re-index huge numbers of records fairly frequently.
Incidentally, the underlying use case was moving from chewy to searchkick, and the derived_data approach made it super easy to migrate code we already had. That could be a nice selling point for people looking to use Rails with a newer version of ES.
We have the derived_data stuff working with a bit of monkey-patching, but I'm not a big fan of digging that deeply into a lib's innards in an app that consumes said gem, so I was hoping to get it in officially. Let me know if you want to explore this idea further.
Just wanted to add a 👍 here. We're in the process of migrating a Mongo database to Postgres, and for relations that bridge the two databases it would be awesome to be able to pass in some derived data to the search_data method, since we can't just do an includes in search_import as suggested by the docs.
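To make the cross-database case concrete, here is a minimal sketch of the pattern being asked for: compute the derived data for a whole batch in one pass against the "other" database, then merge it into each record's search payload. The second-database lookup is simulated with a plain hash, and all names (`Account`, `derived_data_for`, `derived`) are invented for illustration; none of this is Searchkick API.

```ruby
# Sketch: batch-compute derived data from a second data store, then merge
# it into each record's search payload. Names here are illustrative.
class Account
  attr_reader :id, :email
  attr_accessor :derived

  def initialize(id, email)
    @id = id
    @email = email
  end

  # In a real app this would be a single batched query against the Mongo
  # side (e.g. one find over account_id IN (...)), not N per-record lookups.
  def self.derived_data_for(accounts)
    accounts.to_h { |a| [a.id, { plan: "plan-#{a.id}" }] }
  end

  def search_data
    { email: email }.merge(derived || {})
  end
end

accounts = [Account.new(1, "a@x.com"), Account.new(2, "b@x.com")]
lookup = Account.derived_data_for(accounts)
accounts.each { |a| a.derived = lookup[a.id] }
payloads = accounts.map(&:search_data)
# payloads => [{ email: "a@x.com", plan: "plan-1" }, { email: "b@x.com", plan: "plan-2" }]
```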
Hey @TikiTDO, sorry for the very long delay. I don't think it's something I'd like to support right now (but you could fork, or possibly use Module#prepend to add it).
We have decided to opt out of the ES / OS ecosystem; keeping up with all the drama was too much work, and the cost of running even a small prod environment was quite a bit higher than what other systems could provide. That said, we had previously worked around this issue using code like the below. If this helps anyone sticking with ES, it can be a decent starting point for a feature like the one described above.
```ruby
module SearchkickExtension
  # Used to monkey patch Searchkick::BulkIndexer: request an indexing cache
  # from the indexed class and pass each record its slice during indexing
  module BulkIndexer
    def bulk_index(records)
      # Check whether the record class supports an indexing cache
      if records.first&.class&.respond_to?(:populate_indexing_cache)
        records = records.first.class.populate_indexing_cache(records)
      end
      items_to_index = records.map { |record| Searchkick::RecordData.new(index, record).index_data }
      Searchkick.indexer.queue(items_to_index)
    end
  end
end
```
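For readers unfamiliar with the mechanism: the module above is meant to be applied with Module#prepend (e.g. `Searchkick::BulkIndexer.prepend(SearchkickExtension::BulkIndexer)` from an initializer). A prepended module's methods run before the class's own and can reach the original implementation via `super`. A minimal, self-contained demonstration, with `FakeIndexer` standing in for the real indexer class:

```ruby
# Demonstration of Module#prepend: the prepended module's bulk_index runs
# first, records some state, then calls super to reach the original method.
module BatchTracking
  def bulk_index(records)
    @last_batch_size = records.size # runs before the original method
    super
  end
end

class FakeIndexer
  attr_reader :last_batch_size, :indexed

  def bulk_index(records)
    @indexed = records
  end
end

FakeIndexer.prepend(BatchTracking)

indexer = FakeIndexer.new
indexer.bulk_index([:a, :b, :c])
# indexer.last_batch_size => 3, indexer.indexed => [:a, :b, :c]
```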
```ruby
class User
  # Generate the cache for an entire batch of users
  def self.populate_indexing_cache(user_batch)
    expensive_computation = compute(user_batch) # Returns { user_id: { computation_result: number } }
    user_batch.map do |user|
      user.indexing_cache = expensive_computation[user.id]
      user
    end
  end

  # Used by search_data to generate the index document
  attr_writer :indexing_cache

  # Populate the cache if it has not been populated yet
  def indexing_cache
    self.class.populate_indexing_cache([self]) unless @indexing_cache
    @indexing_cache
  end

  # Generate the user data
  def search_data
    {
      name: name,
      computation_result: indexing_cache[:computation_result],
    }
  end
end
```
A lot of this can be extracted into a common concern, so the DSL to do this can be quite clean.
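As a hedged sketch of what such a "common concern" could look like: the batch-cache plumbing moves into a shared module, and each model only declares one block that computes derived data for a whole batch. Plain `Module#included` is used here instead of `ActiveSupport::Concern` so the snippet runs standalone, and names like `BatchIndexCache` and `batch_index_cache` are invented for illustration.

```ruby
# Hypothetical shared concern: a one-block DSL for batch-level derived data.
module BatchIndexCache
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # DSL: declare one block that computes derived data for a whole batch,
    # returning a hash keyed by record id.
    def batch_index_cache(&block)
      @batch_index_cache_block = block
    end

    def populate_indexing_cache(records)
      cache = @batch_index_cache_block.call(records)
      records.each { |r| r.indexing_cache = cache[r.id] }
    end
  end

  attr_writer :indexing_cache

  # Fall back to a single-record batch when the cache was never populated
  def indexing_cache
    self.class.populate_indexing_cache([self]) unless @indexing_cache
    @indexing_cache
  end
end

class Product
  include BatchIndexCache
  attr_reader :id

  def initialize(id)
    @id = id
  end

  batch_index_cache do |products|
    products.to_h { |p| [p.id, { score: p.id * 10 }] }
  end

  def search_data
    { id: id, score: indexing_cache[:score] }
  end
end
```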
Great, thanks for sharing!
The chewy ES adapter has the concept of crutches, which is really great for avoiding N+1 queries for derived data. The search_import scope feature of searchkick solves some of these same problems, but it's not always the best tool for the job, particularly when the derived data may necessitate loading a huge number of has_many relations. In those cases I find that it's faster and easier to write my own code to get the data using SQL. As an example, here I'm finding relations from a grandparent, through a polymorphic association (it's all using UUIDs to avoid other collisions), which is an order of magnitude faster than using includes.
I'd like to do something similar with searchkick. To accomplish this I'd like to propose adding an extra derived_data callback, which would live in bulk_indexer.rb. Here's a rough example of what it would look like. This data could then be passed into the data method inside record_data.rb.
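The original example did not survive extraction; below is a hedged reconstruction of the general shape being proposed, not the author's actual code or Searchkick's internals. The indexer asks the model class for batch-level derived data once, then merges each record's slice into its document. `SketchBulkIndexer`, `Post`, and `derived_data` are all invented names.

```ruby
# Sketch of the proposed derived_data hook: one batch-level callback on the
# model class, whose result is merged per record into the indexed document.
class SketchBulkIndexer
  def bulk_index(records)
    klass = records.first.class
    derived = klass.respond_to?(:derived_data) ? klass.derived_data(records) : {}
    # In real record_data.rb terms, this merge would happen inside the data
    # method; here we merge into search_data directly to keep it standalone.
    records.map { |r| r.search_data.merge(derived[r.id] || {}) }
  end
end

class Post
  attr_reader :id, :title

  def initialize(id, title)
    @id = id
    @title = title
  end

  # One query for the whole batch instead of one per record
  def self.derived_data(posts)
    posts.to_h { |p| [p.id, { comment_count: p.id + 1 }] }
  end

  def search_data
    { title: title }
  end
end

docs = SketchBulkIndexer.new.bulk_index([Post.new(1, "a"), Post.new(2, "b")])
# docs => [{ title: "a", comment_count: 2 }, { title: "b", comment_count: 3 }]
```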
We can put all of this together into a PR with some tests, but I wanted to first check whether this is something you'd be open to, or whether you have another idea for how to accomplish this without more changes.