algolia / algoliasearch-rails

AlgoliaSearch integration to your favorite ORM
MIT License
409 stars · 119 forks

Indexing big records #241

Open hannesfostie opened 7 years ago

hannesfostie commented 7 years ago

Hi folks,

The Algolia docs mention that you should try to split big records into multiple objects to be indexed. The Rails app that I work on has a couple of those, and I tried to find a way to do this with the algoliasearch-rails gem but it appears that is not currently possible.

If I were to do this using the ruby library for Algolia, it would mean basically reinventing the wheel, and reimplementing much of the callbacks and what not that this gem conveniently adds for you.

That made me think of an alternative, one that I think would be a good feature for this gem, even if it's an undocumented one. I created this issue to bounce my idea off of you, validate if it would work, ask for pointers, and finally ask if it would be accepted as a feature if it lives up to your standards.

What I had in mind is basically extracting the code that transforms object attributes into json into a new method, let's call it #to_algolia_json (or hash).

Because this is now a method with a single responsibility, it would allow me to overwrite this method in our models and return a hash or an array (or something else that responds to #to_json).

The idea here is that if we're dealing with a single hash, we could index it like this gem already does. If it's an array, we could create multiple records in Algolia instead of a single one that is too big. Dealing with a return value that is an array might require a change in another place, possibly the Algolia ruby library.
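In plain Ruby, the proposed hook might look something like this sketch (`to_algolia_object`, the `Article` class, and the 5 KB chunking are all hypothetical, for illustration only):

```ruby
# Hypothetical sketch of the proposed hook; the gem does not currently
# expose a to_algolia_object method, and Article is an example model.
class Article
  attr_reader :id, :title, :body

  def initialize(id, title, body)
    @id, @title, @body = id, title, body
  end

  # Return a single hash for small records, or an array of hashes when
  # the record should be split into several Algolia objects.
  def to_algolia_object
    chunks = body.scan(/.{1,5000}/m) # naive ~5 KB-per-chunk split
    return { objectID: id.to_s, title: title, body: body } if chunks.size <= 1

    chunks.each_with_index.map do |chunk, i|
      { objectID: "#{id}-#{i}", title: title, body: chunk }
    end
  end
end
```

The indexing side would then check whether the return value is a Hash or an Array and save one or many objects accordingly.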

Does this make sense?

Thank you!

redox commented 7 years ago

> Hi folks,
>
> The Algolia docs mention that you should try to split big records into multiple objects to be indexed. The Rails app that I work on has a couple of those, and I tried to find a way to do this with the algoliasearch-rails gem but it appears that is not currently possible.

Hi @hannesfostie,

that's correct. So far it's not something doable with this rails integration.

> If I were to do this using the ruby library for Algolia, it would mean basically reinventing the wheel, and reimplementing much of the callbacks and what not that this gem conveniently adds for you.

Right :/

> That made me think of an alternative, one that I think would be a good feature for this gem, even if it's an undocumented one. I created this issue to bounce my idea off of you, validate if it would work, ask for pointers, and finally ask if it would be accepted as a feature if it lives up to your standards.

Oh yes sure; that would be awesome :)

> What I had in mind is basically extracting the code that transforms object attributes into json into a new method, let's call it #to_algolia_json (or hash).

I would maybe go for to_algolia_object?

> The idea here is that if we're dealing with a single hash, we could index it like this gem already does. If it's an array, we could create multiple records in Algolia instead of a single one that is too big. Dealing with a return value that is an array might require a change in another place, possibly the Algolia ruby library.

I get that and I see one potential issue we'll need to deal with:

Ex:

Does that make sense?

The original code was written a looooong time ago and has been patched here and there since, making the settings/options/replicas handling a little bit messy.

If you don't manage to make it work, let me know; happy to help!

hannesfostie commented 7 years ago

@redox the "problem" you mention is one I had thought of as well (though in a different scenario), you make a very good point. What I was trying to figure out the other day is how Algolia is meant to keep track of "Algolia Objects" for each "ActiveRecord Object". Do the Algolia Objects all share the AR ID of the AR Object? The docs mention "distinct" queries, I suppose they'd use this ID?

If that is the case then it should be possible to delete them all and just regenerate them, so that none are left behind. The one thing we'd need to figure out is if this ID could ever change, so that no orphaned objects are left.
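One common convention (an assumption here, not something the gem or the docs prescribe) is to derive each sub-record's objectID from the parent record's ID and to store that ID in a shared attribute, which can then be declared as `attributeForDistinct` so search results are de-duplicated:

```ruby
# Sketch (assumptions): each split record carries the parent's AR id in
# a parent_id attribute, shared by all chunks and usable for distinct.
# The "#{id}-#{i}" objectID scheme makes every chunk findable again.
def split_records(article_id, body, chunk_size: 5_000)
  body.scan(/.{1,#{chunk_size}}/m).each_with_index.map do |chunk, i|
    {
      objectID: "#{article_id}-#{i}", # stable, derived from the AR id
      parent_id: article_id,          # attributeForDistinct candidate
      body: chunk
    }
  end
end
```

Because every objectID is derived from the parent ID, deleting and regenerating all chunks for a record only requires knowing that one ID.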

hannesfostie commented 7 years ago

I've been going through the code a little yesterday and then this morning, and now feel kind of stuck. I don't feel confident making any changes to start adding this feature because a lot of the methods have different variations, instance vs class methods, and use a bunch of instance variables whose (possible) values are not entirely clear to me.

I was trying to refactor the code to a point where an AR class/model has an instance method to_algolia_object that returns the hash to be indexed, but didn't get very far. Is there any chance I could get some pointers, or for someone to give this a shot so that I can try to take it from there?

Thanks!

redox commented 7 years ago

> I was trying to refactor the code to a point where an AR class/model has an instance method to_algolia_object that returns the hash to be indexed, but didn't get very far. Is there any chance I could get some pointers, or for someone to give this a shot so that I can try to take it from there?

I'm gonna take a look at it, probably next week because of a packed WE /o\

hannesfostie commented 7 years ago

Appreciate it @redox !

hannesfostie commented 7 years ago

Have you been able to take a look at this by any chance, @redox ?

redox commented 7 years ago

Sorry @hannesfostie; I didn't... I'll work on it next week!

redox commented 7 years ago

I took a deeper look at the code @hannesfostie and we might have one issue with the deletion process.

For now, the Rails gem is in charge of deleting the objects once they are removed from the source DB. As soon as you start splitting the objects into multiple objects, any update of the source object could trigger deletions.

For instance, let's assume you have an object with a big text attribute that is ultimately split into 3 smaller objects. If you update this object and it's now only split into 2 objects (because the attribute is now shorter), you should remove 1 object from the index and update/override the 2 others.
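That scenario can be made concrete with a tiny diff over the derived objectIDs (the `"#{id}-#{i}"` naming scheme is an assumption used for illustration):

```ruby
# Worked example of the stale-record problem: a record first split into
# 3 chunks, later re-split into only 2 after its text got shorter.
def object_ids(record_id, count)
  (0...count).map { |i| "#{record_id}-#{i}" }
end

old_ids = object_ids(1, 3)  # ["1-0", "1-1", "1-2"]
new_ids = object_ids(1, 2)  # ["1-0", "1-1"]
stale   = old_ids - new_ids # ["1-2"] must be deleted from the index
```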

Unfortunately, removing those objects is not that straightforward... I'm afraid the current architecture of the Rails gem is not super suitable for such a use-case, and I strongly think you guys should build something custom on top of the algoliasearch gem: because you'll be able to write it for your needs, I believe it will be way easier (and less messy for the gem).

I can help you guys write a Concern doing that if you think this could be helpful. Let me know what you think @hannesfostie.
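A custom layer on top of the plain client could look roughly like this sketch (`save_objects`/`delete_objects` match the v1 Ruby client's method names; the module name, `to_algolia_objects`, and the in-memory bookkeeping of previously indexed IDs are all assumptions):

```ruby
# Sketch of a hand-rolled indexing layer on top of the plain Algolia
# Ruby client. The including class must implement to_algolia_objects,
# returning the current array of split records.
module SplitIndexable
  # index: any object responding to save_objects / delete_objects,
  # like Algolia::Index from the algoliasearch gem.
  def reindex_split!(index)
    new_objects = to_algolia_objects
    new_ids     = new_objects.map { |o| o[:objectID] }
    old_ids     = @indexed_object_ids || []

    index.delete_objects(old_ids - new_ids) # remove stale chunks
    index.save_objects(new_objects)         # add/overwrite current ones
    @indexed_object_ids = new_ids
  end
end
```

In a real Concern the previously indexed IDs would come from the database (or be recomputed from the previous attribute values) rather than an instance variable, so deletions survive process restarts.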

hannesfostie commented 7 years ago

That sounds good @redox - do you mind if I email you in the next couple days on the address in your profile?

I was afraid modifying the gem would be tricky, so this solution works for us. I do think that in the long term, refactoring the rails gem so it supports this and is a little bit more modular would be a huge improvement, both for its users and for you and your colleagues, since it will make changes easier and allow for more customization.

Spone commented 6 years ago

I'm interested in this as well, but for a different use case: I have events with a start_date and an end_date. I'd like to be able to input a single date, and get all events that include this date.

I think the best way of doing that is to create 1 record per day for each event (i.e. a 5-day event will be stored as 5 records in the index). Then I'll use the distinct feature to de-duplicate the hits.

So I'd need a way to create several records for each event...

What do you think?
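A sketch of that per-day-record idea (attribute names are assumptions: `event_id` would be the `attributeForDistinct` candidate, `day_timestamp` a numeric filter target):

```ruby
require "date"

# Generate one Algolia record per day of an event. All records share
# event_id so the distinct feature collapses them back to one hit.
def event_day_records(event_id, name, start_date, end_date)
  (start_date..end_date).map do |day|
    {
      objectID: "#{event_id}-#{day.iso8601}",
      event_id: event_id,              # shared key for distinct
      name: name,
      day_timestamp: day.to_time.to_i  # numeric filter: day_timestamp = X
    }
  end
end
```

Querying with a numeric filter on `day_timestamp` for the chosen day then matches any event that spans it.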

Spone commented 6 years ago

Hi @redox, any idea?

fatihtas commented 6 years ago

Well, I am suffering a lot from this max 10 KB or 20 KB limit as well. Can you please take a look at this problem and Rails gem support ASAP? I know for sure that I won't be able to custom-develop such a feature for my product, as I am the only developer, and this is now a must for me to be able to use Algolia.

Or you could just drop the 10-20 KB limit like it used to be, and this all sorts itself out easily.

sagar-ranglani commented 4 years ago

Any updates on this?

rememberlenny commented 1 month ago

You can use ActiveSupport::JSON.encode(your_record).size in the function that handles indexing to evaluate how big the record is.
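For example (outside Rails, the stdlib `JSON` gives the same measurement; the 10 KB threshold is the plan-dependent limit discussed above, so check your own plan's actual value):

```ruby
require "json"

# Measure a record's serialized size before sending it to Algolia.
# Inside Rails you'd use ActiveSupport::JSON.encode as suggested above;
# JSON.generate yields the same bytesize for plain hashes.
MAX_RECORD_BYTES = 10_000 # assumed limit; plan-dependent

def record_too_big?(record_hash)
  JSON.generate(record_hash).bytesize > MAX_RECORD_BYTES
end
```

Records that fail this check are the ones that need to be split into multiple objects before indexing.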