cavb1205 / django-solr-search

Automatically exported from code.google.com/p/django-solr-search

Selective indexing for models #15

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
We require selective indexing for our search documents. That is, we only
want to index Book.objects.filter(is_public=True), or some other selection
of Books, rather than Book.objects.all().

Xappy (partially) supports selective indexing by registering a document
('index') against a queryset instead of a model class. However this means
that when you change a particular book to is_public=False, the post_save
handler doesn't know to update the index. This could be remedied by doing a
costly queryset evaluation:
  if book in queryset:
    index.update(book)
  else:
    index.delete(book)

But that means evaluating a queryset on every model change, so Xappy
(wisely) hasn't done that.

I recommend django-solr-search takes a better approach (see attached
patch). I added an is_indexable(instance) method to SearchDocument. This
method is called during the post_save handler and during reindexing. If it
returns False, the object is removed from the index. If True, object is
added/updated. Simple.
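
A minimal sketch of the idea, not the attached patch itself. `FakeIndex`, `Book`, `BookDocument`, and `on_post_save` are hypothetical stand-ins so the example runs without Django or Solr; only the `is_indexable(instance)` hook name comes from the description above.

```python
class FakeIndex:
    """Hypothetical in-memory stand-in for the Solr index."""

    def __init__(self):
        self.docs = {}

    def add(self, pk, fields):
        self.docs[pk] = fields

    def delete(self, pk):
        # Deleting a document that was never indexed is a no-op.
        self.docs.pop(pk, None)


class Book:
    """Hypothetical plain-Python stand-in for a Django model."""

    def __init__(self, pk, title, is_public):
        self.pk = pk
        self.title = title
        self.is_public = is_public


class BookDocument:
    """Sketch of a search document with an is_indexable hook."""

    def __init__(self, index):
        self.index = index

    def is_indexable(self, instance):
        # Per-instance test; no queryset is evaluated.
        return instance.is_public

    def on_post_save(self, instance):
        # In a real project this would be connected to Django's post_save.
        if self.is_indexable(instance):
            self.index.add(instance.pk, {"title": instance.title})
        else:
            self.index.delete(instance.pk)


index = FakeIndex()
doc = BookDocument(index)
book = Book(pk=1, title="Dune", is_public=True)
doc.on_post_save(book)   # book is added to the index
indexed_after_first_save = 1 in index.docs
book.is_public = False
doc.on_post_save(book)   # book is removed from the index
```

The check costs one attribute lookup per save, rather than a queryset evaluation.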

The downside of this over specifying a queryset up-front is that the
reindexing process still does a Book.objects.all() call. I couldn't think
of a good clean way to specify a different queryset that wouldn't violate
DRY. Feel free to suggest one?

Original issue reported on code.google.com by craig.ds@gmail.com on 2 Mar 2009 at 3:13

Attachments:

GoogleCodeExporter commented 9 years ago
I like it. I was thinking about making the post_save handler part of the
document, but then the developer would be doing a lot more work than just the
`is_indexable` you are suggesting. The only question I have is why you are
sending a delete if it's not indexable. It seems like just the first if
statement is good enough.

When thinking about DRY, it might be better to move the `is_indexable` check
into the add method of the document. If you check out add in
solr/connection.py, it will only do the add if there is XML.
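
A rough sketch of that alternative, under stated assumptions: `FakeConnection`, `BookDocument.to_xml`, and the dict-shaped instances are hypothetical and are not the actual code in solr/connection.py. It only illustrates the "no XML, no add" behaviour described above.

```python
class FakeConnection:
    """Hypothetical stand-in for the Solr connection."""

    def __init__(self):
        self.posted = []

    def add(self, xml):
        # Only send an update when the document produced some XML.
        if xml:
            self.posted.append(xml)


class BookDocument:
    """Sketch of a document that performs the check while building XML."""

    def is_indexable(self, instance):
        return instance["is_public"]

    def to_xml(self, instance):
        # An empty string makes FakeConnection.add() a no-op.
        if not self.is_indexable(instance):
            return ""
        return '<doc><field name="id">%d</field></doc>' % instance["id"]


conn = FakeConnection()
doc = BookDocument()
conn.add(doc.to_xml({"id": 1, "is_public": True}))    # sent
conn.add(doc.to_xml({"id": 2, "is_public": False}))   # silently skipped
```

In this variant a non-indexable instance is skipped, but nothing removes a copy that was indexed earlier.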

I've attached a patch for this. I haven't tested it; I'm mostly thinking out
loud.

Thank you for the patch and the documentation! Let me know what you think
about this approach.

Thanks,

Sean

Original comment by sean.cre...@gmail.com on 2 Mar 2009 at 4:28

Attachments:

GoogleCodeExporter commented 9 years ago
I thought about doing it in the add() method, as it was a smaller code change,
but to me that seems counter-intuitive. If I'm calling add() on every document,
I would expect every document to get added. Hence I put the test outside add()
so it wouldn't get called at all if the document wasn't indexable.

In my understanding the delete() call is necessary in case the object has been
indexed before and then changed. e.g.:

 b = Book.objects.create(is_public=True)   # (book is added to index)
 b.is_public = False
 b.save()                                  # (book should be deleted from index)

Original comment by craig.ds@gmail.com on 2 Mar 2009 at 11:04

GoogleCodeExporter commented 9 years ago
I have a situation that seems related to this issue. I have a set of objects
that should be viewable by only a subset of my users: users that belong to a
particular group and thereby have a certain permission. In my object model I
have a BooleanField similar to is_public above, and in the view I check that
the user has the necessary permission before I list the object in the search
results. This process works for me, though it isn't terribly efficient.
However, I don't think that the binary choice of either indexing the object or
not works for this scenario. And if things get more granular, as I expect them
to, with multiple groups of users each with their own set of objects, this
approach will quite quickly fall apart.

However, one way around this would be to generate and use multiple indexes.
For members of group A, use index A. For members of group B, use index B. An
interesting wrinkle would be if a particular group should use indexes A and C,
but not B and D -- how would you combine the results of a search of multiple
indexes?

So that's my use case. Again, what I'm doing right now is placing flags on my
objects that correspond to permissions, and looping through the search results
to weed out those objects that the user does not have permission to view. But
it might be more efficient to do this at the index level.
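
The loop described above might look roughly like this. It is a sketch only: the result dicts, the `required_permission` key, and the permission strings are hypothetical and do not come from the commenter's project.

```python
def visible_results(results, user_permissions):
    """Drop search results the user lacks permission to view."""
    visible = []
    for result in results:
        required = result.get("required_permission")
        # Results without a required permission are visible to everyone.
        if required is None or required in user_permissions:
            visible.append(result)
    return visible


results = [
    {"id": 1, "required_permission": None},
    {"id": 2, "required_permission": "view_group_a"},
    {"id": 3, "required_permission": "view_group_b"},
]
allowed = visible_results(results, {"view_group_a"})
```

Every hit is fetched from the index first and discarded afterwards, which is the inefficiency mentioned above.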

Original comment by tphern...@gmail.com on 19 Mar 2009 at 7:05

GoogleCodeExporter commented 9 years ago
tpherndon,

One possible method is to create a permission field in your search document
(member_groups). Use a custom transform_member_groups method to create a list
of space-separated member ids for each document (for example '12 35 36'). When
you query Solr in your view, simply pass a custom query with the valid member
ids of the current user.

q = Query(q='text:"user query" AND ( (member_groups:12) OR (member_groups:35) OR (member_groups:36) )', model='example__doc')
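
The query string above could be assembled from the current user's ids with a small helper. This is only a sketch: `build_group_query` is a hypothetical name, it assumes the ids are integers, and it only builds the string that would be passed as `q`.

```python
def build_group_query(user_query, group_ids):
    """Build a Solr query string restricted to the given ids."""
    if not group_ids:
        # Assumption: a user with no ids should not run a restricted search.
        raise ValueError("at least one id is required")
    # Escape backslashes and quotes so the phrase stays one quoted term.
    phrase = user_query.replace("\\", "\\\\").replace('"', '\\"')
    groups = " OR ".join("member_groups:%d" % int(g) for g in group_ids)
    return 'text:"%s" AND (%s)' % (phrase, groups)


query_string = build_group_query("user query", [12, 35, 36])
```

Coercing each id with `int()` keeps arbitrary text out of the `member_groups` clause.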

Regards,
Daniel

Original comment by garcia.daniel.ee on 20 Mar 2009 at 3:56

GoogleCodeExporter commented 9 years ago
Added in http://code.google.com/p/django-solr-search/source/detail?r=22.

Used the add function. Waiting for some feedback.

Original comment by sean.cre...@gmail.com on 26 Mar 2009 at 12:29

GoogleCodeExporter commented 9 years ago
Hi Sean,
Please find attached a patch that helped me resolve the use case in comment 2.
Thanks
Michael

Original comment by michael.thornhill on 26 Mar 2009 at 12:56

Attachments:

GoogleCodeExporter commented 9 years ago
I think my patch above is slightly better than Michael's since reindexing is a
bit more robust.

Consider the case where someone updates the database directly. The document
won't be updated. If the object is no longer indexable, it won't be removed
from the index until post_save fires for it.

One would expect a reindex to fix this, but with Michael's patch it won't.
That's why I added the extra delete() call to my patch. It's an obscure case
but it's good to be robust...
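
To illustrate the case, here is a sketch of a reindex that also deletes. The dict-based index, the `reindex` function, and the sample data are hypothetical stand-ins, not the patch itself.

```python
def reindex(objects, is_indexable, index):
    """Refresh index entries and drop objects that are no longer indexable."""
    for obj in objects:
        if is_indexable(obj):
            index[obj["id"]] = obj["title"]
        else:
            # Without this branch a stale entry would survive the reindex.
            index.pop(obj["id"], None)


# Stale state: book 2 was indexed, then made private directly in the
# database, so no post_save handler ever ran for it.
index = {1: "Dune", 2: "Emma"}
books = [
    {"id": 1, "title": "Dune", "is_public": True},
    {"id": 2, "title": "Emma", "is_public": False},
]
reindex(books, lambda book: book["is_public"], index)
```

After the call only book 1 remains in the index.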

Original comment by craig.ds@gmail.com on 26 Mar 2009 at 7:49

GoogleCodeExporter commented 9 years ago
Fixed in rev http://code.google.com/p/django-solr-search/source/detail?r=25.

2 is a majority, so we delete on every failed is_indexable in the post_save.

The reindex also deletes based on is_indexable.

Original comment by sean.cre...@gmail.com on 31 Mar 2009 at 6:03