etianen / django-watson

Full-text multi-table search application for Django. Easy to install and use, with good performance.
BSD 3-Clause "New" or "Revised" License

Getting extremely broad search results when searching on username field #243

Open ianfitzpatrick opened 6 years ago

ianfitzpatrick commented 6 years ago

I am running into a weird issue where, when searching on a username (like bob@example.com), for certain users I get extremely broad results: users that definitely do not have that phrase in their title, description, or content fields. In one case I get 7000+ results in my queryset, even though the email in question is definitely only associated with one entry in my index.

To make things more confusing, some searches return as expected. If I do "kara@example.com" for instance, I get exactly one result, as would be expected since username is a unique field.

Here is my app config:

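from django.apps import AppConfig
from watson import search as watson

# CaseInsensitiveSearchAdapter is the custom adapter shown below.


class UsersAppConfig(AppConfig):
    """
    Automatically import standalone signals file once app is ready.

    Works around a circular import error we would otherwise face.
    """

    name = 'users'

    def ready(self):
        # Imported for its side effects only.
        import signals
        from django.contrib.auth.models import User
        # Index the name and username fields of the built-in User model.
        watson.register(
            User, CaseInsensitiveSearchAdapter, fields=(
                'first_name',
                'last_name',
                'username'
            )
        )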

And the custom adapter I created based on some code you posted:

class CaseInsensitiveSearchAdapter(watson.SearchAdapter):

    def get_title(self, obj):
        return super(
            CaseInsensitiveSearchAdapter, self
        ).get_title(obj).lower()

    def get_description(self, obj):
        return super(
            CaseInsensitiveSearchAdapter, self
        ).get_description(obj).lower()

    def get_content(self, obj):
        return super(
            CaseInsensitiveSearchAdapter, self
        ).get_content(obj).lower()

I am using MySQL as my database. When I manually inspect the data in the index, I don't see any duplication of data. And if I do a normal contains query for "bob@example.com" I only get one result.

Sorry this is not the best issue as I don't know how to provide a reduced case here. Maybe there is a forehead thunker here that sticks out though?

Thanks so much for your work on this project, it's really awesome. I'm in the process of replacing haystack + solr with this, and if I can just get this weird case figured out it will greatly reduce the moving pieces in my system.

ianfitzpatrick commented 6 years ago

One idea I had was, could this be some weird interaction between the @ symbol and the query used in the MySQL backend? Just a WAG, but thought I'd throw it out there.

ianfitzpatrick commented 6 years ago

Okay I think I'm on the right track with my @ symbol theory. If I change:

backends.py:

RE_MYSQL_ESCAPE_CHARS = re.compile(r'["()><~*+-]', re.UNICODE)

to (add an @):

RE_MYSQL_ESCAPE_CHARS = re.compile(r'["()><~*+-]@', re.UNICODE)

And then enclose my actual search query text in " " I get the result I am expecting, exactly one result for "bob@example.com".
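
A quick way to see what each pattern does to a search string is to run it outside Django. The sketch below uses only Python's `re` module. `strip_special` and `AT_IN_CLASS` are hypothetical names, and replacing each match with a space is an assumption about how the backend uses the pattern, not something confirmed in this thread. It shows that, as written above, the `@` sits outside the character class, so the pattern only matches a special character immediately followed by `@`.

```python
import re

# Pattern exactly as written above: "@" is outside the character class,
# so it matches one special character *followed by* "@".
AS_WRITTEN = re.compile(r'["()><~*+-]@', re.UNICODE)

# Hypothetical alternative: "@" inside the class ("-" stays last so it is literal).
AT_IN_CLASS = re.compile(r'["()><~*+@-]', re.UNICODE)


def strip_special(pattern, search_text):
    """Replace every match of pattern with a space (illustrative helper only)."""
    return pattern.sub(" ", search_text)


query = '"bob@example.com"'
print(repr(strip_special(AS_WRITTEN, query)))   # quotes and "@" are left untouched
print(repr(strip_special(AT_IN_CLASS, query)))  # quotes and "@" become spaces
```

If the as-written pattern is what was actually deployed, the surrounding quotes would pass through unescaped, which may be why the quoted search behaved like a phrase match.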

According to the MySQL docs this is an exact phrase match, I believe. Relevant SO answer: https://stackoverflow.com/questions/8961148/mysql-match-against-when-searching-e-mail-addresses

I'm in a situation where I want flexibility: users can search on name or email, so in the case of email I want to do an exact match, but I want broader results when searching on name.

I still don't get why just some particular usernames (emails) are triggering these very broad search results, whereas others are not. But I can live with that if I can just work around the issue.

So I think I just need to do some pre-processing on my search text and, if I detect something email-like in it, auto-enclose it in quotes (my users will not have the savvy to do this themselves).
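
The pre-processing idea could be sketched roughly as follows. This is a minimal sketch, not part of django-watson: `EMAIL_LIKE` and `quote_emails` are hypothetical names, and the pattern is deliberately loose (anything shaped like `local@domain.tld`) rather than a full email validator.

```python
import re

# Deliberately loose: any run of non-space, non-quote characters shaped like
# local@domain.tld. This is an assumption, not a strict email validator.
EMAIL_LIKE = re.compile(r'[^\s"@]+@[^\s"@]+\.[^\s"@]+')


def quote_emails(search_text):
    """Wrap each email-like token in double quotes unless it is already quoted."""
    def _quote(match):
        start, end = match.span()
        already_quoted = (
            start > 0 and search_text[start - 1] == '"'
            and end < len(search_text) and search_text[end] == '"'
        )
        if already_quoted:
            return match.group(0)
        return '"{}"'.format(match.group(0))

    return EMAIL_LIKE.sub(_quote, search_text)


print(quote_emails('bob@example.com'))
print(quote_emails('kara kara@example.com'))
```

Plain name searches pass through unchanged, so they keep the broader matching. Whether the added quotes actually reach MySQL still depends on how the backend escapes the search text.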

etianen commented 6 years ago

Can I have a pull request to exclude that character? Sounds like a worthy bug fix.

etianen commented 6 years ago

(Sorry I took so long to reply, I've been snowed under at work)

ianfitzpatrick commented 6 years ago

Sure thing, I'll try and get something to you next week.