divio / aldryn-search

Haystack 2.0 search index for django CMS
Other
48 stars 77 forks source link

Language detection fails #48

Open jakob-o opened 8 years ago

jakob-o commented 8 years ago

Hi,

I am facing an issue when creating blog posts / publishing pages from the page listing in another language. The basic problem is, that get_language() returns the wrong value e.g. for this link http://localhost:8000/en/admin/cms/page/13/de/publish/?redirect_language=en&redirect_page_id=5. The same applies for blog posts.

To clarify / how to reproduce:

Expected:

Actual:

I tried to workaround that by inspecting the instance in a custom implementation of the router, unfortunately there is not unified way to do so. The Title model for instance has a language attribute whilst an Article has a language_code. The only solution for now that comes to my mind would be to store all entries in one index and query with a language filter...

Any recommendations / help regarding this are greatly appreciated.

Regards

Jakob

czpython commented 8 years ago

Hello Jakob,

Just to make sure I understand, in your example above you expect the de title to be indexed but instead the en one is the one that gets indexed? Also this only happens only on actions like publish or save() meaning only through the haystack signal handlers and not the manual rebuilding of the index?

jakob-o commented 8 years ago

EDIT: I expect the de title to be indexed on the de haystack connection, instead it is written to the en connection. The contents are correct however.

EDIT 2: This only happens when publishing. I should read your post, before answering...

Hi,

this is exactly the case. The same applies for a blog post when going to the blog page in the primary langue, lets say en, you go to Add new article... via the menu, then switch to the secondary language tab, e.g. de and then create the blog post. The language router fails to detect that the blog post was created in German and not in English. I think there actually is no easy solution to this, at least none I could think of. Maybe one could leverage the get_language(object) method on the Index somehow, but I am not sure if the router can / should access the index classes...

Thanks for your quick reply

Jakob

czpython commented 8 years ago

So yes this is a problem I'm quite familiar with.

And yes the only solution to this is to leverage get_language() method because it's the only method that knows how to fetch the "active" language for an object.

In the past, I've defined the should_update on my index base class and let it return False to prevent "realtime indexing" as a workaround for this issue.

As far as a real solution comes, I've pushed a potential fix to the fixes/let-object-decide-alias branch. It's something I've experimented with in the past but I won't push it to master today since I'd like to review it carefully and make sure it has no crazy side-effects :).

Please have a look at the branch and let me know if it works for you, I did a quick test locally and it worked.

jakob-o commented 8 years ago

Thank you, I will check it out!

jakob-o commented 8 years ago

I haven't tested the code, yet but from the looks this is what I imagined. However the line

unified_index = connections[alias].get_unified_index()

https://github.com/aldryn/aldryn-search/blob/fixes/let-object-decide-alias/aldryn_search/router.py#L32 looks suspicious to me, since we are trying to find the index of a potentially wrong connection (the alias might differ from the instance language, which is the root cause of this issue). I am no expert on haystack though and am not sure if this could lead to errors. Is there a case where one would to exclude an index in one language but not in another? Maybe a more failsafe way would be iterating over all available connections / aliases if not found with the current alias. I have no clue about the performance impact of such an implementation and whether caching the index would be a viable option or not. It's not pretty either...

What dou you think?

Regards

Jakob

czpython commented 8 years ago

Your suspicion is accurate, haystack allows you to exclude indexes from certain connections and so it is possible that we fail to get an index for a connection but in my opinion it's better than getting the language wrong every time. I think a better solution would require a change in the way haystack triggers the object indexing (signal). Iterating over all connections is not a bad idea but I'm not really a fan of brute forcing it but it might be the way to go until a proper fix can be integrated in Haystack.

I agree that the code is not very pretty because it exposes an internal logic in Haystack, but this is actually found everywhere there :)

Regarding a potential for performance impact, it really depends. Currently there's none.

Accessing a connection is just accessing a dictionary and never touching the search engine, after that we call the get_language() method which simply returns a language for the given instance. This method is defined by both the TitleIndex and ArticleIndex, neither of the two make any extra queries or requests to get the language from an instance. That said, if you add some queries in the get_language() method then yes it would result in a performance impact because it would be called once for the router and another one for the indexing.

jakob-o commented 8 years ago

FYI, works for my use cases so far.

Thank you very much

Jakob

jakob-o commented 8 years ago

Hey @czpython,

what happened to the branch if I may politely inquire? Is there any way this will make it into a release? Realtime indexing is seriously broken without this fix...

Thank you very much

Jakob

czpython commented 8 years ago

Hey @jakob-o,

I meant to post an update here.. I closed the pr https://github.com/aldryn/aldryn-search/pull/49#issuecomment-170371362 and posted an explanation as to why I decided it's not the way to go.

I have a fix for this issue on cms pages which I will push sometime this week.

As far as other addons like Aldryn Newsblog go, I'll have to think of another solution :cry:

jakob-o commented 8 years ago

EDIT: OK forget that we could use the function that I mentioned, but would there be a possibility to detect the publishing language another way?

Hey @czpython,

how about a middleware that stores the result from get_language_from_request(request, check_path=True) in a thread local and use this in the language router if available?

Regards

Jakob

jakob-o commented 8 years ago

Another possibility would be a registry, that allows language lookup based on the instance. Something like:

def get_instance_language(instance):
    try:
        lookup = language_lookups[instance.__class__]
    except KeyError:
        return get_language()
    else:
        return lookup.get_language(instance)
czpython commented 8 years ago

@jakob-o Thanks for the suggestions. I've decided to use a more explicit approach for the addons by creating a base index in https://github.com/aldryn/aldryn-translation-tools that will then provide some functionality to index the correct language for a parler object.

I will be using the add_to_index signal and make sure to ignore haystack's default post_save signal handler as it's too generic.

jakob-o commented 8 years ago

EDIT: I just realized the code changes in this repo were only for the unpublishing issue... My question remains though.

EDIT2: Created a pull request https://github.com/aldryn/aldryn-search/pull/61

Hey @czpython,

I had a look at the updated code and am not sure how this would solve the language detection problem. I am fine with it for now however (using a custom router implementing the registry approach). Two small things though:

  1. I think the action argument should be added in the signal's providing_args
  2. Would you mind refactoring the action argument to cms_action or something more distinguishable?

The latter one would spare me a lot of trouble since I am using a custom implementation based on aldryn_search and https://github.com/django-haystack/celery-haystack which is adding an action argument too which causes clashes... sigh

Thank you very much!

Jakob

czpython commented 8 years ago

Hello @jakob-o,

I've not gotten around to implement the language fix yet..

Regarding your points:

  1. I agree :)
  2. I like the name action but can rename to user_action, cms_action is too specific to cms. The idea behind this argument is to be leveraged by other indexes to conditionally update the index like it's being used by the TitleIndex.
jakob-o commented 8 years ago

How about index_action? I will happily update my PR, you decide the name.