django / djangoproject.com

Source code to djangoproject.com
https://www.djangoproject.com/
BSD 3-Clause "New" or "Revised" License

Google search results displaying doc pages not in preferred language. #621

Closed timgraham closed 7 years ago

timgraham commented 8 years ago

Searching Google will often turn up docs pages that aren't in your preferred language. We should figure out how to fix this. Possible solution: https://support.google.com/webmasters/answer/2620865?hl=en

olasitarska commented 8 years ago

Also, maybe this? https://support.google.com/webmasters/answer/189077?hl=en

Just stumbled upon this webmasters answer today.

timgraham commented 8 years ago

Looks like the hreflang addition from David is working. Now I'm always seeing English results here. Thanks for the suggestion Ola!

rfleschenberg commented 8 years ago

If my browser language preference is set to [German, English], I still find the French (or sometimes the Japanese) docs instead of the English ones. If I set it to [English, German], I find the English ones.

Example query: https://www.google.de/?q=django%20security%20advisories

I'm no expert on this, but maybe we could make use of hreflang="x-default" for the English docs?

Also, isn't it a little weird that https://docs.djangoproject.com/ja/1.9/releases/security/ exists, but has English content? Should we open another issue for this?

timgraham commented 8 years ago

  1. Google says:

    For language/country selectors or auto-redirecting homepages, you should add an annotation for the hreflang value "x-default" as well: <link rel="alternate" href="http://example.com/" hreflang="x-default" />

    It's not obvious to me that your proposal is the correct usage of x-default but I'm new to this as well.

  2. The alternate languages are added as soon as docs/intro is translated. English content will appear elsewhere until more translations are completed.

rfleschenberg commented 8 years ago

Yes, Google's documentation on this is confusing to me, too. If I understand this page correctly, it would be ok to use it in this case: https://webmasters.googleblog.com/2013/04/x-default-hreflang-for-international-pages.html

The new x-default hreflang attribute value signals to our algorithms that this page doesn’t target any specific language or locale and is the default page when no other page is better suited.

timgraham commented 8 years ago

I guess there's not much harm in trying. Care to submit a PR?

rfleschenberg commented 8 years ago

I'm quite unfamiliar with the relevant codebase, but I can create a PR if you allow me some time for it :)

rfleschenberg commented 8 years ago

Seems to be working better now -- thanks @gbdlin.

timgraham commented 8 years ago

Actually that commit isn't merged or deployed (https://github.com/django/djangoproject.com/pull/649).

rfleschenberg commented 8 years ago

Oh, right. And I can still reproduce the issue if I use a private browsing session. I will send a PR.

rfleschenberg commented 8 years ago

Unfortunately, I am still seeing this issue. Either the fix from 0ac648b is not working, or Google is really slow to pick up the change. Maybe someone with access to the Google Webmaster Tools or such can do some further analysis on this?

timgraham commented 8 years ago

Not sure if we have Webmaster Tools set up. I likely won't prioritize this myself before the 1.10 feature freeze, but I'll reopen this issue.

apollo13 commented 8 years ago

I just enabled it. On a quick glance, 0ac648b17c399f85810d72123c1cfe2452fb5810 is wrong -- we should still keep hreflang="en" there and have "x-default" in addition to en.
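
A minimal sketch of what "keep `en` and add `x-default` in addition" could look like when generating the tags. The helper name `hreflang_links` and the French and Japanese URLs are assumptions for illustration (only the Japanese URL pattern appears in this thread); this is not the code from the commit discussed here.

```python
from html import escape


def hreflang_links(urls_by_lang, default_lang="en"):
    """Build one <link rel="alternate"> tag per language, plus x-default.

    urls_by_lang maps a language code to the URL of that translation.
    (Hypothetical helper, not part of djangoproject.com.)
    """
    tags = [
        f'<link rel="alternate" href="{escape(url)}" hreflang="{escape(lang)}" />'
        for lang, url in sorted(urls_by_lang.items())
    ]
    # x-default is emitted in addition to hreflang="en", not instead of it.
    default_url = urls_by_lang[default_lang]
    tags.append(
        f'<link rel="alternate" href="{escape(default_url)}" hreflang="x-default" />'
    )
    return tags


links = hreflang_links({
    "en": "https://docs.djangoproject.com/en/1.9/releases/security/",
    "fr": "https://docs.djangoproject.com/fr/1.9/releases/security/",
    "ja": "https://docs.djangoproject.com/ja/1.9/releases/security/",
})
print("\n".join(links))
```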

rfleschenberg commented 8 years ago

I was trying to follow the example from https://webmasters.googleblog.com/2013/04/x-default-hreflang-for-international-pages.html, but it is entirely possible that I misunderstood it :)

apollo13 commented 8 years ago

I was mainly following the example at https://support.google.com/webmasters/answer/189077?hl=en

Especially: Missing confirmation links: If page A links to page B, page B must link back to page A. If this is not the case for all pages that use hreflang annotations, those annotations may be ignored or not interpreted correctly. -- not sure if x-default does the trick there.
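
The "missing confirmation links" rule quoted above can be checked mechanically. The following is only a sketch: `missing_return_links` is a hypothetical helper and the annotation data is made up to show one broken pair.

```python
def missing_return_links(annotations):
    """Return (a, b) pairs where page a lists b as an alternate but b does not list a.

    annotations maps a page URL to the set of URLs it names in its hreflang tags.
    (Hypothetical helper; the data below is illustrative, not crawled.)
    """
    missing = []
    for page, alternates in annotations.items():
        for alt in alternates:
            if alt == page:
                continue  # a self-reference needs no confirmation
            if page not in annotations.get(alt, set()):
                missing.append((page, alt))
    return sorted(missing)


annotations = {
    "/en/1.9/": {"/en/1.9/", "/fr/1.9/", "/ja/1.9/"},
    "/fr/1.9/": {"/en/1.9/", "/fr/1.9/", "/ja/1.9/"},
    "/ja/1.9/": {"/fr/1.9/", "/ja/1.9/"},  # does not link back to /en/1.9/
}
print(missing_return_links(annotations))
```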

apollo13 commented 8 years ago

After reading a little bit more (especially: http://www.rebelytics.com/hreflang-canonical/ and http://www.thesempost.com/google-alerts-webmasters-issues-hreflang-relcanonical-urls/ )

I think that google is ignoring hreflang completely since our canonical links do not match any alternate link. Proposed solution (following the second picture of the first linked article):

If we want google to index all Django versions: change the canonical to point to the latest version and serve duplicate content. Put hreflang links as currently onto the versioned pages (not on the stable one though!).

If we want google to just index one version: point every versioned page to /stable/ (canonical) and only put hreflangs there.

Either way we should remove the redirect from the "canonical" url (stable currently) back to a versioned one -- this seems to utterly confuse google.
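
As a rough sketch of the second option (every versioned page points its canonical at /stable/, and only the /stable/ pages carry hreflang links): the function `head_links`, the language list, and the exact URL layout are assumptions for illustration, not the site's actual implementation.

```python
BASE = "https://docs.djangoproject.com"
LANGS = ("en", "fr", "ja")  # assumed subset of the available languages


def head_links(lang, version, path, stable="stable"):
    """Return the canonical URL and hreflang alternates for one docs page.

    Hypothetical helper: versioned pages get a canonical only, while the
    stable page also lists every language as an alternate.
    """
    canonical = f"{BASE}/{lang}/{stable}/{path}"
    alternates = {}
    if version == stable:
        alternates = {code: f"{BASE}/{code}/{stable}/{path}" for code in LANGS}
    return {"canonical": canonical, "alternates": alternates}


stable_page = head_links("fr", "stable", "topics/auth/")
versioned_page = head_links("fr", "1.9", "topics/auth/")

# The canonical of the stable page is itself one of the alternates, which is
# the condition the linked articles say must hold.
print(stable_page["canonical"] in stable_page["alternates"].values())
print(versioned_page)
```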

gbdlin commented 8 years ago

From my point of view, the most confusing part for Google is the almost identical (untranslated) content. I'm pretty sure that Google is able to detect the actual language of a page from its content and, if that doesn't match the tags, ignores them.

In my opinion, pages that aren't more than 50% translated shouldn't be visible to Google (what is the point of indexing them anyway...).

apollo13 commented 8 years ago

@gbdlin Because figuring out which pages are actually translated and which are not is hard. Setting a canonical link and hreflang is easy and lets Google do the right thing (even if that means showing half-translated docs to Spanish users etc…). Google can handle the current situation if we properly set the links.

apollo13 commented 8 years ago

What do you think about https://github.com/django/djangoproject.com/compare/seo

rfleschenberg commented 8 years ago

Looks great to me. Thank you for doing the research on this!

apollo13 commented 8 years ago

Ok, I've pushed it to master -- let's see if Google picks up a few hreflang tags over the next few days.

apollo13 commented 8 years ago

Changed a few things around, but the results are starting to look promising: Searching "django auth" on google.at gives me English as the first match. Searching on google.fr gives me the French translation first!

timgraham commented 8 years ago

I think we're done for now.

apollo13 commented 8 years ago

Indeed :)

ulope commented 8 years ago

Unfortunately I think there is still something wrong here.

Various combinations of accept-language, localized google search page and available language version still seem to produce invalid results.

Example query: django databases

| Physical location | Accept-Language | google.(.+) | 1st result language | correct? |
|---|---|---|---|---|
| de | en-us | de | ja | |
| de | en-us | it | ja | |
| de | en-us | nl | ja | |
| de | en-us | fr | fr | :white_check_mark: |
| de | en-us | es | es | :white_check_mark: |
| us | en-us | de | ja | |
| us | de,en-US;q=0.8,en;q=0.6 | de | ja | |
| us | de,en-US;q=0.8,en;q=0.6 | es | ja | |
| us | de,en-US;q=0.8,en;q=0.6 | com | ja | |
| us | en-US,en;q=0.8,de;q=0.6 | com | en | :white_check_mark: |
| us | it,en-US;q=0.8,en;q=0.6 | de | ja | |
| us | it,en-US;q=0.8,en;q=0.6 | it | ja | |
| us | es,en-US;q=0.8,en;q=0.6 | com | es | :white_check_mark: |
| us | es,en-US;q=0.8,en;q=0.6 | de | es | :white_check_mark: |

All tests were done in "anonymous mode" and I repeated them with multiple browsers (Safari, Chrome, Firefox).

From this limited data set it seems that searches in languages for which no translated docs exist still default to the Japanese version.

apollo13 commented 8 years ago

What do you get with physical location de and Accept-Languages set to:

en-US,en;q=0.7,de;q=0.3

and

de,en-US;q=0.7,en;q=0.3

I am getting en in the first case and ja in the second one. Nice debugging!
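
For reference, the two headers differ only in which language carries the highest q-value. The sketch below orders the tags the way the header expresses them; `parse_accept_language` is a hypothetical helper, and nothing here claims to reproduce how Google actually weighs the header.

```python
def parse_accept_language(header):
    """Return the language tags of an Accept-Language header, most preferred first.

    Tags without an explicit q-value default to q=1.0; ties keep header order.
    (Simplified sketch: only the q parameter is interpreted.)
    """
    prefs = []
    for position, item in enumerate(header.split(",")):
        tag, _, params = item.strip().partition(";")
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat an unparsable q-value as least preferred
        if tag:
            prefs.append((-quality, position, tag.strip().lower()))
    return [tag for _, _, tag in sorted(prefs)]


print(parse_accept_language("en-US,en;q=0.7,de;q=0.3"))
print(parse_accept_language("de,en-US;q=0.7,en;q=0.3"))
```

In the first header English is the top preference and the English docs come first; in the second German is on top, no German docs exist, and the result is Japanese rather than the next preference, English.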

apollo13 commented 8 years ago

Does something in Django set the proper language per request? Or can that https://github.com/django/djangoproject.com/blob/master/docs/views.py#L49-L50 result in leaking languages?

ulope commented 8 years ago

I am getting en in the first case and ja in the second one.

Yes, I get that as well.

gbdlin commented 8 years ago

Looks like Google discards information about the page language when that information does not match the content of the page. In my opinion, the only solution is to release documentation for a given language once its translation is done or nearly done.

apollo13 commented 8 years ago

Looking at Transifex, only the French translation would fit the criterion of "nearly done" (just from a percentage point of view), but I do not think that that is a feasible solution -- do we really have no one at Google who can explain what we are doing wrong?

rfleschenberg commented 8 years ago

Doesn't it make more sense to say "sorry, we do not have documentation in Klingon right now, here is a link to the English documentation" than to say "here is the Klingon documentation" and then display content that is mostly in English?

apollo13 commented 8 years ago

The thing is a) how to determine if we actually have enough translated content on this page and b) I think showing half-translated documentation can have a positive effect and encourage people to submit more translations.

Either way, we cannot be the only ones with such problems…

gbdlin commented 8 years ago

@apollo13 I haven't seen the French translation in my Google search results for a long time, and I don't have French among my preferred languages in either Google or my browser, so I can say: having actually translated content helps.

I think a good way to go is to put a link to the English (or another language) version on pages that haven't been translated into the user's preferred language. We will then have some content visible in that language.

rfleschenberg commented 8 years ago

For a), I would suggest the proven best practice of "ask a human" ;)

b) is a very good point, I didn't think of that.

jezdez commented 7 years ago

I've stumbled over this a few times and wonder what the status of this is.

Whatever the technical solution is, I think it hurts Django's reputation as having good documentation if we show half-translated documents to people coming from Google. So I'd suggest we roll back shipping translated documentation for now until we've figured this out.

gbdlin commented 7 years ago

According to this document: https://support.google.com/webmasters/answer/182192?hl=en especially this fragment:

Google uses only the visible content of your page to determine its language. We don’t use any code-level language information such as lang attributes. You can help Google determine the language correctly by using a single language for content and navigation on each page, and by avoiding side-by-side translations.

Google will treat translations as duplicated content unless they are fully translated. I think the best solution would be to add a robots.txt file that prevents Google from indexing untranslated pages. Apparently there is no other solution.

shaib commented 7 years ago

On Tuesday 18 October 2016 11:03:44 GwynBleidD wrote:

Google will treat translations as duplicated content unless they are fully translated. I think the best solution would be to add a robots.txt file that prevents Google from indexing untranslated pages. Apparently there is no other solution.

I think there might be, but it may not be trivial.

Note that the problem is not really that Google finds an English page in the Spanish translation, but rather that links from that page point into pages which really are in Spanish; as long as English is displayed, having /es/ in the URL is not that bad. The problem is only that then we run into non-English content.

But what if we could make it so that non-translated content links back to English pages?

I'm not sure how feasible this is -- I don't know the innards of sphinx nearly well enough -- but I think if it is, it would solve most of the problem.

apollo13 commented 7 years ago

I think nuking the translated pages from orbit via robots.txt is the easiest technical approach. I do understand that this makes them harder to discover, but on the other hand, English is still the most important language for us and having a good pagerank there would help immensely.

gbdlin commented 7 years ago

@shaib the main problem is that Google is showing URLs in results from a pretty much random language, because it treats all URLs as English. The second problem is the actual untranslated content when someone arrives from Google hoping for content translated into their language. Both of those problems can be solved using robots.txt.

The third problem is seeing untranslated content after switching the language from English to any other one in the documentation interface. In my opinion this is only solvable by translating everything, and it's better to show something rather than nothing. We can add a note there that the page is not fully translated and point to where people can help translate it.

The documentation isn't hosted on Sphinx, it's a Django app, so generating a robots.txt file shouldn't be a problem. The only problem left is to decide when to switch a given page from not being indexed to being indexed.

m-aciek commented 7 years ago

@gbdlin Google says "Make sure the page language is obvious" ( https://support.google.com/webmasters/answer/182192?hl=en ). I would suggest a threshold of 90% or 100% translated content.

timgraham commented 7 years ago

@claudep, it seems you're missing out on this thread. Let us know if you have any ideas or preferences here.

claudep commented 7 years ago

The robots idea might indeed be the best track to follow. I'm not sure it will be feasible to automate it, but even if I had to manage it manually, I could do it, as I'm the one who currently syncs translated content from Transifex to djangoproject.com periodically.

apollo13 commented 7 years ago

@claudep I was thinking about writing a simple view which just disallows everything starting with a language prefix != en -- I think this is the best short-term solution for now
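
A minimal sketch of that short-term idea, assuming the language prefixes that show up later in this thread. `robots_txt_non_english` is a hypothetical name and the function only builds the response body.

```python
def robots_txt_non_english(languages, default_lang="en"):
    """Build a robots.txt body that disallows every language prefix except the default.

    (Hypothetical helper; the language list passed in below is an assumption.)
    """
    lines = ["User-agent: *"]
    for lang in sorted(languages):
        if lang != default_lang:
            lines.append(f"Disallow: /{lang}/")
    return "\n".join(lines) + "\n"


body = robots_txt_non_english(["en", "el", "es", "fr", "id", "ja", "pt_BR"])
print(body)
```

In a Django view, a body like this could be returned as `HttpResponse(body, content_type="text/plain")`.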

claudep commented 7 years ago

I'm volunteering to make something more fine-grained! Otherwise, it would be an insult to the people who have spent days or weeks translating the docs in their free time (me included!).

apollo13 commented 7 years ago

@claudep By all means, please do -- didn't mean to insult anyone. I just hope this can be automated, otherwise it might be a lot to keep up with.

claudep commented 7 years ago

OK, tell me where the robots.txt should lie in the repository and I'll suggest something.

apollo13 commented 7 years ago

Well, if it is static, you can basically put it anywhere and we will hook it up via Nginx -- if you want it dynamic make it a view in the docs app (should match docs.djangoproject.com/robots.txt).

claudep commented 7 years ago

I've just added a script to produce a robots.txt file in https://github.com/django/django-docs-translations/commit/8b160a17501ef1f14589e2f0bfd319d46b1ee787 which currently gives:

User-agent: *
Disallow: /el/1.10/internals
Disallow: /el/1.10/ref
Disallow: /el/1.10/releases
Disallow: /el/1.10/topics
Disallow: /es/1.10/howto
Disallow: /es/1.10/internals
Disallow: /es/1.10/misc
Disallow: /es/1.10/ref
Disallow: /es/1.10/releases
Disallow: /es/1.10/topics
Disallow: /fr/1.10/internals
Disallow: /fr/1.10/releases
Disallow: /id/1.10/ref
Disallow: /id/1.10/releases
Disallow: /id/1.10/topics
Disallow: /ja/1.10/faq
Disallow: /ja/1.10/howto
Disallow: /ja/1.10/internals
Disallow: /ja/1.10/misc
Disallow: /ja/1.10/ref
Disallow: /ja/1.10/releases
Disallow: /ja/1.10/topics
Disallow: /pt_BR/1.10/ref
Disallow: /pt_BR/1.10/releases
Disallow: /pt_BR/1.10/topics

Do you find it acceptable to manually copy that script's result into the djangoproject.com repository, or should we automate more?

claudep commented 7 years ago

Oh, and I set the threshold to 90% of translated content to not be listed in robots.txt.
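
The threshold rule can be sketched as follows. This is not the script from the linked commit: `disallow_lines` is a hypothetical helper, the coverage figures are invented, and treating exactly 90% as "translated enough" is an assumption.

```python
def disallow_lines(coverage, version, threshold=0.9):
    """Return robots.txt lines for every (language, section) below the threshold.

    coverage maps (lang, section) to the fraction of translated content.
    (Hypothetical helper with made-up input data.)
    """
    lines = ["User-agent: *"]
    for (lang, section), ratio in sorted(coverage.items()):
        if ratio < threshold:  # assumption: exactly 90% is not listed
            lines.append(f"Disallow: /{lang}/{version}/{section}")
    return lines


coverage = {
    ("fr", "intro"): 1.0,      # illustrative numbers only
    ("fr", "internals"): 0.2,
    ("ja", "intro"): 0.95,
    ("ja", "ref"): 0.4,
}
print("\n".join(disallow_lines(coverage, "1.10")))
```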

apollo13 commented 7 years ago

Copying into the static files dir and adding nginx config seems good enough for me. Let's do that at the DUTH sprints? Do you think it would make sense to also add disallows for older versions? I do know that they all set canonical to 1.10 and that should cause them to be excluded, but I'm not 100% sure to be honest.
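
One way to see what the generated rules cover is Python's standard `urllib.robotparser`. This is only a sketch with a shortened rule set, and it shows how Python's parser reads the file, which may differ from Googlebot in details. The helper `allowed` is an assumption for illustration.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the generated file shown earlier in the thread.
rules = """\
User-agent: *
Disallow: /ja/1.10/ref
Disallow: /ja/1.10/topics
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def allowed(path):
    """Return True if a generic crawler may fetch this docs path."""
    return parser.can_fetch("*", "https://docs.djangoproject.com" + path)


print(allowed("/ja/1.10/ref/models/"))  # covered by a Disallow prefix
print(allowed("/ja/1.10/intro/"))       # not listed
print(allowed("/ja/1.9/ref/models/"))   # older version, not listed
```

Under these rules the 1.9 path is not blocked by robots.txt itself, so excluding older versions would rest on the canonical link alone unless further Disallow lines are added.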