Closed timgraham closed 8 years ago
Also, maybe this? https://support.google.com/webmasters/answer/189077?hl=en
Just stumbled upon this webmasters answer today.
Looks like the hreflang addition from David is working. Now I'm always seeing English results here. Thanks for the suggestion Ola!
If my browser language preference is set to [German, English], I still find the French (or sometimes the Japanese) docs instead of the English ones. If I set it to [English, German], I find the English ones.
Example query: https://www.google.de/?q=django%20security%20advisories
I'm no expert on this, but maybe we could make use of hreflang="x-default" for the English docs?
Also, isn't it a little weird that https://docs.djangoproject.com/ja/1.9/releases/security/ exists, but has English content? Should we open another issue for this?
For language/country selectors or auto-redirecting homepages, you should add an annotation for the hreflang value "x-default" as well:
<link rel="alternate" href="http://example.com/" hreflang="x-default" />
It's not obvious to me that your proposal is the correct usage of x-default, but I'm new to this as well.
docs/intro is translated. English content will appear elsewhere until more translations are completed.
Yes, Google's documentation on this is confusing to me, too. If I understand this page correctly, it would be OK to use it in this case: https://webmasters.googleblog.com/2013/04/x-default-hreflang-for-international-pages.html
The new x-default hreflang attribute value signals to our algorithms that this page doesn’t target any specific language or locale and is the default page when no other page is better suited.
I guess there's not much harm in trying. Care to submit a PR?
I'm quite unfamiliar with the relevant codebase, but I can create a PR if you allow me some time for it :)
Seems to be working better now -- thanks @gbdlin.
Actually that commit isn't merged or deployed (https://github.com/django/djangoproject.com/pull/649).
Oh, right. And I can still reproduce the issue if I use a private browsing session. I will send a PR.
Unfortunately, I am still seeing this issue. Either the fix from 0ac648b is not working, or Google is really slow to pick up the change. Maybe someone with access to the Google Webmaster Tools or such can do some further analysis on this?
Not sure if we have Webmaster Tools set up. I likely won't prioritize this myself before the 1.10 feature freeze, but I'll reopen this issue.
I just enabled it. On a quick glance, 0ac648b17c399f85810d72123c1cfe2452fb5810 is wrong -- we should still keep hreflang="en" there and have "x-default" in addition to en.
I was trying to follow the example from https://webmasters.googleblog.com/2013/04/x-default-hreflang-for-international-pages.html, but it is entirely possible that I misunderstood it :)
I was mainly following the example at https://support.google.com/webmasters/answer/189077?hl=en
Especially: Missing confirmation links: If page A links to page B, page B must link back to page A. If this is not the case for all pages that use hreflang annotations, those annotations may be ignored or not interpreted correctly. -- not sure if x-default does the trick there.
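For illustration, a minimal Python sketch of the annotation set discussed above: every language version of a page emits the same list of alternates (including itself), plus an x-default entry pointing at English in addition to hreflang="en". Because each page emits the identical set, the "page B must link back to page A" requirement should hold by construction. The function name and URL layout are assumptions for this sketch, not code from the djangoproject.com repository.

```python
from html import escape


def hreflang_links(languages, version, path, default_lang="en",
                   host="https://docs.djangoproject.com"):
    """Return <link rel="alternate"> tags for one docs page.

    Hypothetical helper: the /<lang>/<version>/<path> layout mirrors the
    URLs quoted in this thread, but is an assumption here.
    """
    def url(lang):
        return f"{host}/{lang}/{version}/{path}"

    # One alternate per language, including the page's own language.
    tags = [
        f'<link rel="alternate" href="{escape(url(lang))}" hreflang="{escape(lang)}" />'
        for lang in languages
    ]
    # x-default is added on top of the "en" entry, it does not replace it.
    tags.append(
        f'<link rel="alternate" href="{escape(url(default_lang))}" hreflang="x-default" />'
    )
    return tags


if __name__ == "__main__":
    for tag in hreflang_links(["en", "fr", "ja"], "1.9", "releases/security/"):
        print(tag)
```

Whether Google honours the annotations in practice still depends on the canonical links, as discussed in the following comments.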
After reading a little bit more (especially: http://www.rebelytics.com/hreflang-canonical/ and http://www.thesempost.com/google-alerts-webmasters-issues-hreflang-relcanonical-urls/ )
I think that google is ignoring hreflang completely since our canonical links do not match any alternate link. Proposed solution (following the second picture of the first linked article):
If we want google to index all Django versions: change the canonical to point to the latest version and serve duplicate content. Put hreflang links as currently onto the versioned pages (not on the stable one though!).
If we want google to just index one version: point every versioned page to /stable/ (canonical) and only put hreflangs there.
Either way we should remove the redirect from the "canonical" url (stable currently) back to a versioned one -- this seems to utterly confuse google.
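As a rough sketch of the second option (index only one version), assuming /stable/ serves content directly instead of redirecting: every versioned page carries only a canonical link to its /stable/ counterpart, and the hreflang alternates appear on the stable page alone. The helper name and URL layout are hypothetical.

```python
HOST = "https://docs.djangoproject.com"


def head_links(lang, version, path, languages):
    """Return the <link> tags for one docs page (hypothetical helper)."""
    stable_url = f"{HOST}/{lang}/stable/{path}"
    # Every page, versioned or not, names the stable page as canonical.
    links = [f'<link rel="canonical" href="{stable_url}" />']
    if version == "stable":
        # Only the canonical page advertises its translations.
        links += [
            f'<link rel="alternate" href="{HOST}/{alt}/stable/{path}" hreflang="{alt}" />'
            for alt in languages
        ]
    return links


if __name__ == "__main__":
    print(head_links("fr", "1.9", "topics/auth/", ["en", "fr"]))
    print(head_links("fr", "stable", "topics/auth/", ["en", "fr"]))
```

With this layout the canonical URL always matches one of the alternates on the page that carries them, which is the mismatch suspected above.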
From my point of view, the most confusing part for Google is the almost identical (untranslated) content. I'm pretty sure that Google is able to check the actual language of a site based on its content and, if it doesn't match the tags, ignores them.
In my opinion, pages that aren't more than 50% translated shouldn't be visible to Google (what is the point of indexing them anyway...).
@gbdlin Because figuring out which pages are actually translated and which are not is hard. Setting a canonical link and hreflang is easy and lets Google do the right thing (even if that means showing half-translated docs to Spanish users etc…). Google can handle the current situation if we set the links properly.
What do you think about https://github.com/django/djangoproject.com/compare/seo
Looks great to me. Thank you for doing the research on this!
OK, I've pushed it to master -- let's see if Google picks up a few hreflang tags over the next few days.
Changed a few things around, but the results are starting to look promising: Searching "django auth" on google.at gives me english as first match. Searching on google.fr gives me the french translation first!
I think we're done for now.
Indeed :)
Unfortunately I think there is still something wrong here.
Various combinations of accept-language, localized google search page and available language version still seem to produce invalid results.
Example query: django databases
Physical location | Accept-Language | google.(.+) | 1st result language | correct? |
---|---|---|---|---|
de | en-us | de | ja | |
de | en-us | it | ja | |
de | en-us | nl | ja | |
de | en-us | fr | fr | :white_check_mark: |
de | en-us | es | es | :white_check_mark: |
us | en-us | de | ja | |
us | de,en-US;q=0.8,en;q=0.6 | de | ja | |
us | de,en-US;q=0.8,en;q=0.6 | es | ja | |
us | de,en-US;q=0.8,en;q=0.6 | com | ja | |
us | en-US,en;q=0.8,de;q=0.6 | com | en | :white_check_mark: |
us | it,en-US;q=0.8,en;q=0.6 | de | ja | |
us | it,en-US;q=0.8,en;q=0.6 | it | ja | |
us | es,en-US;q=0.8,en;q=0.6 | com | es | :white_check_mark: |
us | es,en-US;q=0.8,en;q=0.6 | de | es | :white_check_mark: |
All tests were done in "anonymous mode" and I repeated them with multiple browsers (Safari, Chrome, Firefox).
From this limited data set it seems that searches in languages for which no translated docs exist still default to the Japanese version.
What do you get with physical location de and Accept-Languages set to:
en-US,en;q=0.7,de;q=0.3
and
de,en-US;q=0.7,en;q=0.3
I am getting en in the first case and ja in the second one. Nice debugging!
Does something in Django set the proper language per request? Or can that https://github.com/django/djangoproject.com/blob/master/docs/views.py#L49-L50 result in leaking languages?
I am getting en in the first case and ja in the second one.
Yes, I get that as well.
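To make the two test headers concrete, here is a small Python sketch that ranks the languages in an Accept-Language header by their q-values and then picks the first one for which docs exist. It only shows the outcome a reader would expect (English when German is preferred but unavailable). How Google actually weighs this header is not known from this thread, and the set of available languages below is an assumption for the example.

```python
def parse_accept_language(header):
    """Return [(language_tag, quality)] sorted by descending quality."""
    prefs = []
    for position, item in enumerate(header.split(",")):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        quality = 1.0  # a missing q-value means q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        prefs.append((tag.strip().lower(), quality, position))
    # Higher quality first; ties keep their order in the header.
    prefs.sort(key=lambda p: (-p[1], p[2]))
    return [(tag, quality) for tag, quality, _ in prefs]


def best_match(header, available, default="en"):
    """Pick the most preferred language that is actually available."""
    for tag, quality in parse_accept_language(header):
        if quality <= 0:
            continue
        if tag in available:
            return tag
        primary = tag.split("-")[0]  # "en-us" falls back to "en"
        if primary in available:
            return primary
    return default


if __name__ == "__main__":
    available = {"en", "fr", "ja", "es"}  # assumed set, for illustration only
    print(best_match("en-US,en;q=0.7,de;q=0.3", available))
    print(best_match("de,en-US;q=0.7,en;q=0.3", available))
```

Under this simple fallback both headers resolve to English, which is why the ja result for the second header looks wrong.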
Looks like Google discards information about the page language when this information does not match the content of the page. In my opinion, the only solution for that is to release documentation for a given language only when its translation is done or nearly done.
Looking at Transifex, only the French translation would fit the criterion of "nearly done" (just from a percentage point of view), but I do not think that is a feasible solution -- do we really have no one at Google who can explain what we are doing wrong?
Doesn't it make more sense to say "sorry, we do not have documentation in Klingon right now, here is a link to the English documentation" than to say "here is the Klingon documentation" and then display content that is mostly in English?
The thing is a) how to determine if we actually have enough translated content on this page and b) I think showing half-translated documentation can have a positive effect and encourage people to submit more translations.
Either way, we cannot be the only one with such problems…
@apollo13 I haven't seen the French translation in my Google search results for a long time, and I don't have French in my preferred languages either in Google or in my browser, so I can say: having actually translated content helps.
I think a good way to go is to put a link to the English (or another language) version on pages that haven't been translated into the language preferred by the user. We will then have some content visible in that language.
For a), I would suggest the proven best practice of "ask a human" ;)
b) is a very good point, I didn't think of that.
I've stumbled over this a few times and wonder what the status of this is.
Whatever the technical solution is, I think it hurts Django's reputation for having good documentation if we show half-translated documents to people coming from Google. So I'd suggest rolling back shipping translated documentation for now until we've figured this out.
According to this document: https://support.google.com/webmasters/answer/182192?hl=en especially this fragment:
Google uses only the visible content of your page to determine its language. We don’t use any code-level language information such as lang attributes. You can help Google determine the language correctly by using a single language for content and navigation on each page, and by avoiding side-by-side translations.
Google will treat translations as duplicate content unless they are fully translated. I think the best solution will be to add a robots.txt file that will prevent Google from indexing non-translated pages. Apparently there is no other solution.
On Tuesday 18 October 2016 11:03:44 GwynBleidD wrote:
Google will treat translations as duplicate content unless they are fully translated. I think the best solution will be to add a robots.txt file that will prevent Google from indexing non-translated pages. Apparently there is no other solution.
I think there might be, but it may not be trivial.
Note that the problem is not really that Google finds an English page in the
Spanish translation, but rather that links from that page point into pages
which really are in Spanish; as long as English is displayed, having /es/
in the URL is not that bad. The problem is only that then we run into non-
English content.
But what if we could make it so that non-translated content links back to English pages?
I'm not sure how feasible this is -- I don't know the innards of sphinx nearly well enough -- but I think if it is, it would solve most of the problem.
I think nuking the translated pages from orbit via robots.txt is the easiest technical approach. I do understand that this makes them harder to discover, but on the other hand, english still is the most important language for us and having a good pagerank there would help immensely.
@shaib the main problem is that Google is showing URLs in results from a pretty much random language, because it treats all URLs as English. The second problem is actual non-translated content when someone arrives from Google, hoping for content translated into their language. Both of those problems can be solved using robots.txt.
The third problem is seeing untranslated content after changing the language from English to any other one in the documentation interface. In my opinion this is only solvable by translating everything, and it's better to show something rather than nothing. We can add a note there that the page is not fully translated, with a pointer to where readers can help translate it.
Documentation isn't hosted on Sphinx, it's a Django app, so generating a robots.txt file shouldn't be a problem. The only problem left is to decide when to switch a given page from not indexed to indexed.
@gbdlin Google says "Make sure the page language is obvious" ( https://support.google.com/webmasters/answer/182192?hl=en ). I would suggest a threshold of 90% or 100% of translated content.
@claudep, it seems you're missing out on this thread. Let us know if you have any ideas or preferences here.
The robots idea might indeed be the best track to follow. I'm not sure it will be feasible to automate it, but even if I had to manage it manually, I could do it, as I'm the one who currently syncs translated content from Transifex to djangoproject.com periodically.
@claudep I was thinking about writing a simple view which just disallows everything starting with a language prefix != en -- I think this is the best short-term solution for now
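A minimal sketch of what such a blanket rule could produce, written as a plain Python function so it can be read without the project around it. The function name and the language list are hypothetical. In a Django view the returned string could be sent back with `HttpResponse(body, content_type="text/plain")`.

```python
def blanket_robots_txt(languages, default_lang="en"):
    """Disallow every language prefix except the default one.

    Hypothetical helper: blocks whole language trees, with no per-section
    or per-version granularity.
    """
    lines = ["User-agent: *"]
    lines += [
        f"Disallow: /{lang}/"
        for lang in sorted(languages)
        if lang != default_lang
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(blanket_robots_txt(["en", "fr", "ja"]), end="")
```

The trade-off is that fully translated sections disappear from search results along with the untranslated ones.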
I'm volunteering to make something more fine-grained! Otherwise, it would be an insult to the people who have spent days or weeks translating the docs in their free time (me included!).
@claudep By all means, please do -- didn't mean to insult anyone. I just hope this can be automated, otherwise it might be a lot to keep up with.
OK, tell me where the robots.txt should lie in the repository and I'll suggest something.
Well, if it is static, you can basically put it anywhere and we will hook it up via Nginx -- if you want it dynamic make it a view in the docs app (should match docs.djangoproject.com/robots.txt).
I've just added a script to produce a robots.txt file in https://github.com/django/django-docs-translations/commit/8b160a17501ef1f14589e2f0bfd319d46b1ee787 which currently gives:
User-agent: *
Disallow: /el/1.10/internals
Disallow: /el/1.10/ref
Disallow: /el/1.10/releases
Disallow: /el/1.10/topics
Disallow: /es/1.10/howto
Disallow: /es/1.10/internals
Disallow: /es/1.10/misc
Disallow: /es/1.10/ref
Disallow: /es/1.10/releases
Disallow: /es/1.10/topics
Disallow: /fr/1.10/internals
Disallow: /fr/1.10/releases
Disallow: /id/1.10/ref
Disallow: /id/1.10/releases
Disallow: /id/1.10/topics
Disallow: /ja/1.10/faq
Disallow: /ja/1.10/howto
Disallow: /ja/1.10/internals
Disallow: /ja/1.10/misc
Disallow: /ja/1.10/ref
Disallow: /ja/1.10/releases
Disallow: /ja/1.10/topics
Disallow: /pt_BR/1.10/ref
Disallow: /pt_BR/1.10/releases
Disallow: /pt_BR/1.10/topics
Do you find it acceptable to manually copy that script's result into the djangoproject.com repository, or should we automate more?
Oh, and I set the threshold to 90% of translated content to not be listed in robots.txt.
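The thresholding idea can be sketched in a few lines of Python. This is not the script from the linked commit. It assumes the per-section translation percentages are already available as a dictionary, and it treats a section at exactly 90% as translated enough, which is an assumption about the boundary. The percentages in the usage example are made up.

```python
def robots_txt(coverage, version, threshold=90.0):
    """Build robots.txt content from translation coverage.

    coverage maps (language, section) to the percentage of translated
    content; sections below the threshold are disallowed.
    Hypothetical helper, not the script from django-docs-translations.
    """
    lines = ["User-agent: *"]
    for (lang, section), percent in sorted(coverage.items()):
        if percent < threshold:
            lines.append(f"Disallow: /{lang}/{version}/{section}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    coverage = {
        ("fr", "intro"): 100.0,
        ("fr", "releases"): 40.0,
        ("ja", "faq"): 10.0,
        ("ja", "intro"): 95.0,
    }
    print(robots_txt(coverage, "1.10"), end="")
```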
Copying into the static files dir and adding nginx config seems good enough for me. Let's do that at the DUTH sprints? Do you think it would make sense to also add disallows for older versions? I do know that they all set canonical to 1.10 and that should cause them to be excluded, but I'm not 100% sure, to be honest.
Searching Google will often turn up docs pages that aren't in your preferred language. We should figure out how to fix this. Possible solution: https://support.google.com/webmasters/answer/2620865?hl=en