Closed jaguarfi closed 3 years ago
teacher needs spring 2020: nr 8. Prioritized for summer 2020, to be allocated for one of the summer interns.
@Mankro According to Google's instructions, alternate languages should be specified in <xhtml:link rel="alternate"
elements. Support for this was implemented in Django 3.2, released in April 2021 (!).
Django ticket: https://code.djangoproject.com/ticket/27395 Commit: https://github.com/django/django/commit/16218c20606d8cd89c5393970c83da04598a3e04
Should we update our version of Django? Or monkey-patch the sitemap XML returned by Django? Or just not comply with Google's standard and hope it works?
Or we could also roll our own sitemap generator, if updating Django isn't in the cards at the moment. The sitemap format is very simple.
I was planning that we would upgrade to Django 3.2 in the autumn 2021. We could at least start with the basic parts of the sitemap without any hacks.
A+ Django 3.2 upgrade might be quite easy, but I don't know yet. The changes do not look very difficult at first.
Actually, there seems to be another way of linking between the alternate versions of a page. We could add links to each language version in the <head>
element, as explained here:
<head>
<title>Widgets, Inc</title>
<link rel="alternate" hreflang="en-gb"
href="http://en-gb.example.com/page.html" />
<link rel="alternate" hreflang="en-us"
href="http://en-us.example.com/page.html" />
<link rel="alternate" hreflang="en"
href="http://en.example.com/page.html" />
<link rel="alternate" hreflang="de"
href="http://de.example.com/page.html" />
<link rel="alternate" hreflang="x-default"
href="http://www.example.com/" />
</head>
Then we would only need the sitemap to point to the default URL of each course, and Google would figure out the alternate versions using the links in <head>
. This would be achievable with the current version of Django.
Or we could have separate sitemap.xml files for each language, and a sitemap index that aggregates links to each sitemap. This is also supported by the current version of Django.
@skulonen Interesting! Could you experiment how easy it is to add the <link hreflang>
elements to the base HTML template and the base course template? Core site pages like the accessibility statement are available in all languages listed in settings.py (in practice, Finnish and English). The course pages are only available in the languages defined in the CourseInstance model (often only one language English or Finnish, sometimes both). The course view classes/mixins also have certain language variables. The course templates should already have variables for listing all the languages of the course.
I don't think the template has direct access to a list of the course's languages. But we can add a new languages
property to the model, which splits the language
string property at pipes. In addition, we need absolute URLs for the links. We can get that from request.build_absolute_uri
. Then we just need to replace the hl
parameter for each language, for which we can easily write a new template filter. All in all, this snippet seems to work perfectly in the course template:
{% for language in instance.languages %}
<link rel="alternate" hreflang="{{ language }}" href="{{ request.build_absolute_uri|language_url:language }}" />
{% endfor %}
Sounds excellent except on building the absolute URL with the request
variable.
We don't want to use the request
variable for defining the hostname of the A+ server. We have the BASE_URL
variable in settings.py so that we can define the URL in the server. The request does not always use the correct hostname that the server could use for e.g. connecting to other servers. Especially localhost
used in testing does not work for that. The browser could use localhost, but the server must have another domain name like plus
for connecting to other containers.
Issue related to the request and absolute URL problem: https://github.com/apluslms/a-plus/issues/664
Another thing: note that Google can only search public course pages, that is, if CourseInstance.view_content_to == CourseInstance.VIEW_ACCESS.PUBLIC
. Otherwise, viewing the course requires login and the Google bot can not access them.
Ok, I understand I can't use request.build_absolute_uri
because it may not return the correct hostname. Just making sure, is it okay to use request.get_full_path
to get the URL of the current page without the hostname, and then prepend BASE_URL
to it?
Ok, I understand I can't use
request.build_absolute_uri
because it may not return the correct hostname. Just making sure, is it okay to userequest.get_full_path
to get the URL of the current page without the hostname, and then prependBASE_URL
to it?
Yes, that's good!
We can also give some hints in the sitemap regarding how often the page should be crawled and how important the page is in comparison to other pages on the site. I estimated the following values. We can also leave them out if we want, since the hints are optional. And we can also leave some of these pages out of the sitemap if we don't wan't them to be indexed.
Home page: monthly, 0.5 priority Course archive: monthly, 0.5 priority Course instance: daily, 1.0 priority Module: daily, 0.5 priority Course table of contents: monthly, 0.2 priority Exercise: daily, 0.2 priority Privacy notice, accessibility statement, support: yearly, 0.1 priority
I figured that the search engine user would much more likely want to visit a course instance main page, instead of a module or exercise out of context.
Exciting! We should talk about the priorities after a daily meeting.
I figured that the search engine user would much more likely want to visit a course instance main page, instead of a module or exercise out of context.
One use case where a search-engine user would be likely to examine a specific module (ebook) page is "random" discovery of learning materials by topic. E.g., someone googles for a Scala concept and ends up in one of our introductory course modules (and might become a student in an open-access course). I cannot estimate how rare such an occurrence might be though.
Anyway, thanks for looking into this issue!
No new feature has been added for the third bullet "Make it possible for the course to set pages not to be indexed by google (to be ommitted from the sitemap)." However, public courses may already set the audience of individual chapters so that they are only visible to logged-in users. Those chapters are also not included in the sitemap.
Conversation
TODO: