apluslms / a-plus

A+ frontend portal - A+ LMS documentation:
https://apluslms.github.io/
Other
67 stars 73 forks source link

A+ should support different URLs for different langauges (e.g. same chapter in Finnish and English); both should be indexed by Google and be searchable with targeted search #520

Closed jaguarfi closed 3 years ago

jaguarfi commented 4 years ago

Migrated from Trello

Update: the support for different language URLs has been implemented (#518). Sitemap for the Google search indexing has been implemented in https://github.com/apluslms/a-plus/pull/807.

No new feature has been added for the third bullet "Make it possible for the course to set pages not to be indexed by google (to be ommitted from the sitemap)." However, public courses may already set the audience of individual chapters so that they are only visible to logged-in users. Those chapters are also not included in the sitemap.

Conversation

Markku Riekkinen: This sounds quite tricky. Currently, the language is stored in the user session, but this feature would override that language setting temporarily if the URL supplies it. The same chapter would also have three different URLs: 1) current normal URL without language 2) URL with English, and 3) URL with Finnish language code. Would it be adequate to add the new language code to GET parameters since those would not conflict with existing URL configuration?

Juha Sorva: @markkuriekkinen Not sure. How would that interact with Google indexing and SEO? In any case, I think the current URL configuration is fundamentally flawed in that it treats language versions of chapters as different variants of the same content (same URL) when they really should be treated as separate (even if related) documents. There's been some tentative discussion with @raphendyr about redesigning the URL scheme in a future version of A+.

In any case, this is a very major issue IMHO: currently, students (and others) cannot find our content through web searches and our local Google search of ebook chapters doesn't work either.

Jaakko Kantojärvi: Due to how data is stored in the DB and stuff, I would say the query parameter that temporarily overwrites the user option for that page is the best option. Sadly, it will mostlikely mean that urls (e.g. navigation) will point to "normal" language and will drop the query parameter (Note: we can fix this with JS pretty nicely). Personally, I don't feel there is really good solution in any available ones.

Any case, with query parameters we should be able to generate sitemap for google, so they would know all the available versions. Documentation from google about the topic and sitemap on this matter:

https://support.google.com/webmasters/answer/182192 https://support.google.com/webmasters/answer/189077

p.s. those google support pages use the same system. The page includes query parameter ?hl=en when found using google search. Parameter ?hl=fi changes the page to Finnish. e.g.:

https://support.google.com/webmasters/answer/182192?hl=en https://support.google.com/webmasters/answer/182192?hl=fi

In addition, google support changes the locale parameter in your session after visiting the site. Thus, A+ could change the language from the query parameter for the current session, but keep it in user settings (#518 ). This of course, requires some UI to tell the user that they are viewing the course in a language not set in their settings.

TODO:

annirytkonen commented 4 years ago

teacher needs spring 2020: nr 8. Prioritized for summer 2020, to be allocated for one of the summer interns.

skulonen commented 3 years ago

@Mankro According to Google's instructions, alternate languages should be specified in <xhtml:link rel="alternate" elements. Support for this was implemented in Django 3.2, released in April 2021 (!).

Django ticket: https://code.djangoproject.com/ticket/27395 Commit: https://github.com/django/django/commit/16218c20606d8cd89c5393970c83da04598a3e04

Should we update our version of Django? Or monkey-patch the sitemap XML returned by Django? Or just not comply with Google's standard and hope it works?

skulonen commented 3 years ago

Or we could also roll our own sitemap generator, if updating Django isn't in the cards at the moment. The sitemap format is very simple.

markkuriekkinen commented 3 years ago

I was planning that we would upgrade to Django 3.2 in the autumn 2021. We could at least start with the basic parts of the sitemap without any hacks.

A+ Django 3.2 upgrade might be quite easy, but I don't know yet. The changes do not look very difficult at first.

skulonen commented 3 years ago

Actually, there seems to be another way of linking between the alternate versions of a page. We could add links to each language version in the <head> element, as explained here:

<head>
 <title>Widgets, Inc</title>
  <link rel="alternate" hreflang="en-gb"
       href="http://en-gb.example.com/page.html" />
  <link rel="alternate" hreflang="en-us"
       href="http://en-us.example.com/page.html" />
  <link rel="alternate" hreflang="en"
       href="http://en.example.com/page.html" />
  <link rel="alternate" hreflang="de"
       href="http://de.example.com/page.html" />
 <link rel="alternate" hreflang="x-default"
       href="http://www.example.com/" />
</head>

Then we would only need the sitemap to point to the default URL of each course, and Google would figure out the alternate versions using the links in <head>. This would be achievable with the current version of Django.

Or we could have separate sitemap.xml files for each language, and a sitemap index that aggregates links to each sitemap. This is also supported by the current version of Django.

markkuriekkinen commented 3 years ago

@skulonen Interesting! Could you experiment how easy it is to add the <link hreflang> elements to the base HTML template and the base course template? Core site pages like the accessibility statement are available in all languages listed in settings.py (in practice, Finnish and English). The course pages are only available in the languages defined in the CourseInstance model (often only one language English or Finnish, sometimes both). The course view classes/mixins also have certain language variables. The course templates should already have variables for listing all the languages of the course.

skulonen commented 3 years ago

I don't think the template has direct access to a list of the course's languages. But we can add a new languages property to the model, which splits the language string property at pipes. In addition, we need absolute URLs for the links. We can get that from request.build_absolute_uri. Then we just need to replace the hl parameter for each language, for which we can easily write a new template filter. All in all, this snippet seems to work perfectly in the course template:

{% for language in instance.languages %}
<link rel="alternate" hreflang="{{ language }}" href="{{ request.build_absolute_uri|language_url:language }}" />
{% endfor %}
markkuriekkinen commented 3 years ago

Sounds excellent except on building the absolute URL with the request variable.

We don't want to use the request variable for defining the hostname of the A+ server. We have the BASE_URL variable in settings.py so that we can define the URL in the server. The request does not always use the correct hostname that the server could use for e.g. connecting to other servers. Especially localhost used in testing does not work for that. The browser could use localhost, but the server must have another domain name like plus for connecting to other containers.

Issue related to the request and absolute URL problem: https://github.com/apluslms/a-plus/issues/664

markkuriekkinen commented 3 years ago

Another thing: note that Google can only search public course pages, that is, if CourseInstance.view_content_to == CourseInstance.VIEW_ACCESS.PUBLIC. Otherwise, viewing the course requires login and the Google bot can not access them.

https://github.com/apluslms/a-plus/blob/fb80c5e4878243cb3dcabd4434ca246b53618858/course/models.py#L375

skulonen commented 3 years ago

Ok, I understand I can't use request.build_absolute_uri because it may not return the correct hostname. Just making sure, is it okay to use request.get_full_path to get the URL of the current page without the hostname, and then prepend BASE_URL to it?

markkuriekkinen commented 3 years ago

Ok, I understand I can't use request.build_absolute_uri because it may not return the correct hostname. Just making sure, is it okay to use request.get_full_path to get the URL of the current page without the hostname, and then prepend BASE_URL to it?

Yes, that's good!

skulonen commented 3 years ago

We can also give some hints in the sitemap regarding how often the page should be crawled and how important the page is in comparison to other pages on the site. I estimated the following values. We can also leave them out if we want, since the hints are optional. And we can also leave some of these pages out of the sitemap if we don't wan't them to be indexed.

Home page: monthly, 0.5 priority Course archive: monthly, 0.5 priority Course instance: daily, 1.0 priority Module: daily, 0.5 priority Course table of contents: monthly, 0.2 priority Exercise: daily, 0.2 priority Privacy notice, accessibility statement, support: yearly, 0.1 priority

I figured that the search engine user would much more likely want to visit a course instance main page, instead of a module or exercise out of context.

markkuriekkinen commented 3 years ago

Exciting! We should talk about the priorities after a daily meeting.

jsorva commented 3 years ago

I figured that the search engine user would much more likely want to visit a course instance main page, instead of a module or exercise out of context.

One use case where a search-engine user would be likely to examine a specific module (ebook) page is "random" discovery of learning materials by topic. E.g., someone googles for a Scala concept and ends up in one of our introductory course modules (and might become a student in an open-access course). I cannot estimate how rare such an occurrence might be though.

Anyway, thanks for looking into this issue!