carpentries / sandpaper

User Interface for The Carpentries Workbench
https://carpentries.github.io/sandpaper

Think about i18n and l10n #18

Closed zkamvar closed 4 months ago

zkamvar commented 4 years ago

There are different levels for translation for a website:

  1. Translation of the menus and the messages (e.g. 404)
  2. Translation of the prose.

The former is relatively easy; the latter is quite complicated and involves tradeoffs.

David Pérez-Suárez gave a really good talk about this at CarpentryCon: https://youtu.be/IzRCuk7XX18

His solution was to have a centralized hub to hold translations and work with git submodules to control when those translations would be updated. It's not a trivial issue because the effort has to be balanced between the maintainers and the contributors. David had mentioned that the updates would only flag translation files, but the result would remain unchanged until a translator comes along and translates the changes.

To address the translation of the prose within the document, they modified https://github.com/carpentries-i18n/po4gitbook to work with the kramdown tags. Looking elsewhere, it doesn't seem that there's really any other good solution to translating markdown to po and back again.

I believe some of this process can be improved by implementing translation to XML on the backend since that provides a clear structure for lists, paragraphs, etc. The challenge is to match that to a clear grammar for how the po files are to be structured.
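
To make the markdown-to-po round trip concrete, here is a deliberately naive sketch. It is not how po4gitbook works; it assumes that every block separated by a blank line is one translation unit, and the function names `markdown_to_po` and `apply_translations` are made up for illustration. Untranslated blocks fall back to the source text, which mirrors the behaviour David described.

```python
def markdown_to_po(markdown_text):
    """Turn each blank-line-separated block into one PO entry with an empty msgstr."""
    blocks = [b.strip() for b in markdown_text.split("\n\n") if b.strip()]
    entries = []
    for block in blocks:
        # PO strings escape backslashes and double quotes; newlines become \n
        escaped = block.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        entries.append(f'msgid "{escaped}"\nmsgstr ""\n')
    return "\n".join(entries)


def apply_translations(markdown_text, translations):
    """Rebuild the document, keeping the source block when no translation exists."""
    blocks = [b.strip() for b in markdown_text.split("\n\n") if b.strip()]
    return "\n\n".join(translations.get(b) or b for b in blocks)
```

A blank-line split loses exactly the structure (nested lists, fenced divs, code chunks) that an XML representation would keep, which is why the grammar question above matters.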

There's a good breakdown of the tasks necessary for i18n: https://wiki.mageia.org/en/What_is_i18n,_what_is_l10n#I18N

I looked into how the Hugo Learn theme does i18n (https://learn.netlify.app/en/cont/i18n/). It appears that they have a basic structure for translating the menus and messages, but the prose is expected to live in separate files in separate repositories.

zkamvar commented 3 years ago

I realize now that this ties in partially with #22 because one of the ideas for deploying this is to have a site where you can go to https://swcarpentry.github.io/lesson-name/en and get to the English lesson content and https://swcarpentry.github.io/lesson-name/ar and get to the Arabic lesson content.

fmichonneau commented 3 years ago

While we are not going to have answers to all the complexity that comes with handling i18n correctly, having a plan for the translation of the template components as a first step would be great.

zkamvar commented 3 years ago

Linking similar issues:

https://github.com/r-lib/pkgdown/issues/1446 https://github.com/rstudio/bookdown/issues/1245

dpshelio commented 2 years ago

Another point to keep in mind is how to translate figures. I had tracked some options under https://github.com/carpentries-i18n/carpentry-theme/issues/6

ocaisa commented 1 year ago

@zkamvar I wonder if we can leverage CI to help in this space? I saw in #368 that there's a process for Rmd -> md and then from md to html. There are services like https://github.com/marketplace/crowdin for translations which are free for open source projects; perhaps we could leverage these? Crowdin supports markdown, but not rmarkdown as far as I could see so you'd need some CI steps to create a processed markdown branch that can be passed to crowdin, and then it can PR the translations back for other languages.

You can have a PR per language as well: https://github.com/crowdin/github-action/wiki/Separate-PRs-for-each-language

ocaisa commented 1 year ago

I found a recent video overview for how it can work: https://www.youtube.com/watch?v=5b7BMuCoKGg

That one doesn't seem to use machine translation

ocaisa commented 1 year ago

I was playing around with Crowdin this morning and I think it could be a good fit for the task.

As a test I used the shell lesson and created a project at https://crowdin.com/project/shell-novice2 (and here's a sample PR it creates and an example of incremental translation)

You get a WYSIWYG editor so context is straightforward. I found it reasonably easy to use. The only things I tweaked were which files to translate, and I explicitly included all hidden lines since we need to control/maintain the full formatting for sandpaper.

I'm a little curious to see how building the alternate languages might work. In the example I placed all Spanish content under a top-level folder (e.g. es-ES/episodes/) but this is all configurable. Do subfolders in folders like episodes retain structure after the build process? If so, then you could perhaps support an optional top-level directory for languages in the config file and place files under, e.g., episodes/en-GB with translations under episodes/es-ES/?

zkamvar commented 1 year ago

Hi @ocaisa,

Thank you for returning to this. I apologise for not responding earlier. Your first comment happened to appear at the very end of the beta phase, which kicked off a nearly month-long sprint to deploy all of our lessons two days after I received word that my funding was not being continued (see https://carpentries.org/blog/2023/06/lesson-infrastructure-updates/). Since then, my work has shifted to building capacity among the rest of The Carpentries Core Team to maintain The Workbench after I'm gone (@tobyhodges can attest to the work that I've been doing in that space).

Unfortunately, this problem does not have an easy solution (as indicated by the complexities of https://github.com/carpentries/workbench/discussions/6) and it really requires time and funding that we do not have (I want to stress that it's not at all due to a lack of motivation). One of the challenges that the community faced in the past is the fragmentation of efforts for finding solutions to the translation problems, which resulted in three different approaches.

I think it's worthwhile to talk with @maelle, @joelnitta, and @yabellini, who have all been doing work in this space, thinking about, and advocating for a more standardized (and automated) mode of translations.

In particular, @maelle and @yabellini have been working in the @ropensci space to come up with {babeldown}, an experimental package to perform automated translation (via the DeepL API) and @joelnitta created https://github.com/joelnitta/dovetail, a framework for performing translations (storing the po files in a separate po/ folder).

ocaisa commented 1 year ago

@zkamvar No worries, I was aware of the unfortunate current status, and I know there is little point in promising new developments in such a scenario. Unfortunately, I know neither R nor the inner workings of sandpaper very well, so I come back to thinking about something more straightforward in the short term.

Focusing purely on the translation of the prose, the prose itself is contained in a list of subfolders. I would like to be able to set a config option like

locale: es

that would look in places like episodes/es/ for the markdown sources that will be used when building the lesson. Reading utils-paths-source.R it looks like I could get away with doing that only in that file, but it would take time for me to figure out how to get a development environment up and running.
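
As a rough illustration of that lookup (written in Python rather than R, and not something sandpaper does today), the idea is to prefer `<folder>/<locale>/<file>` and fall back to the untranslated source. The function name `resolve_source` and its arguments are hypothetical.

```python
from pathlib import Path


def resolve_source(lesson_root, folder, filename, locale=None):
    """Return the translated file if it exists, otherwise the original source."""
    root = Path(lesson_root)
    if locale:
        candidate = root / folder / locale / filename
        if candidate.is_file():
            return candidate
    # No locale set, or no translation yet: use the file the lesson already has
    return root / folder / filename
```

With `locale: es`, `resolve_source(".", "episodes", "intro.md", "es")` would pick `episodes/es/intro.md` when it exists, so a partially translated lesson still builds.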

A short term answer for us could be to inject ourselves into the build process, overwriting the episodes with the translated versions in a preparatory step, then building the website and moving the final build under a /es subdirectory. That would be do-able for us, and doesn't come with any requirements on the Workbench side.

ocaisa commented 1 year ago

To close the circle here, for the time being I implemented a way to deploy a language build of the lesson to a subdirectory:

      - name: "Deploy Site"
        run: |
          # Configure git
          sandpaper:::check_git_user(getwd(), name = "GitHub Actions", email = "actions@github.com")
          # Prepare the worktree
          del_site <- sandpaper:::git_worktree_setup(".", fs::path(".","site","docs"), branch = "gh-pages", remote = "origin" )
          # Validate and build the lesson
          sandpaper:::validate_lesson(getwd())
          sandpaper:::build_lesson(preview = FALSE)
          # Move the lesson to a (temporary) language dir
          system("mv site/docs site/en")
          # Replace prose content with translations
          system("for DIR in episodes learners profiles instructors; do for FILE in $DIR/es-ES/*; do cp $FILE $DIR; done; done")
          # Validate and build the lesson
          sandpaper:::validate_lesson(getwd())
          sandpaper:::build_lesson(preview = FALSE)
          # Now move the lessons around into the default version with translation subdirs
          system("mv site/docs site/es")
          system("mv site/en site/docs")
          # Remove any existing language directory
          system("rm -rf site/docs/es")
          system("mv site/es site/docs/es")
          # Commit the final version which includes translations
          sandpaper:::github_worktree_commit(fs::path(getwd(),"site","docs"), "hack", "origin", "gh-pages")
          eval(del_site)

This is probably enough for our use case.
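
For anyone who would rather not embed the shell loop in a `system()` call, the "replace prose content with translations" step could be sketched in Python as below. This is only an equivalent of the `cp` loop above under the same folder layout; the function name `overlay_translations` is made up, and it overwrites the source files in place, so it should only run on a throwaway CI checkout.

```python
import shutil
from pathlib import Path


def overlay_translations(lesson_root, locale_dir,
                         folders=("episodes", "learners", "profiles", "instructors")):
    """Copy every file in <folder>/<locale_dir>/ over its counterpart in <folder>/."""
    copied = []
    for folder in folders:
        source = Path(lesson_root) / folder / locale_dir
        if not source.is_dir():
            continue  # this folder has no translations yet
        for path in sorted(source.iterdir()):
            if path.is_file():
                # Overwrite the original-language file with the translated one
                shutil.copy2(path, source.parent / path.name)
                copied.append(path.name)
    return copied
```

Calling `overlay_translations(".", "es-ES")` between the two builds would play the same role as the nested `for` loop.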

ocaisa commented 1 year ago

Also, to keep some notes here, I looked into how this could conceptually be implemented (assuming a build process as described in my previous comment). It could be done by creating a dropdown menu beside the "Instructor View" one, structured similarly:

<div class="selector-container">
  <div class="dropdown">
    <button class="btn btn-secondary dropdown-toggle bordered-button show" type="button" id="dropdownMenu1" data-bs-toggle="dropdown" aria-expanded="true">
      <svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" fill="currentColor" class="bi bi-globe" viewBox="0 0 16 16"><path d="M0 8a8 8 0 1 1 16 0A8 8 0 0 1 0 8zm7.5-6.923c-.67.204-1.335.82-1.887 1.855A7.97 7.97 0 0 0 5.145 4H7.5V1.077zM4.09 4a9.267 9.267 0 0 1 .64-1.539 6.7 6.7 0 0 1 .597-.933A7.025 7.025 0 0 0 2.255 4H4.09zm-.582 3.5c.03-.877.138-1.718.312-2.5H1.674a6.958 6.958 0 0 0-.656 2.5h2.49zM4.847 5a12.5 12.5 0 0 0-.338 2.5H7.5V5H4.847zM8.5 5v2.5h2.99a12.495 12.495 0 0 0-.337-2.5H8.5zM4.51 8.5a12.5 12.5 0 0 0 .337 2.5H7.5V8.5H4.51zm3.99 0V11h2.653c.187-.765.306-1.608.338-2.5H8.5zM5.145 12c.138.386.295.744.468 1.068.552 1.035 1.218 1.65 1.887 1.855V12H5.145zm.182 2.472a6.696 6.696 0 0 1-.597-.933A9.268 9.268 0 0 1 4.09 12H2.255a7.024 7.024 0 0 0 3.072 2.472zM3.82 11a13.652 13.652 0 0 1-.312-2.5h-2.49c.062.89.291 1.733.656 2.5H3.82zm6.853 3.472A7.024 7.024 0 0 0 13.745 12H11.91a9.27 9.27 0 0 1-.64 1.539 6.688 6.688 0 0 1-.597.933zM8.5 12v2.923c.67-.204 1.335-.82 1.887-1.855.173-.324.33-.682.468-1.068H8.5zm3.68-1h2.146c.365-.767.594-1.61.656-2.5h-2.49a13.65 13.65 0 0 1-.312 2.5zm2.802-3.5a6.959 6.959 0 0 0-.656-2.5H12.18c.174.782.282 1.623.312 2.5h2.49zM11.27 2.461c.247.464.462.98.64 1.539h1.835a7.024 7.024 0 0 0-3.072-2.472c.218.284.418.598.597.933zM10.855 4a7.966 7.966 0 0 0-.468-1.068C9.835 1.897 9.17 1.282 8.5 1.077V4h2.355z"></path></svg>
      <svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="feather feather-chevron-down"><polyline points="6 9 12 15 18 9"></polyline></svg>
    </button>
    <ul class="dropdown-menu show" aria-labelledby="dropdownMenu1" data-bs-popper="none">
      <li><button class="dropdown-item" type="button" onclick="window.location.href=window.location.pathname.replace('/es/','/');">English</button></li>
      <li><button class="dropdown-item" type="button" onclick="let t=0; window.location.href= window.location.pathname.replace('/es/','/').replace(/\//g, match => ++t === 2 ? '/es/' : match);">Spanish</button></li>
    </ul>
  </div>
</div>
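
The two `onclick` handlers are dense, so here is the same path rewriting spelled out in Python for readability. It assumes, as the JavaScript does, that the lesson is served under a single leading path segment (e.g. `/lesson-name/...`) and that the translated build lives directly below it; the function names are illustrative only.

```python
def to_english(pathname, lang_dir="es"):
    """Drop the first '/<lang_dir>/' segment, like .replace('/es/', '/') in the handler."""
    return pathname.replace(f"/{lang_dir}/", "/", 1)


def to_language(pathname, lang_dir="es"):
    """Insert '<lang_dir>/' after the first path segment (i.e. at the second slash)."""
    base = to_english(pathname, lang_dir)
    parts = base.split("/", 2)  # ['', 'lesson-name', 'rest/of/path']
    if len(parts) < 3:
        return base  # no second slash, nothing to rewrite
    return f"/{parts[1]}/{lang_dir}/{parts[2]}"
```

Because the rewrite counts slashes, it would need adjusting for a lesson served from a domain root or from a deeper path.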

I could imagine using the yaml config to define the menu text and diversion locations:

lang:
- english:
    directory: ''
    name: 'English'
- spanish:
    directory: 'es'
    name: 'Spanish'

This is also flexible enough to allow me to do special things for HPC Carpentry:

lang:
- english_slurm:
    directory: ''
    name: 'English/Slurm'
- english_pbs:
    directory: 'en_pbs'
    name: 'English/PBS'

This introduces a lot of moving parts, but I have gotten it to work in principle.
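
One way the config could drive the menu is sketched below. The YAML is shown as the Python structure a YAML parser would return (a list of single-key mappings), so no third-party library is needed; `menu_items`, `render_menu`, and the `data-lang-dir` attribute are hypothetical names, not anything sandpaper provides.

```python
from html import escape

# The first `lang:` example above, as a YAML parser would load it
LANG_CONFIG = [
    {"english": {"directory": "", "name": "English"}},
    {"spanish": {"directory": "es", "name": "Spanish"}},
]


def menu_items(lang_config):
    """Return (name, directory) pairs in the order they appear in the config."""
    items = []
    for entry in lang_config:
        for settings in entry.values():
            items.append((settings["name"], settings["directory"]))
    return items


def render_menu(lang_config):
    """Render one <li> per language, carrying the target directory as a data attribute."""
    rows = [
        f'<li><button class="dropdown-item" type="button" '
        f'data-lang-dir="{escape(directory)}">{escape(name)}</button></li>'
        for name, directory in menu_items(lang_config)
    ]
    return "\n".join(rows)
```

The same functions work unchanged for the HPC Carpentry variant, since only the names and directories differ.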

For me, I will probably write a python script to inject the div after building the lesson (luckily there is only 1 div with the selector-container class in each page, so I just need to do that for each html file found in the build directory)
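
A minimal sketch of such a post-build script might look like the following. It assumes the existing div opens with exactly `<div class="selector-container">` and that placing the new markup immediately before it is acceptable; both are guesses about the built pages, and `inject_dropdown` is a made-up name.

```python
from pathlib import Path

# Opening tag of the existing "Instructor View" container (assumed exact match)
MARKER = '<div class="selector-container">'


def inject_dropdown(build_dir, dropdown_html):
    """Insert dropdown_html before the single selector-container div of each page."""
    changed = []
    for page in sorted(Path(build_dir).rglob("*.html")):
        text = page.read_text(encoding="utf-8")
        if text.count(MARKER) != 1 or dropdown_html in text:
            continue  # unexpected layout, or this page was already processed
        updated = text.replace(MARKER, dropdown_html + "\n" + MARKER, 1)
        page.write_text(updated, encoding="utf-8")
        changed.append(page.name)
    return changed
```

Skipping pages that do not contain exactly one marker keeps the script from silently mangling a page whose layout has changed.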

joelnitta commented 1 year ago

Hello! As @zkamvar mentioned I'm also interested in this topic. I have also been recently trying Crowdin.

A few comments...

    Crowdin supports markdown, but not rmarkdown as far as I could see so you'd need some CI steps to create a processed markdown branch that can be passed to crowdin

I think if we can get Crowdin to recognize an Rmd file as a md file (in other words, to treat .Rmd the same as .md), the parser should work for most cases. @kozo2 and I are looking into this for translating some lessons (e.g., https://bioconductor.crowdin.com/targets-workshop). The Rmd parsing isn't working yet but we've made a request to Crowdin. We are similarly feeling optimistic about Crowdin, as it integrates pretty easily with GitHub.

UPDATE (2023-11-15): Rmd parsing now works, if you set up the crowdin.yml file correctly (example)

RE: deploying the translation. @maelle has developed an R package, babelquarto, to do this for Quarto. It currently only supports Quarto books (like the rOpenSci devguide). @zkamvar are there any plans to enable workbench-type rendering for Quarto? In addition to @maelle's package, there is an open issue about this at quarto, but it is not clear when/if it will get implemented. If workbench supported Quarto, one of these approaches could work for deployment.

BTW, I don't advise trying my {dovetail} package. I am probably going to move away from translating via local PO files as Crowdin seems like a better solution. UPDATE (2023-11-15) dovetail has been massively refactored to only perform deployment of translations created by Crowdin. It should work fine for this purpose.

yabellini commented 1 year ago

Here is a tech note we published today about How to Translate a Hugo Blog Post with Babeldown https://ropensci.org/blog/2023/09/26/how-to-translate-a-hugo-blog-post-with-babeldown/

Perhaps it is useful for this discussion. :-)

ocaisa commented 1 year ago

@yabellini Cool, I did something similar with Python in https://github.com/ocaisa/translate_md , I was planning to turn that into a GitHub Action next.

zkamvar commented 1 year ago

Hi all, thank you for filling this space with more comments and implementations. I'm really excited to see the energy in this space continuing to build.

I do want to point out a couple of things because, with translations, there are a lot of moving parts. I know that on The Carpentries Core Team end, @acrall, @tobyhodges, and @froggleston have all been involved in discussions about this and we will try to make sure that efforts are not duplicated.

  1. The {babeldown} package that @yabellini mentions uses the exact same parsing mechanism that supports The Workbench ({tinkr}, created by @maelle and maintained by me), so it is more likely to use that as an extension in the future for supporting translation generation (but does not address deployment)
  2. Q from @joelnitta

    are there any plans to enable workbench-type rendering for Quarto?

    A: It was a plan, but that is not likely to happen any time soon because it requires a significant chunk of time to design, implement, and test: (see https://github.com/carpentries/sandpaper/issues/161 for the issue and https://zkamvar.github.io/isc-proposal-workbench-2022/ for a rejected proposal to the R Consortium ISC).

  3. When considering deployment mechanisms for translation, it's important to leave room for considering how this would play with versioning (see https://github.com/carpentries/sandpaper/issues/216) as @ocaisa has already discovered in their comment.
  4. @ocaisa is correct that the place for the language dropdown is next to the instructor view. You can actually see it in some of the early wireframes for the design. It was never implemented because the question of where the translations would live and how to implement them is murky (see https://github.com/carpentries/workbench/discussions/6 for a discussion on that)
  5. I have reserved the lang: keyword to indicate the language that a lesson is written in for when https://github.com/carpentries/sandpaper/issues/205 is implemented (see swcarpentry/r-novice-gapminder config). A keyword like translate: may be better for this.

joelnitta commented 12 months ago

Update: @kozo2 and I have made progress using Crowdin as a translation platform for Workbench lessons, including deployment via {dovetail}. You can read about it in this unofficial guide to translation: https://hackmd.io/@joelnitta/SkCSC6ZNT

maelle commented 11 months ago

Relevant rOpenSci community call today in a bit more than 2 hours: https://ropensci.org/commcalls/nov2023-multilingual/

tobyhodges commented 11 months ago

Thanks for posting your update here @joelnitta.

Translation with Crowdin and deployment via {dovetail} both look very cool. It is certainly a promising lead to follow for integrating translations into Workbench sites. I reported back to the rest of the Curriculum Team about our recent conversation*. As you know, capacity for Workbench development is currently severely limited and the Maintainers' focus is on documentation and sustainability. Nevertheless, the maintainers intend to explore this further in the first quarter of 2024, with the goal of better understanding what would be required for deployment and, hopefully, implementing something after that.

*Joel, @froggleston, I and several others met to discuss this earlier in the month: sharing the notes from that call for anyone following this thread who is interested. Unfortunately I forgot to hit the 'record meeting' button - sorry!

zkamvar commented 11 months ago

Thank you to @joelnitta for opening #546 two weeks ago that brings support for translation of menu elements into the development version of The Workbench. He has provided translations for Japanese, and I have started a Spanish translation based on what I could find in https://github.com/carpentries/styles-es (but it's far from complete).

You can read more about it in the development documentation: https://carpentries.github.io/sandpaper/dev/articles/translations.html

We hope to release it next Wednesday.

yabellini commented 11 months ago

Hi, I just did this PR: https://github.com/carpentries/sandpaper/pull/552 with the Spanish translations. Thanks a lot to all for this work.

froggleston commented 4 months ago

As the formal route to translate core lesson structure strings is available now, and we have an initial path to translating content with Crowdin, I'll close this issue. This is a broad topic, so I am mindful that problems may well crop up in future, but I would hope they would be specific to particular issues or implementations rather than this broader discussion.

I'm excited for all the great efforts that have gone into translating the Workbench and materials so far! 🎉