ERDDAP / erddap

ERDDAP is a scientific data server that gives users a simple, consistent way to download subsets of gridded and tabular scientific datasets in common file formats and make graphs and maps. ERDDAP is a Free and Open Source (Apache and Apache-like) Java Servlet from NOAA NMFS SWFSC Environmental Research Division (ERD).
Creative Commons Zero v1.0 Universal

Translate ERDDAP user interface into other languages #50

Closed BobSimons closed 2 years ago

BobSimons commented 3 years ago

Most of the text for the ERDDAP user interface (other than a few static .html documents) is in messages.xml. Long ago, in my previous job, I was able to use commercial machine translation software to make translations to French, Spanish, German, Italian, and Dutch. The translation was crude but usable. For a couple of the languages, a human hand-edited the translations. That was great.

Now there are web sites (and hopefully web services) that can translate to/from hundreds of languages, and they can do a much, much better job than the software I used.

The goal of this project is to create versions of ERDDAP using different languages. (This was part of the original design of ERDDAP. That's why most of the text is in messages.xml.) There are two major tasks:

Major task # 1) Make a system which parses messages.xml, sends the parts to a machine translation system, and creates variants of messages.xml for other languages (e.g., messages.de.xml). With this alone, a given ERDDAP could have a user interface which uses another language.

Major task # 2) (Optional) Make a structural change to ERDDAP so that one ERDDAP can serve messages in different languages, e.g., https://baseUrl/erddap/... would be the English (original) interface while https://baseUrl/erddap/de/... would be the German interface, etc. But some other actual system might be better.
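To make the URL idea in major task # 2 concrete, here is a minimal sketch of picking the language from the request path. It is only an illustration, not how ERDDAP does or will do it: the class name LanguagePath, the SUPPORTED set, and both method names are hypothetical.

```java
import java.util.Set;

/** Hypothetical helper (not part of ERDDAP): derives the UI language from a request path. */
final class LanguagePath {
    // Assumption: the set of installed translations is known up front.
    static final Set<String> SUPPORTED = Set.of("de", "fr", "es");
    static final String DEFAULT_LANGUAGE = "en";

    /** Returns the language code for a path such as "/erddap/de/index.html". */
    static String languageOf(String path) {
        String[] parts = path.split("/");
        // parts[0] is the empty string before the leading slash; parts[1] should be "erddap".
        if (parts.length > 2 && "erddap".equals(parts[1]) && SUPPORTED.contains(parts[2])) {
            return parts[2];
        }
        return DEFAULT_LANGUAGE;
    }

    /** Returns the path with the language segment removed, so existing routing can be reused. */
    static String withoutLanguage(String path) {
        String lang = languageOf(path);
        if (DEFAULT_LANGUAGE.equals(lang)) {
            return path;
        }
        return "/erddap" + path.substring(("/erddap/" + lang).length());
    }
}
```

With this approach, a request for /erddap/de/griddap/index.html would be handled like /erddap/griddap/index.html, but with the German messages selected.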

Clearly, many people outside the US would appreciate this (even the Brits, who could switch to the British spellings of some words).

There are several interrelated issues for major task # 1. Different people could take on different issues.

1) There is still text in ERDDAP which has hard-coded text. These should be converted to use text from messages.xml. The easiest way to identify these is to translate messages.xml and then see what text in the ERDDAP UI still needs to be translated.

2) Sometimes, 2+ tags in messages.xml are used to hold parts of a sentence. ERDDAP combines the parts with other information to make a sentence. This often doesn't work well with translation software because different languages put different parts of the sentence in different places (e.g., German often puts the main verb at the end). Hopefully, there are few of these. These should probably be changed to the pseudo character entity system or the MessageFormat system described below.

3) Often, a message exists with a whole sentence or block of text in messages.xml, but uses a pseudo character entity (e.g., &externalLinkHtml;) to mark where ERDDAP will plug in some piece of information at runtime. This works with translation software if it simply ignores the unknown character entity and passes it through in the results. You'll have to test that this works with the translation system you choose, or deal with the problem (e.g., substitute a nonsense word temporarily).

4) Often, a message uses Java MessageFormat-style substitution placeholders like {0} and {1}. In these messages, any single quote ' is written as ''. So the translation system will have to deal with the placeholders (or you will), and you'll have to convert each '' to one single quote before translation and convert each single quote back to '' after translation.

5) Often the message is HTML content. The translation system will have to preserve the HTML tags.

6) To do the actual translation, you probably have to write a script which goes through messages.xml and, for each tag, cleans it up (e.g., converts 2 single quotes into 1 single quote), passes it to the translator, then inserts the result in a new messages.xml (e.g., with each single quote converted to 2 single quotes).

7) It would be nice to have a system which makes it easy to translate the entire messages.xml or translate/update just a single tag. Thus, if one tag is changed for a new release of ERDDAP, there should be a way to update the translated version in the other versions of messages.xml.
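Issues # 3, # 4, and # 5 can be handled with one masking step. The sketch below is only one possible approach, under stated assumptions: the class name MessageTranslator and the XQZnZQX token format are made up for illustration, and the translator is passed in as a plain function because no particular translation service is assumed. It replaces pseudo character entities, MessageFormat placeholders, and HTML tags with nonsense tokens, translates, and then restores them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of ERDDAP): shields markup from a machine translator. */
final class MessageTranslator {
    // Matches pseudo character entities (&name;), placeholders ({0}), and HTML tags.
    private static final Pattern PROTECTED =
        Pattern.compile("&[A-Za-z][A-Za-z0-9]*;|\\{\\d+(,[^}]*)?\\}|</?[A-Za-z][^>]*>");

    /**
     * @param message           one message from messages.xml
     * @param usesMessageFormat true if the message writes each single quote as ''
     * @param translator        any function that translates plain text (an assumption)
     */
    static String translate(String message, boolean usesMessageFormat,
                            UnaryOperator<String> translator) {
        // Issue # 4: collapse doubled quotes so the translator sees normal text.
        String text = usesMessageFormat ? message.replace("''", "'") : message;

        // Issues # 3, # 4, # 5: swap each protected span for a nonsense token.
        List<String> saved = new ArrayList<>();
        StringBuilder masked = new StringBuilder();
        Matcher m = PROTECTED.matcher(text);
        int last = 0;
        while (m.find()) {
            masked.append(text, last, m.start());
            masked.append("XQZ").append(saved.size()).append("ZQX");
            saved.add(m.group());
            last = m.end();
        }
        masked.append(text, last, text.length());

        String translated = translator.apply(masked.toString());

        // Put the original spans back in place of the tokens.
        for (int i = 0; i < saved.size(); i++) {
            translated = translated.replace("XQZ" + i + "ZQX", saved.get(i));
        }
        // Issue # 4: double the quotes again for MessageFormat.
        return usesMessageFormat ? translated.replace("'", "''") : translated;
    }
}
```

Whether a given translation system passes such tokens through unchanged still has to be tested, as noted in issue # 3.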

Skills required: Script writing (Java?). Fluency in (or at least familiarity with) a language other than English (to use as the test case) is useful. You need to know a little Java to deal with issues # 1 and # 2.

Difficulty: This is a huge task if done properly (lots of editing of message formats), but only medium technical difficulty. This is easily 3 months' work (after getting rolling with ERDDAP). Maybe this is too big. I don't know. Or just dive into this and see how far you can get in the available time. I suspect that just doing # 6 and # 7 (and dealing with issues # 3, # 4, # 5) might be a useful project on its own (which you might be able to do crudely in 1 week). It would also highlight pieces of text that are hard coded in ERDDAP (issue # 1) or which need modifying in ERDDAP (issue # 2).
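A crude version of # 6 and # 7 might look like the sketch below. It rests on assumptions: that messages.xml is a flat file whose root element has one child element per message, and that message text is in CDATA or otherwise XML-escaped. The class name MessagesFileTranslator and its parameters are hypothetical. The tag filter is what allows updating a single tag instead of the whole file.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper (not part of ERDDAP): translates the messages in a messages.xml document. */
final class MessagesFileTranslator {
    /**
     * @param messagesXml     the content of messages.xml
     * @param shouldTranslate selects tags by name (use tag -> true for the whole file)
     * @param translateOne    translates the text of one message (an assumption)
     * @return the XML for the translated variant, e.g., to be saved as messages.de.xml
     */
    static String translate(String messagesXml, Predicate<String> shouldTranslate,
                            UnaryOperator<String> translateOne) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(messagesXml)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Skip whitespace and comments; only element children hold messages.
                if (child.getNodeType() == Node.ELEMENT_NODE
                        && shouldTranslate.test(child.getNodeName())) {
                    // Note: this replaces any CDATA section with an escaped text node.
                    child.setTextContent(translateOne.apply(child.getTextContent()));
                }
            }
            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process the messages XML", e);
        }
    }
}
```

For a single-tag update (# 7), the input would be the existing translated file with the changed tag's text replaced by the new English text, and the filter would select only that tag.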

Mentor: Bob Simons (main author of ERDDAP)

Please also read the Programmer's Guide at https://coastwatch.pfeg.noaa.gov/erddap/download/setup.html#programmersGuide especially the "Judging Your Code Contributions" section.

Q1Zeng commented 3 years ago

Hi, my name is Qi Zeng, a math major sophomore from Georgia Institute of Technology, on track to pursue a second major in CS. I'm interested in doing the GSoC to learn something outside my school and make some contribution to an open source community. I'm proficient in English (6 years in the U.S.) and Chinese (mother tongue), and I have experience with Java. So I think I can potentially be a good fit for this translation project. Can we discuss more details about this project? For example, what can I do to get started with this issue? @BobSimons

BobSimons commented 3 years ago

@Q1Zeng, Thanks for your interest in working on this project. My understanding of GSoC (from https://summerofcode.withgoogle.com/how-it-works/#timeline ) is that students like you submit applications until Apr 13. Then we review and select the best student proposals by May 17. If you are accepted, we "bond" from May 17 - June 7 and actually work on the project together from June 7 to Aug 16. So, what you can do now is:

Best wishes.

jarvis-001 commented 3 years ago

Hi @BobSimons, I am Ghanshyam Singh Moyal, a sophomore at Indian Institute of Technology, Roorkee. I have good experience with Java. As for languages, I am proficient in English and Hindi (my mother tongue). I would love to work on this project.

BobSimons commented 3 years ago

@jarvis-001, thanks for your interest in working on this issue. Unfortunately, it is tentatively assigned to @Q1Zeng. Please consider working on one of the other ERDDAP issues, especially those marked with GSoC (see https://github.com/BobSimons/erddap/labels/GSoC). Best wishes.

BobSimons commented 2 years ago

Qi (as a Google Summer of Code intern) and I worked on a translation system in the summer of 2021. The resulting system was released in ERDDAP v2.15. Thank you, Qi!