MobilityData / gtfs-realtime-validator

Java-based tool that validates General Transit Feed Specification (GTFS)-realtime feeds

Error messages and descriptions in other languages #109

Open AntoineAugusti opened 2 years ago

AntoineAugusti commented 2 years ago

We @ transport.data.gouv.fr would be interested to have error messages, descriptions and examples from the JSON report in other languages. I'll let you guess the language we are interested in 🙃

Would it be possible for the community to translate these things and specify the language we are interested in when validating data?

We would love to be able to have reports in multiple languages as well, to avoid running the validator multiple times if we are interested in multiple languages.

cc @fchabouis @thbar

thbar commented 2 years ago

Thanks for opening the issue!

Another option: keep just the error ids and English messages in the default report, add community-maintained translations to the repo (e.g. YAML/JSON/po files), and let developers translate at display time. This could be just as useful without any actual changes to the report, and with less impact on the project.
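A minimal sketch of that display-time approach, assuming the consumer has loaded a translation file into a map keyed by errorId. The class name, the map, and the "E999" id are hypothetical and not part of the validator:

```java
import java.util.Map;

// Hypothetical display-time lookup: the validator report stays untouched and
// the consuming application maps errorId -> translated title on its own side.
final class DisplayTimeTranslator {
    // Assumed content of a community-maintained French translation file.
    private static final Map<String, String> FR_TITLES = Map.of(
            "W001", "horodatage non renseigné");

    // Returns the translated title, or the English title from the report
    // when no translation exists for this errorId.
    static String title(String errorId, String englishTitle) {
        return FR_TITLES.getOrDefault(errorId, englishTitle);
    }

    public static void main(String[] args) {
        System.out.println(title("W001", "timestamp not populated"));
        // "E999" is a made-up id used only to show the English fallback.
        System.out.println(title("E999", "some untranslated rule"));
    }
}
```

Falling back to the English text from the report means a missing translation never hides an error from the user.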

It could be managed in this repo, or be a sort of "side project" outside the repo.

Just ideas at this point!

barbeau commented 2 years ago

@AntoineAugusti @thbar I'd love to see support for multiple languages integrated directly into the validator, in a way that makes third-party contributions easy. I think it's definitely possible; it would just take some refactoring within the project and an agreement on how the translations would be stored.

As documented in https://github.com/MobilityData/gtfs-realtime-validator/tree/master/gtfs-realtime-validator-lib#output, right now the JSON output looks like this:

[ {
  "errorMessage" : {
    "messageId" : 0,
    "gtfsRtFeedIterationModel" : null,
    "validationRule" : {
      "errorId" : "W001",
      "severity" : "WARNING",
      "title" : "timestamp not populated",
      "errorDescription" : "Timestamps should be populated for all elements",
      "occurrenceSuffix" : "does not have a timestamp"
    },
    "errorDetails" : null
  },
  "occurrenceList" : [ {
    "occurrenceId" : 0,
    "messageLogModel" : null,
    "prefix" : "trip_id 277716"
  }, {
    "occurrenceId" : 0,
    "messageLogModel" : null,
    "prefix" : "trip_id 277767"
  }, {
    "occurrenceId" : 0,
    "messageLogModel" : null,
    "prefix" : "trip_id 277768"
  }, 

In the above example, three trip_updates have been validated, and each was missing a timestamp (warning W001). To put together the full message for each occurrence of the warning or error, you add the occurrence prefix to the validationRule occurrenceSuffix.

For example, in the UI the first occurrence above would read "trip_id 277716 does not have a timestamp".

This is a relatively simple example where the prefix doesn't even need to be translated, and as long as you can create a suffix in the translated language that grammatically joins with the prefix you'd really only need a translated suffix.
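As a rough sketch of that simple case, the snippet below joins an untranslated prefix with a per-language suffix. The class, the map, and the single-space join are assumptions for illustration; the validator does not currently expose anything like this:

```java
import java.util.Map;

// Hypothetical helper: builds the full occurrence message for W001 by joining
// the (untranslated) prefix with a suffix chosen by language code.
final class OccurrenceMessage {
    // occurrenceSuffix for W001 per language; the French text is illustrative.
    private static final Map<String, String> W001_SUFFIX = Map.of(
            "en", "does not have a timestamp",
            "fr", "n'a pas d'horodatage");

    // Falls back to the English suffix when the language is not available.
    static String fullMessage(String prefix, String language) {
        String suffix = W001_SUFFIX.getOrDefault(language, W001_SUFFIX.get("en"));
        return prefix + " " + suffix;
    }

    public static void main(String[] args) {
        System.out.println(fullMessage("trip_id 277716", "en"));
        System.out.println(fullMessage("trip_id 277716", "fr"));
    }
}
```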

All the suffixes are defined here as the last parameter passed into the constructor for each rule (other general rule descriptions are also configured via the same constructor): https://github.com/MobilityData/gtfs-realtime-validator/blob/master/gtfs-realtime-validator-lib/src/main/java/edu/usf/cutr/gtfsrtvalidator/lib/validation/ValidationRules.java

The prefixes are all currently defined where the rule is implemented in the code in the rules package. For example, here are the timestamp prefixes: https://github.com/MobilityData/gtfs-realtime-validator/blob/master/gtfs-realtime-validator-lib/src/main/java/edu/usf/cutr/gtfsrtvalidator/lib/validation/rules/TimestampValidator.java#L83

Some of those have more complex sentence structures where it would be harder to simply translate a prefix or suffix alone. For example, looking at E022 for "this stop arrival time is < previous stop arrival time", here's the prefix:

 String prefix = id + stopDescription + " arrival_time " + arrivalTimeText + " (" + arrivalTime + ") is less than previous stop arrival_time " + previousArrivalTimeText + " (" + previousArrivalTime + ")";

...and the suffix is just "- times must increase between two sequential stops".

So as far as an implementation goes it would be a matter of pulling those values into key/value pairs and then defining a format for integrating the values into a translated string. All of my internationalization experience is on Android, but I think we could leverage the Java internationalization framework for this and store translations in .properties files: https://www.baeldung.com/java-resourcebundle
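To make that concrete, here is a minimal sketch using java.util.PropertyResourceBundle and java.text.MessageFormat. The key name "E022.prefix", the simplified placeholders, and the French wording are all assumptions (the French is a rough, unreviewed translation). The .properties content is inlined as strings so the snippet is self-contained; in the project it would live in files such as messages.properties and messages_fr.properties loaded via ResourceBundle.getBundle:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.MessageFormat;
import java.util.Locale;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

// Hypothetical sketch: a simplified E022 prefix as a translatable pattern.
final class E022Messages {
    // Stand-ins for the contents of two .properties files.
    // Note: MessageFormat needs a doubled apostrophe ('') for a literal one.
    private static final String EN =
            "E022.prefix={0} arrival_time {1} is less than previous stop arrival_time {2}\n";
    private static final String FR =
            "E022.prefix={0} arrival_time {1} est inférieur à l''arrival_time {2} de l''arrêt précédent\n";

    // Picks the inlined "file" by language, defaulting to English.
    static ResourceBundle bundle(Locale locale) {
        String source = "fr".equals(locale.getLanguage()) ? FR : EN;
        try {
            return new PropertyResourceBundle(new StringReader(source));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Arguments are passed as strings so MessageFormat does not apply
    // locale-specific number formatting to ids or timestamps.
    static String prefix(Locale locale, String stop, String arrival, String previousArrival) {
        MessageFormat format = new MessageFormat(bundle(locale).getString("E022.prefix"), locale);
        return format.format(new Object[] {stop, arrival, previousArrival});
    }

    public static void main(String[] args) {
        // The stop description and times below are made-up sample values.
        System.out.println(prefix(Locale.ENGLISH, "stop_sequence 5", "10:05:00", "10:07:00"));
        System.out.println(prefix(Locale.FRENCH, "stop_sequence 5", "10:05:00", "10:07:00"));
    }
}
```

Numbered placeholders let translators reorder the values freely, which is what the hard-coded string concatenation cannot do.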

Using Java's framework would help automatically handle locale-specific details such as a comma instead of a period as the decimal separator, default date formats, etc. Translation platforms like Transifex should also support it (see more on this below).
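For instance, the decimal separator follows the locale passed to java.text.NumberFormat. This is only a small illustration with a hypothetical class name, not validator code:

```java
import java.text.NumberFormat;
import java.util.Locale;

// Illustrates locale-aware number formatting with the standard library.
final class LocaleFormatting {
    // Formats a value using the conventions of the given locale.
    static String decimal(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    public static void main(String[] args) {
        System.out.println(decimal(3.5, Locale.US));     // period as decimal separator
        System.out.println(decimal(3.5, Locale.FRANCE)); // comma as decimal separator
    }
}
```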

In terms of output format, is anyone aware of a standardized translation format for JSON response elements? I haven't done translations within JSON data before and couldn't easily find one. If one doesn't exist, we could mirror the GTFS Realtime Service Alerts format, which looks like this:

    header_text {
      # multiple languages/translations supported
      translation {
        text: "Stop at Elm street is closed, temporary stop at Oak street"
        language: "en"
      },
      translation {
        text: "L'arrêt à la rue Elm est fermé, l'arrêt temporaire à la rue Oak"
        language: "fr"
      },
    }

So an equivalent for this project would be something like:

{
  "errorMessage" : {
    "messageId" : 0,
    "gtfsRtFeedIterationModel" : null,
    "validationRule" : {
      "errorId" : "W001",
      "severity" : "WARNING",
      "title" : [
          {
              "text" : "timestamp not populated",
              "language" : "en"
          },
          {
              "text" : "horodatage non renseigné",
              "language" : "fr"
          }
      ],
      "errorDescription" : [
          {
              "text" : "Timestamps should be populated for all elements",
              "language" : "en"
          },
          {
              "text" : "Les horodatages doivent être renseignés pour tous les éléments",
              "language" : "fr"
          }
      ],
      "occurrenceSuffix" : [
          {
              "text" : "does not have a timestamp",
              "language" : "en"
          },
          {
              "text" : "n'a pas d'horodatage",
              "language" : "fr"
          }
      ]
    },
    "errorDetails" : null
  },
  "occurrenceList" : [ {
    "occurrenceId" : 0,
    "messageLogModel" : null,
    "prefix" : [
         {
             "text" : "trip_id 277716",
             "language" : "en"
         },
         {
             "text" : "trip_id 277716",
             "language" : "fr"
         }
    ]
  }, {

This would obviously get more complicated if translations don't fit neatly into prefixes and suffixes - I think in that case we'd need to change the output format. But if we're targeting Western languages first I think keeping it close to the existing format as in the example above might work - but let me know if you start looking at the rules and find examples where the prefix/suffix format just wouldn't work.
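On the consuming side, a client reading this proposed format would pick the entry for its language and fall back to English. A minimal sketch, assuming Java 16+ for records; the Translation record and pick method are hypothetical names, not an existing API:

```java
import java.util.List;

// Hypothetical consumer-side helper for the proposed {text, language} arrays.
final class TranslatedText {
    // Mirrors one entry of a translated field such as "title".
    record Translation(String text, String language) {}

    // Returns the text for the requested language, else the English text,
    // else an empty string when neither is present.
    static String pick(List<Translation> translations, String language) {
        return translations.stream()
                .filter(t -> t.language().equals(language))
                .map(Translation::text)
                .findFirst()
                .orElseGet(() -> translations.stream()
                        .filter(t -> t.language().equals("en"))
                        .map(Translation::text)
                        .findFirst()
                        .orElse(""));
    }

    public static void main(String[] args) {
        List<Translation> title = List.of(
                new Translation("timestamp not populated", "en"),
                new Translation("horodatage non renseigné", "fr"));
        System.out.println(pick(title, "fr"));
        System.out.println(pick(title, "de")); // no German entry, falls back to English
    }
}
```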

We could also try to leverage an existing translation platform like Transifex (which is free for OSS) in coordination with the format we decide on: https://www.transifex.com/

I've used Transifex in the context of two OSS Android projects. It simplifies communicating with translators and makes it easier for non-developers to contribute translations. It also looks like they support the Java .properties file format: https://docs.transifex.com/formats/java-properties

Any thoughts/ideas/improvements to the above?