PhilanthropyDataCommons / service

A project for collecting and serving public information associated with grant applications
GNU Affero General Public License v3.0

Design support for internationalization on the API #1089

Open slifty opened 2 months ago

slifty commented 2 months ago

This will include error responses, but also PDC-owned strings such as base field labels.

hminsky2002 commented 1 month ago

Thoughts on Internationalization

These are my collected thoughts on internationalizing the service repo!

tl;dr: I think we want to track a user's preferred language via the Accept-Language header and use that to inform our translations. We would hardcode translation tables for errors (names, messages, descriptions) and for base fields, mapping every supported language to English, and otherwise keep functionality the same.

Things that we are capable of internationalizing (service)

  1. Error logs: Any error that is output to the user (via an HTTP endpoint) should be in their preferred language. This can and probably should be handled within the outgoing error response. Should internal errors also be internationalized?

  2. Base field labels: These, to my understanding, are provided by the seed file src/database/seeds/0001-insert-base_fields.sql. It seems like we ultimately want any user to be able to submit a proposal to any instance of a PDC service, in any language. This means we don't want to provide a set of alternative seed files in different languages and say 'load the seed file in the language you want when you set up the server,' as that would limit the instance to one language. Instead, we want to keep a ground-truth base field label list and have some middleware that translates the fields in an uploaded proposal to their English equivalents. We can know which language to translate from based on a detail passed by the user, perhaps the Accept-Language header (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Language).

  3. Tests? If we are internationalizing the codebase, then we would seemingly want to describe the tests in other languages, but that may be beyond the scope. If we're only internationalizing the user experience, we do not need to do this anyway.

  4. Comments: Similarly, this is only relevant if we are internationalizing the codebase, which I don't think we are.

  5. Database field names? This seems like something we explicitly do not want to internationalize. I think we need to approach this in terms of handling data and translating it to English (more specifically, to its English counterpart in our database) before it reaches our database.

  6. File names, specifically types: Probably not, again unless we are internationalizing the repository.

  7. Data formatting, such as currency values and dates: This one is interesting! Since we have base fields that accept a monetary amount, such as ('Cost per outcome statement', '', 'cost_per_outcome_statement', 'number', 'proposal'), I would think we want to be able to present users' data to them in the appropriate currency. As far as I'm aware, we don't have a currency type as a base field. Then again, just because a proposal was written in French doesn't mean its monetary amounts are to be read as euros. Maybe we don't want to consider this at all, but I do wonder whether we want to add to proposal entities a field for the language they were originally written in, to inform how they are displayed on the front-end, among other things.
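To illustrate why language and currency are separate pieces of information, here is a minimal sketch using the standard Intl.NumberFormat API. The function name formatAmount is hypothetical and not part of the service; the point is only that the display locale and the currency code are independent inputs.

```typescript
// Hypothetical helper: the locale controls how the number is written,
// while the currency code says which currency the amount is in.
function formatAmount(amount: number, locale: string, currency: string): string {
  return new Intl.NumberFormat(locale, { style: 'currency', currency }).format(amount);
}

// Same locale, two different currencies.
const inDollars = formatAmount(1234.5, 'en-US', 'USD');
const inEuros = formatAmount(1234.5, 'en-US', 'EUR');
console.log(inDollars, inEuros);
```

Under this framing, a proposal written in French would still need an explicit currency code stored somewhere before its amounts could be displayed correctly.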

Things that we SHOULD internationalize

After reflecting on what we actually can internationalize, I think the most important aspects of the service to internationalize are user-facing error logs (namely HTTP error responses) and the base field labels. Again, I think the way to do this is to track the user's preferred language via the Accept-Language header.
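As a rough sketch of what reading the Accept-Language header could involve, the snippet below parses the header's comma-separated language tags and q-values and picks the first supported language. The names SUPPORTED_LANGUAGES, parseAcceptLanguage, and pickLanguage are hypothetical, and the supported list is an assumption made for illustration.

```typescript
// Hypothetical list of languages this sketch pretends the service supports.
const SUPPORTED_LANGUAGES = ['en', 'fr', 'es'] as const;
type SupportedLanguage = (typeof SUPPORTED_LANGUAGES)[number];
const DEFAULT_LANGUAGE: SupportedLanguage = 'en';

interface LanguagePreference {
  tag: string;
  quality: number;
}

// Parse e.g. "fr-CA,fr;q=0.9,en;q=0.8" into tags sorted by descending q-value.
function parseAcceptLanguage(header: string): LanguagePreference[] {
  return header
    .split(',')
    .map((part, index) => {
      const pieces = part.trim().split(';');
      const tag = (pieces[0] ?? '').trim().toLowerCase();
      const qParam = pieces
        .slice(1)
        .map((p) => p.trim())
        .find((p) => p.startsWith('q='));
      // A missing q parameter means the default weight of 1.
      const parsed = qParam === undefined ? 1 : Number.parseFloat(qParam.slice(2));
      return { tag, quality: Number.isNaN(parsed) ? 0 : parsed, index };
    })
    .filter((pref) => pref.tag !== '' && pref.quality > 0)
    // Ties keep the order in which the client listed them.
    .sort((a, b) => b.quality - a.quality || a.index - b.index)
    .map(({ tag, quality }) => ({ tag, quality }));
}

// Pick the first preferred tag whose primary subtag is supported.
function pickLanguage(header: string | undefined): SupportedLanguage {
  if (header === undefined) return DEFAULT_LANGUAGE;
  for (const { tag } of parseAcceptLanguage(header)) {
    const primary = tag.split('-')[0];
    const match = SUPPORTED_LANGUAGES.find((lang) => lang === primary);
    if (match !== undefined) return match;
  }
  return DEFAULT_LANGUAGE;
}
```

Falling back to English when nothing matches is a design choice of this sketch, not something decided in the thread.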

Approach for internationalizing error logs

What needs to be internationalized in an error? Our error handling middleware, as written, returns the error status code, name, message, and details:

res
    .status(statusCode)
    .contentType('application/json')
    .send({
        name: getNameForError(err),
        message: getMessageForError(err),
        details: getDetailsForError(err),
    });

Thanks to the beauty of standards, we don't have to worry about the status code. We do have to worry about everything else. Names, for example, are grabbed from the error constructor, which is fairly nifty. I think that speaks to the point that we don't want to change the existing code to be more 'internationalized'; rather, we want to add a layer of translation on top. So I think we would add logic to each of the three getter functions to translate its result into the desired language (which, again, I think we want specified via the Accept-Language header).

The flow would be something like:

  1. An error is thrown, the error handler middleware is called, and the request has 'Accept-Language: fr'
  2. We call getNameForError(err)
  3. We retrieve the English-language name for the error
  4. We put that into a lookup table from English to French
  5. We return the output of the lookup table

This seems safest, since we can ensure there will always be a valid translation: we control which content is translated on both ends. This becomes trickier when we deal with translating base fields.
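The flow above could be sketched roughly as follows. Everything here is an assumption for illustration: getNameForError is a simplified stand-in for the service's getter of the same name, the table ERROR_NAME_TRANSLATIONS and its French entries are made up, and falling back to the English name when no entry exists is a choice of this sketch.

```typescript
type Language = 'en' | 'fr';

// Stand-in for the service's getter of the same name; the real one may differ.
function getNameForError(err: Error): string {
  return err.name;
}

// Hypothetical lookup table keyed by the English name. Entries are illustrative.
const ERROR_NAME_TRANSLATIONS: Record<string, Partial<Record<Language, string>>> = {
  NotFoundError: { fr: 'Erreur : ressource introuvable' },
  InputValidationError: { fr: "Erreur de validation de l'entrée" },
};

// Steps 4 and 5: look the English name up and return the translation.
function translateErrorName(englishName: string, language: Language): string {
  if (language === 'en') return englishName;
  // Assumption: fall back to English if the table has no entry.
  return ERROR_NAME_TRANSLATIONS[englishName]?.[language] ?? englishName;
}

// Steps 2 and 3 followed by the lookup.
function getLocalizedNameForError(err: Error, language: Language): string {
  return translateErrorName(getNameForError(err), language);
}

const example = new Error('No such proposal');
example.name = 'NotFoundError';
console.log(getLocalizedNameForError(example, 'fr'));
```

The same wrapping could be applied to the message and details getters, leaving the existing English-producing code untouched.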

Approach for internationalizing base fields

This is a much trickier issue. I think the way to do it is this: just as we have an official list of base fields that we provide to the user in English, we would provide a list of base fields in every supported language. We are then once again able to control the form of the incoming data and, using a lookup table, translate it to English. If there are fields that don't yet exist in the database, we can add them as we would fields in English, but I am not sure what the best approach is when different fields are simply the same field in different languages.
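A minimal sketch of that lookup, assuming a hardcoded table from localized labels to their English counterparts. The table contents, the normalization rules, and the names LABELS_TO_ENGLISH, normalizeLabel, and toEnglishLabel are all hypothetical.

```typescript
// Hypothetical table: per language, localized label -> English label.
const LABELS_TO_ENGLISH = new Map<string, Map<string, string>>([
  ['fr', new Map([["nom de l'organisation", 'organization name']])],
  ['es', new Map([['nombre de la organización', 'organization name']])],
]);

// Assumption: labels are compared case-insensitively with collapsed whitespace.
function normalizeLabel(label: string): string {
  return label.trim().replace(/\s+/g, ' ').toLowerCase();
}

// Returns the English label, or undefined when the label is not in the table.
function toEnglishLabel(label: string, language: string): string | undefined {
  const normalized = normalizeLabel(label);
  if (language === 'en') return normalized;
  return LABELS_TO_ENGLISH.get(language)?.get(normalized);
}

console.log(toEnglishLabel("Nom de l'organisation", 'fr'));
```

Returning undefined for an unknown label leaves the open question above (what to do with fields that have no English counterpart yet) to the caller.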

hminsky2002 commented 1 month ago

@slifty @jasonaowen @bickelj I have collected my thoughts on internationalization here. No need to read the whole thing, but I wanted to draw attention to it before the check-in on Friday!

hminsky2002 commented 1 month ago

Meeting Conclusions

After a full-team meeting we have clarified our goals and path forward. First, we are decidedly looking to localize only user-facing PDC data, not the codebase. This includes error messages, but our top priority is to internationalize the base fields, as those are among the key functionality of the PDC.

Internationalizing base fields

The approach we have settled on is to refactor the existing base field type, which only supports a single value for its label, to accommodate multiple localizations. These would be stored as rows in a new table, base_field_localizations. We would drop the label column from base_fields and create the new table base_field_localizations with a foreign key on base_fields.

The base_field_localizations table would look like this:

id  base_field_id  language  localization
1   1              en        organization name
2   1              fr        nom de l'organisation
3   1              es        Nombre de la Organización
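To make the shape of these rows concrete, here is a small in-memory sketch of looking up a base field's label in a requested language. The interface and the getLabel function are hypothetical, the fallback to English is an assumption, and `es` is used as the language tag for Spanish, as it would appear in an Accept-Language value.

```typescript
// Mirrors one row of the proposed base_field_localizations table.
interface BaseFieldLocalization {
  baseFieldId: number;
  language: string;
  localization: string;
}

const localizations: BaseFieldLocalization[] = [
  { baseFieldId: 1, language: 'en', localization: 'organization name' },
  { baseFieldId: 1, language: 'fr', localization: "nom de l'organisation" },
  { baseFieldId: 1, language: 'es', localization: 'Nombre de la Organización' },
];

// Find the label in the requested language, else in the fallback language.
function getLabel(
  rows: BaseFieldLocalization[],
  baseFieldId: number,
  language: string,
  fallbackLanguage = 'en',
): string | undefined {
  const forField = rows.filter((row) => row.baseFieldId === baseFieldId);
  const exact = forField.find((row) => row.language === language);
  const fallback = forField.find((row) => row.language === fallbackLanguage);
  return (exact ?? fallback)?.localization;
}

console.log(getLabel(localizations, 1, 'fr'));
```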

A few design choices that came up in discussion regarding this plan:

  1. Using a library to provide an up-to-date list of supported languages for the localizations table
  2. Processing user preferences for language -- we should incorporate the accept-language header as the default way of determining language preference, but also accommodate switching preferences on the front-end

Further Internationalization

Ultimately the user experience is going to have to be localized, but since we are not emphasizing front-end development in this phase, it is not a top priority.

slifty commented 1 month ago

Good summary overall @hminsky2002!

Some notes on the details:

  1. localizations table should be called base_field_localizations

  2. We don't need the localizations column in the base fields table (this would violate third normal form). We already have that relationship stored via the base_field_localizations table, so you can easily query the localizations associated with base fields using a JOIN base_field_localizations ON base_field_localizations.base_field_id = base_fields.id

jasonaowen commented 1 month ago

Yes, thanks for writing this up, @hminsky2002! I think you captured everything we talked about.

The only thing I'd add is that the new table doesn't need a proxy id column; we already need a UNIQUE constraint on (base_field_id, language), so we can just make that the PRIMARY KEY!
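As an in-memory analogy for that composite key (not the actual schema or migration), the sketch below stores localizations keyed by the pair (base_field_id, language) and rejects a second row for the same pair, which is what a PRIMARY KEY on those two columns would enforce. The class and method names are hypothetical.

```typescript
// Hypothetical store keyed by (baseFieldId, language), with no separate id.
class BaseFieldLocalizationStore {
  private readonly rows = new Map<string, string>();

  private static keyFor(baseFieldId: number, language: string): string {
    return `${baseFieldId}:${language}`;
  }

  // Throws if the pair already exists, like a primary key violation would.
  insert(baseFieldId: number, language: string, localization: string): void {
    const key = BaseFieldLocalizationStore.keyFor(baseFieldId, language);
    if (this.rows.has(key)) {
      throw new Error(`Duplicate localization for base field ${baseFieldId} in "${language}"`);
    }
    this.rows.set(key, localization);
  }

  get(baseFieldId: number, language: string): string | undefined {
    return this.rows.get(BaseFieldLocalizationStore.keyFor(baseFieldId, language));
  }
}

const store = new BaseFieldLocalizationStore();
store.insert(1, 'en', 'organization name');
store.insert(1, 'fr', "nom de l'organisation");
```

Because the pair itself identifies the row, nothing in this sketch ever needs a surrogate id.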