foodcoopsat / foodsoft_hackathon

Translations for article units #10

Closed lentschi closed 4 months ago

lentschi commented 2 years ago

Translations of article units will be provided as manually compiled files (one for each language that currently exists in foodsoft) with UNECE units as keys and the following fields as value:

Additionally, translations for unit aliases as they exist in current foodsoft instances need to be provided (see last point in #25).


EDIT 07-05-2023:

This issue has quite a long ongoing discussion. I'll try to summarize this in concise TODOs:

lentschi commented 1 year ago

As soon as this has been clarified, define how this should be implemented for #18 as well.

twothreenine commented 1 year ago

I started working on this which also includes selecting which units are relevant. In our recent call, we favored three levels:

  1. selection of units relevant for food coops in Europe (metric system) which will be enabled by default
  2. selection of "recommended" units which could be relevant in the broadest sense (e.g. lb, cm, m², h, kWh etc.) for which we will provide a translation
  3. all other UN/ECE units (selection of annexes?)

When looking for existing translations for package units, I found the GS1 package type code list which seems to be related to the UN/ECE code list. I found a German translation from GS1 Austria of a part of the units provided in UN/ECE Annexes V+VI: https://www.gs1.at/sites/default/files/2022-04/GS1-Sync-Profiles-Overview-Mai-2022-V24032022.xlsx (see sheet PackageTypeGDSN). There are also some additional units with triliteral codes like BBG: Bag in box. I skipped these for now, but I think they could be useful. @lentschi, you mentioned determining if a unit is a piece unit by checking whether the code starts with an X -- this would not work here. I'd suggest you change that to checking whether the unit has a conversion factor so that we could include units like BBG (immediately or in the future). I also wonder if we should just use two-letter codes like BO instead of XBO since they seem to be more popular.

When suggesting English and German locales, I made the following design choices:

I also found cases for providing symbol locales:

I also think that unit locales should have an array field alternative symbols for cases like dag/dkg, dt/dtn, m²/qm or cm³/cc/ccm. When an article list is imported, units like these could be recognized by checking alternative symbols of different locales. What do you think about that @lentschi ?

I have a bit left to work on this, then I'll send you my proposal for levels and locales.

lentschi commented 1 year ago

When looking for existing translations for package units, I found the GS1 package type code list which seems to be related to the UN/ECE code list. I found a German translation from GS1 Austria of a part of the units provided in UN/ECE Annexes V+VI [...] (see sheet PackageTypeGDSN)

I'm not sure what kind of standard this GDSN is and how widespread it is...? Great if it provides translations for the units we need though! :+1:

There are also some additional units with triliteral codes like BBG: Bag in box. I skipped these for now, but I think they could be useful. @lentschi, you mentioned determining if a unit is a piece unit by checking whether the code starts with an X -- this would not work here.

I would prefer not to complicate things further and to stick with the standard UNECE Recommendations 20 and 21. (Adding other standards would risk duplication/ambiguity.) Even if we added them: as a conversion-less unit, BBG would just be prefixed with an X in any case and become XBBG (see next point).

I'd suggest you change that to checking whether the unit has a conversion factor so that we could include units like BBG (immediately or in the future)

Two things in response to that:

  1. Checking whether a unit has a conversion factor is already what usually happens when checking if a unit is "a piece unit". (Only in the migration did I resort to the less expensive method of checking the first character. But this will be changed in #25 anyway.)
  2. The X in XBO is actually not part of the unit code in Rec21. However Rec20 defines (in its "Intro" sheet, Point 2) that an X should be prepended to Rec21 units to avoid duplication.
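
A minimal sketch of both points, not the actual foodsoft code: the `Unit` class, its `conversion_factor` field and the helper names are assumptions made for illustration. It treats a unit without a conversion factor as a piece unit, and it prepends the `X` to Rec21 codes so they cannot collide with Rec20 codes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Unit:
    """Hypothetical unit record; the field names are assumptions."""
    code: str                                   # internal code, e.g. "KGM" or "XBO"
    conversion_factor: Optional[float] = None   # None for conversion-less units


def is_piece_unit(unit: Unit) -> bool:
    # A piece unit is one that cannot be converted into another unit.
    return unit.conversion_factor is None


def rec21_to_internal(rec21_code: str) -> str:
    # Rec20 asks for an "X" in front of Rec21 codes to avoid duplicates.
    return "X" + rec21_code


def internal_to_rec21(internal_code: str) -> str:
    # Only meaningful for codes known to come from Rec21.
    if not internal_code.startswith("X"):
        raise ValueError(f"{internal_code!r} has no X prefix")
    return internal_code[1:]
```

With this, `rec21_to_internal("BO")` gives `"XBO"`, and an external interface that expects bare Rec21 codes could call `internal_to_rec21` on the way out.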

I also wonder if we should just use two-letter codes like BO instead of XBO since they seem to be more popular.

Not quite sure, why you think them to be more popular...? In any case, when communicating through an external interface we can always strip the X if required. (Though I doubt it would be.)

Unit names should be compact [...] :+1:

For some package units there are multiple German translations [...] I hope this won't cause any confusion in a possible future interoperability use case.

What do you mean by 'interoperability use case'? When communicating through an external interface with a remote application, we will hopefully be able to communicate using the UNECE codes. If the remote app doesn't understand UNECE codes, we will have to use some kind of mapping algorithm like we did before.

I also think that unit locales should have an array field alternative symbols for cases like dag/dkg, dt/dtn, m²/qm or cm³/cc/ccm. When an article list is imported, units like these could be recognized by checking alternative symbols of different locales. What do you think about that @lentschi ?

While I do understand there might be some use cases for that, I'd vote for not providing such alternative fields for now for three reasons:

  1. We'd have to enter them manually or use yet another data source (-> effort!)
  2. Importing article lists through CSV is a bad idea to start with. That being said, after this fork, lists will be exported with the current locale's representation of the unit name (e.g. "kilogram" instead of "kg"). Importing will only work for those unit names (they must be unique in all translations). Old CSV file formats will not be supported for import. (AFAIK there has never been backwards compatibility for those imports. That's something you cannot properly do in CSV. We'd have to shift to JSON/YAML exports including schema version info to be able to have that.)
  3. I don't see any other use case in which we really need the abbreviated units apart from display in the dropdowns.

I have a bit left to work on this, then I'll send you my proposal for levels and locales.

Great, thank you so much for your work! :)

twothreenine commented 1 year ago

The X in XBO is actually not part of the unit code in Rec21. However Rec20 defines (in its "Intro" sheet, Point 2) that an X should be prepended to Rec21 units to avoid duplication.

I see. For example, in Annex I there's AE for ampere per metre, while in Annex V/VI there's (X)AE for aerosol. So we need the X for it to be unambiguous.

What do you mean by 'interoperability use case'? When communicating through an external interface with a remote application, we will hopefully be able to communicate using the UNECE codes. If the remote app doesn't understand UNECE codes, we will have to use some kind of mapping algorithm like we did before.

I meant if you create articles in Foodsoft in German and select certain units because the German locale fits best (for example Kübel since that term is more common in Austria) and then export that data and show it in English, bin might be less fitting than bucket. But that problem would also occur if we translated both bin and bucket to Eimer -- you wouldn't know which Eimer you'd select. To be more precise, we'd have to display the English unit name as well in the dropdown, but that would take up too much space.

I'd vote for not providing such alternative fields for now for three reasons: 1. We'd have to enter them manually or use yet another data source (-> effort!)

I have already done that for a number of units in English and German.

after this fork, lists will be exported with the current locale's representation of the unit name (e.g. "kilogram" instead of "kg"). Importing will only work for those unit names (They must be unique in all translations)

I think that's a bad idea for multiple reasons:

  1. The unit name is longer than the unit symbol or common code and makes the CSV less readable as a spreadsheet
  2. Article lists could not be exported and imported if a different locale is used. If it depends on the user's locale, that would even lead to problems if you send the CSV to a user who uses Foodsoft in a different language (and many people in Austrian food coops use Foodsoft in English since it is the default locale when an account is created and many don't bother to change it). If it depends on the instance's locale, this would still mean that you cannot import it in an instance which uses a different locale.
  3. It would take additional effort to make sure all unit name locales are unique in a certain language. In some cases it would also be difficult to find different words for units which translate to the same word in a certain language. For example, I have translated the units bundle and truss both to Bündel. I'd have to name them Bündel 1 and Bündel 2 or come up with different terms although Bündel might be the best translation for all three.
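
To illustrate the uniqueness problem from point 3, here is a small sketch that finds unit names shared by several codes within one locale. The function name and the placeholder keys `CODE_BUNDLE` and `CODE_TRUSS` are made up for this example; they are not real UNECE codes.

```python
from collections import defaultdict
from typing import Dict, List


def find_name_collisions(names_by_code: Dict[str, str]) -> Dict[str, List[str]]:
    """Return every (case-folded) name that more than one code maps to."""
    codes_by_name: Dict[str, List[str]] = defaultdict(list)
    for code, name in names_by_code.items():
        codes_by_name[name.casefold()].append(code)
    return {
        name: sorted(codes)
        for name, codes in codes_by_name.items()
        if len(codes) > 1
    }


# German names for one locale; placeholder keys stand in for the real codes.
german_names = {
    "CODE_BUNDLE": "Bündel",
    "CODE_TRUSS": "Bündel",
    "KGM": "Kilogramm",
}
collisions = find_name_collisions(german_names)
```

Any non-empty result means an import keyed on unit names could not tell those units apart in that locale.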

Instead, I'd propose to use the common code for exports since it is the only term that is always the same across different languages. You could also make a popup before the export where you can select whether you want common codes or symbols (or even names) if you want to implement that.

For imports, I'd vote for checking for each unit:

  1. if there's a matching common code (case-sensitive, upper case)
  2. if not, if there's a matching symbol in the instance's unit locale (case-insensitive)
  3. if not, if there's a matching alternative symbol in the instance's locale (case-insensitive)
  4. if not, if there's a matching symbol (or alternative symbol) in another locale (case-insensitive)
  5. if not, if there's a matching alternative symbol in another locale (case-insensitive)
  6. if not, if there's a matching unit name in the instance's locale (case-insensitive)
  7. if not, if there's a matching unit name in another locale (case-insensitive)

In each step, take the first match. If there's still no match, take the entered unit as a custom unit.
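
The lookup order above could be sketched roughly as follows. This is only an illustration: `UnitLocale`, `resolve_unit` and the sample data are assumptions, and `DJ` is assumed here to be the common code for decagram.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class UnitLocale:
    """Hypothetical per-locale texts of one unit."""
    name: str
    symbol: Optional[str] = None
    alt_symbols: Tuple[str, ...] = ()


# common code -> locale key -> texts
Units = Dict[str, Dict[str, UnitLocale]]


def resolve_unit(raw: str, units: Units, instance_locale: str) -> Optional[str]:
    """Return the matching common code, or None (treat as custom unit)."""
    text = raw.strip()
    if text in units:              # step 1: common code, case-sensitive
        return text
    needle = text.casefold()

    def symbol(loc: UnitLocale) -> List[str]:
        return [loc.symbol] if loc.symbol else []

    def alts(loc: UnitLocale) -> List[str]:
        return list(loc.alt_symbols)

    def name(loc: UnitLocale) -> List[str]:
        return [loc.name]

    def search(pick: Callable[[UnitLocale], List[str]], own: bool) -> Optional[str]:
        for code, locales in units.items():
            for key, loc in locales.items():
                if (key == instance_locale) != own:
                    continue       # wrong group of locales for this step
                if any(c.casefold() == needle for c in pick(loc)):
                    return code    # first match wins
        return None

    # steps 2-7, in the order proposed above
    steps = [(symbol, True), (alts, True), (symbol, False),
             (alts, False), (name, True), (name, False)]
    for pick, own in steps:
        code = search(pick, own)
        if code is not None:
            return code
    return None


sample_units: Units = {
    "KGM": {"en": UnitLocale("kilogram", "kg"),
            "de": UnitLocale("Kilogramm", "kg")},
    "DJ": {"en": UnitLocale("decagram", "dag"),
           "de": UnitLocale("Dekagramm", "dag", ("dkg",))},
}
```

With an English instance locale, `resolve_unit("dkg", sample_units, "en")` only matches in step 5 (alternative symbol in another locale), and an unknown string such as `"Kiste"` falls through to `None`.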

When transforming a price list from a supplier to a CSV for import, it would be very handy if you could just take the unit symbols from the price list and expect it to work in most cases.

Old CSV file formats will not be supported for import (AFAIK there has never been backwards-compatibility for those imports. That's something you cannot properly do in CSV. We'd have to shift to JSON/YAML exports including schema version info to be able to have that.)

Well, we could offer a checkbox or dropdown where the user can select that they're about to upload a CSV in the old format and apply the old logic. I don't think that's too important, though. There also seems to be some backwards compatibility since columns L and M are reserved.

I don't see any other use case in which we really need the abbreviated units apart from display in the dropdowns.

I'd propose to use the symbols in most menus (balancing page, order PDF etc.) instead of the longer names. I think the names should only appear in the article edit menu and the conversion popup.

lentschi commented 1 year ago

I'd vote for not providing such alternative fields for now for three reasons: 1. We'd have to enter them manually or use yet another data source (-> effort!)

I have already done that for a number of units in English and German.

I'm not talking about the effort of finding these alternate unit codes, but about the effort of implementing them in foodsoft. So far we have two data sources: the UNECE files and translations for fields in those files. Adding alternate fields for each unit code would add another layer of complexity, which might have its benefits, but which I would move to a later stage of the implementation. Also, I haven't yet heard of a use case that would justify this effort. (Just being able to enter dag as well as dkg is not really worth it IMO.)

Instead, I'd propose to use the common code for exports since it is the only term that is always the same across different languages.

Technically that would actually be the easiest way to implement it. I'd still rather not do it though: You seem to use CSV export primarily as an export/import feature. But keep in mind, some users may be using it as a human readable export format (So users can view their articles in Excel without much effort.). UNECE unit codes however are not human readable.

For imports, I'd vote for checking for each unit: [... points 1-7 ...]

All this might make sense and seems well thought through, but I won't implement it that way - at least not in this fork (too much effort!). I want to stick with a simple one-dimensional mapping approach. (e.g. the one I proposed - or if you prefer exporting the unit symbols instead).

I'd propose to use the symbols in most menus (balancing page, order PDF etc.) instead of the longer names. I think the names should only appear in the article edit menu and the conversion popup.

Sure, I could change that. I originally chose displaying "gram" over displaying just "g" in most places as the single letter seemed a bit lost. The only advantage I see in displaying the symbol only is that we gain some horizontal space, which is admittedly a bit scarce in the group order form.

Another reason I didn't use symbols in the first place: UNECE 21 doesn't have any. (We'll have to add extra logic wherever we display units: "If there's no symbol, display the name after all.")

Then, there is the problem, that UNECE is funny about the categorization of units sometimes: E.g. PTN (a portion of food) would in my opinion rather belong to UNECE 21, but they put it in 20. So it actually does have a unit symbol, but one that's not human readable: PTN. (I'm not sure if that's just an error - it's the first case I find.)

twothreenine commented 1 year ago

I originally chose displaying "gram" over displaying just "g" in most places as the single letter seemed a bit lost. The only advantage I see in displaying the symbol only is that we gain some horizontal space, which is admittedly a bit scarce in the group order form.

I think "2 kg" is much more common and better readable than "2 Kilogramm", so I'm all for symbols.

We'll have to add extra logic wherever we display units: "If there's no symbol, display the name after all."

I guess that logic should be in the unit model and referenced wherever units are displayed.

Then, there is the problem, that UNECE is funny about the categorization of units sometimes: E.g. PTN (a portion of food) would in my opinion rather belong to UNECE 21, but they put it in 20. So it actually does have a unit symbol, but one that's not human readable: PTN. (I'm not sure if that's just an error - it's the first case I find.)

PTN isn't a packaging unit, though. I haven't included any units from UNECE 20 Annex II & III anyway. Here are my lists so far (selection and translations): "Scalar units EN + DE.xlsx" and "Piece units EN + DE.xlsx". Let me know if you'd like to change anything.

lentschi commented 1 year ago

I guess that logic should be in the unit model and referenced wherever units are displayed.

It's not a technical question (It wouldn't be too hard to design), but one of effort, testing, maintenance etc. But I'll see what I can do.

I haven't included any units from UNECE 20 Annex II & III anyway.

Okay, but I wouldn't ban them as such, would you? (See samples for PTN/STC below). So, if we're to actually display the symbols from UNECE, I'll have to implement a logic that ignores symbols that are all caps and also use the unit name in such cases.

Here are my lists so far (selection and translations)

Thank you for the translations! :+1: :partying_face:

Let me know if you'd like to change anything.

My original selection included a few that are missing from yours - here they are (with German samples from our foodcoop's DB):

But I can add translations for those five myself, no worries :)

Another thing we'll need is the categorization into metric (e.g. 'kg'), imperial (e.g. 'pound') and units of neither category (e.g. 'year', 'piece'), so we can pre-set available units as we agreed. But I'll try to do that on my own too and let you review my results as soon as I'm done.

twothreenine commented 1 year ago

So, if we're to actually display the symbols from UNECE, I'll have to implement a logic that ignores symbols that are all caps and also use the unit name in such cases.

That wouldn't work for units like register ton (RT) or megawatt (MW). I think it would be better to provide alternative symbol locales in such cases.

XCB - Crate, beer (Bierkiste)

I've omitted all packaging units that refer to the content. IMO we don't need those since the content will already be specified in the article name. (beer in a crate is enough information, we don't need beer in a beer crate) Otherwise, the list would be longer (milk crate, fruit crate etc.)

XPU - Tray (e.g. a tray (Schale) of mushrooms without an exact weight)

I translated basin with Schale and tray with Tablett and included basin in the selection. But perhaps tray is more commonly used for food packaging. If you translate tray with Schale and include it in the selection, then we could translate basin with Schüssel (not selected).

XPT - Pot (e.g. a pot (Topf) planted with herbs or flowers)

PTN - Portion (e.g. a portion of apple strudel: not a whole piece, just a dessert-sized portion 😄)

STC - Stick (e.g. a stick (Stange) of salami without an exact weight)

Good catches 👍

Another thing we'll need is the categorization into metric (e.g. 'kg'), imperial (e.g. 'pound') and units of neither category (e.g. 'year', 'piece'), so we can pre-set available units as we agreed.

I thought we'd pre-set only the really common units. For the metric system I'd suggest kg, dag, g, l, ml, dl, and cl (perhaps also hg, mg, dt), and I think these could be activated in all Foodsoft instances, even if some predominantly use imperial units. Perhaps you could implement (in the migration) a check for whether any other unit symbol (from the translated scalar units) is in use, and activate that unit if so? Then we wouldn't need any categorization.
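
A rough sketch of that migration idea, assuming the existing article units are already available as bare symbol strings and that a `symbol_to_code` mapping has been built from the translated scalar units (both are assumptions, as is the function name):

```python
from typing import Dict, Iterable, Set

# Symbols proposed above as enabled by default.
DEFAULT_SYMBOLS = ("kg", "dag", "g", "l", "ml", "dl", "cl")


def units_to_activate(
    symbol_to_code: Dict[str, str],
    used_unit_strings: Iterable[str],
    default_symbols: Iterable[str] = DEFAULT_SYMBOLS,
) -> Set[str]:
    """Codes for the default symbols plus any known symbol already in use."""
    active = {symbol_to_code[s] for s in default_symbols if s in symbol_to_code}
    for raw in used_unit_strings:
        code = symbol_to_code.get(raw.strip())
        if code is not None:
            active.add(code)
    return active


# Small illustrative mapping; a real one would cover all translated units.
example_mapping = {"kg": "KGM", "g": "GRM", "lb": "LBR"}
activated = units_to_activate(example_mapping, ["lb", "Stück", " g "])
```

Strings that match no known symbol (here `"Stück"`) are simply ignored by this step and would have to be handled elsewhere in the migration.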

lentschi commented 1 year ago

So, if we're to actually display the symbols from UNECE, I'll have to implement a logic that ignores symbols that are all caps and also use the unit name in such cases.

That wouldn't work for units like register ton (RT) or megawatt (MW).

True, but I doubt anyone would miss those two and I don't see a problem with the more commonly used units. Also, I could adapt the logic to only disregard the symbol if it's the same as the UNECE CommonCode.
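
That adapted rule might look like the following sketch. The function name and parameters are hypothetical; the idea is simply to show the symbol unless it is missing or identical to the common code.

```python
from typing import Optional


def display_label(common_code: str, name: str, symbol: Optional[str]) -> str:
    """Prefer the symbol; fall back to the name if the symbol is unusable."""
    if symbol and symbol != common_code:
        return symbol
    # No symbol at all (Rec21 units) or a symbol that just repeats the code.
    return name
```

Under this rule PTN falls back to its name because its symbol equals its code, while an all-caps symbol such as MW is kept as long as it differs from the unit's common code.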

I thought we'd pre-set only the really common units. For the metric system I'd suggest kg, dag, g, l, ml, dl, and cl (perhaps also hg, mg, dt) [...] Then we wouldn't need any categorization.

Yeah, I didn't necessarily mean that we have to categorize them all. But we need those initial groups. But I'll come up with something, and you can review as soon as I deploy it to the demo server :smile: - I think it's easier that way.

lentschi commented 4 months ago

Spanish, French, Dutch and Turkish added by free API translation in https://github.com/foodcoopsat/foodsoft_hackathon/commit/3fca7de2f7ad19d9f8301911c936526327be532a