Make the Humanise function better

parduz commented 2 years ago

I think that a couple of changes to the Humanise function would help non-English users a lot:

1) "return rounded numbers instead of strings": I mean, instead of "More than 2 millions and a half" the function should return "More than 2000000 and a half" and let the voice manage the pronunciation. This would solve a lot of "speaking errors", at least with my Italian voice.
2) "return an array": like [0] = "More than", [1] = "2000000", [2] = "and a half". This could allow an easier localization without the need to split the function output again if the order has to be altered.

The first one seems an easy task (if I'm not missing how other voices work), dunno about the second.

Tkael commented 2 years ago

I'm afraid that this doesn't work well in some languages. In English we would say "More than 2 and a half million" rather than "More than 2 millions and a half". This would be rendered improperly in English if we were to change the result to "More than 2000000 and a half".
Returning an array is an interesting idea, but it would be a very disruptive change and it has the same problem as your first suggestion.

We've designed the translation to allow the translator some flexibility in how the phrase is constructed, e.g. "circa {0} milione e mezzo", where {0} is the number 2 in your example. I recognize that this is still imperfect for the Italian case (where you might need to use either "milione" or "milioni"). We'll have to think about whether we can improve this area further.

One possibility is to offer translators an opportunity to assign a different translation where the leading number is '1' (e.g. "circa {0} milione e mezzo" where {0} equals 1 and "circa {0} milioni e mezzo" where {0} is not equal to 1). While redundant for many languages, this might provide the extra degree of control that Italian and other similar languages might need. @richardbuckle Your thoughts?
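A minimal sketch of that idea, written in Python for illustration rather than taken from EDDI's code. The `pick_template` function and the `TEMPLATES_IT` dictionary are hypothetical names; the two phrases are the ones from the example above.

```python
# Hypothetical templates a translator might supply: "one" is used only when
# the leading number is exactly 1, "other" in every other case.
TEMPLATES_IT = {
    "one": "circa {0} milione e mezzo",
    "other": "circa {0} milioni e mezzo",
}


def pick_template(templates, leading_number):
    """Fill in the template that matches leading_number.

    Languages that do not need the distinction can supply only "other",
    which is used as the fallback.
    """
    key = "one" if leading_number == 1 else "other"
    template = templates.get(key, templates["other"])
    return template.format(leading_number)


print(pick_template(TEMPLATES_IT, 1))  # circa 1 milione e mezzo
print(pick_template(TEMPLATES_IT, 2))  # circa 2 milioni e mezzo
```

A language with a single form would pass only an "other" entry and get the same phrase for every leading number.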

parduz commented 2 years ago

1) mh... blame me. I think I had the same idea about a year ago, talked about it in the forum, got the same answer, and forgot about it.

> While redundant for many languages, this might provide the extra degree of control that Italian and other similar languages might need.

Well, it could be a step... with a way to "declare" the plurals in Crowdin (something like milione|milioni or milion{e|i}?) it could work. But Italian still has other issues: the voice says "1 milione" as "uno milione" when it should be "un milione", "1 mila" (1000) is wrong because we just say "mille", and there are other difficulties too. Also, sometimes the humanization should care whether there are units to talk about (Credits, Persons, Tons, Joules, whatever), because "milioni" wants "di" (millions of) but "mila" doesn't :-\ And that's just Italian; I don't know anything about other languages, but I guess Humanise will never be smart enough.
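To make those three Italian rules concrete, here is a small Python sketch. It is only an illustration of the rules as stated in this comment ("un milione", "mille", and "di" after millions but not after "mila"); the function name `italian_amount` and its parameters are made up for the example, and it covers only thousands and millions.

```python
def italian_amount(quantity, magnitude, unit=None):
    """Build an Italian phrase for quantity * magnitude, optionally with a unit."""
    if magnitude == 1_000:
        # Exactly one thousand is "mille"; "mila" takes no "di" before the unit.
        phrase = "mille" if quantity == 1 else f"{quantity} mila"
        link = ""
    elif magnitude == 1_000_000:
        # One million is "un milione" (not "uno milione"); millions want "di".
        phrase = "un milione" if quantity == 1 else f"{quantity} milioni"
        link = "di "
    else:
        raise ValueError("only thousands and millions are sketched here")
    if not unit:
        return phrase
    return f"{phrase} {link}{unit}"


print(italian_amount(1, 1_000))                   # mille
print(italian_amount(3, 1_000, "tonnellate"))     # 3 mila tonnellate
print(italian_amount(1, 1_000_000, "crediti"))    # un milione di crediti
print(italian_amount(2, 1_000_000, "crediti"))    # 2 milioni di crediti
```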

I've built an "Italianise" script and replaced each Humanise call with it (with the problem of having to add a {set xxxx to yyyy } for each function parameter): that script takes the output of Humanise and makes it correct. This gave me another idea (which I may already have mentioned somewhere, but I can't recall):

What if Humanise (and perhaps some other functions like "P"?) fired a "callback" script before returning its result? Humanise could first prepare some EDDI_translation_variable_ values (like the passed parameters, the array with the various phrase pieces, the integers resulting from the "humanization", and the function output), then fire the script, which could do whatever the user wants to alter the "proposed" output string (or do nothing, being empty by default), and finally return whatever is in that output string.

So, to recap this a bit, my new idea is:

1) Humanise should have an optional string parameter which is what the passed number is about.
2) Humanise should fire a "callback" script before returning, to allow the user to alter the function result.

Tkael commented 2 years ago

> Humanise should have an optional string parameter which is what the passed number is about.

I'm not sure that I understand yet how we would need to do this. Please elaborate on what you'd enter and how we would need to handle it?

> Humanise should fire a "callback" script before returning, to allow the user alter the function result.

Once again, I'm a little fuzzy on the details of what you are proposing. Are you saying that Humanise() would work a little like an event and trigger another script from the Speech Responder?

richardbuckle commented 2 years ago

> with a way to "declare" the plurals in crowdin (something like milione|milioni or milion{e|i} ?) it could work.

Oh you sweet summer child of a language where there is only one plural 😀

As one who speaks both Italian and Russian, let me introduce you to Slavic plurals, where the inflection depends upon the last word (not the last digit) of the number, e.g. Russian (in Latin alphabet):

one => nominative singular:
- 1 kg => odin kilogramm,
- 101 kg => sto odin kilogramm
two, three, four => genitive singular:
- 2 kg => dva kilogramma,
- 34 kg => tridtsat' chetyre kilogramma
anything else => genitive plural:
- 5 kg => pyat' kilogrammov,
- 12 kg => dvenadtsat' kilogrammov
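The rule in the list above can be sketched in a few lines of Python. This is an illustration only, assuming whole-number counts in the plain counting case; it does not decline the numeral itself, and the function name `russian_plural_form` and the `FORMS` table are invented for the example.

```python
def russian_plural_form(n):
    """Pick the noun form that follows the whole number n."""
    last_two = n % 100
    last_one = n % 10
    # 11 to 14 are single words (e.g. "dvenadtsat'"), so they take the
    # "anything else" form even though they end in the digits 1 to 4.
    if 11 <= last_two <= 14:
        return "genitive plural"
    if last_one == 1:
        return "nominative singular"
    if 2 <= last_one <= 4:
        return "genitive singular"
    return "genitive plural"


FORMS = {
    "nominative singular": "kilogramm",
    "genitive singular": "kilogramma",
    "genitive plural": "kilogrammov",
}

for n in (1, 101, 2, 34, 5, 12):
    print(n, FORMS[russian_plural_form(n)])
```

The special case for 11 to 14 is why the rule keys on the last word rather than the last digit: 12 ends in 2 but still takes "kilogrammov".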

Oh, and the cardinal numbers are themselves nouns and must be declined. The word for 'about' is 'okolo' and takes genitive case, so for example 'dva' becomes 'dvukh': 'about two kilograms' is 'okolo dvukh kilogrammov'.

Amazingly, Microsoft's default Russian TTS voice gets all the above right given just the left-hand side, so in the Russian translation the approach is to push as much work as possible to the TTS voice.

I bring this up not to dismiss the idea but to illustrate how incredibly hard it is to generalise.

I would certainly agree that Humanise() already has a lot of anglo-centric assumptions embedded in the very idea that just the number is sufficient as a parameter, but I am wary of going down the rabbit hole of trying to make it suit everyone's needs and failing anyhow.

parduz commented 2 years ago

> Humanise should fire a "callback" script before returning, to allow the user alter the function result. Once again, I'm a little fuzzy on the details of what you are proposing. Are you saying that Humanise() would work a little like an event and trigger another script from the Speech Responder?

EXACTLY!

> Humanise should have an optional string parameter which is what the passed number is about. I'm not sure that I understand yet how we would need to do this. Please elaborate on what you'd enter and how we would need to handle it?

Let me try with an example of what I envision:

You sold it for {Humanise(1534752,"Credits")}.

Humanise does its math and calls the "ReviewHumaniseOutput" script, which could access some variables like:

EDDI_Humanise_Parts[0] = about
EDDI_Humanise_Parts[1] = 1 million
EDDI_Humanise_Parts[2] = and a half
EDDI_Humanise_Parts[3] = Credits
EDDI_Humanise_Param[0] = 1534752
EDDI_Humanise_Param[1] = Credits
EDDI_Humanise_RoundValue = 1
EDDI_Humanise_Magnitude = 1000000
EDDI_Humanise_Output = about 1 million and a half

The user does what they want and changes the EDDI_Humanise_Output variable; the other variables give info about "what should be said". When the script ends, Humanise can return whatever is in the output string.

It seems to me the least "invasive", a pretty useful, and the most compatible solution. I may not see what other languages may need, but for sure this would allow me to have a nice "Italianise" with minimum effort.
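For readers who think better in code, the proposed flow could look roughly like this Python sketch. Everything here is hypothetical: the `humanise` stand-in handles only the "about N million and a half" shape from the example (the "and a half" is hard-coded), the `context` keys mirror the proposed EDDI_Humanise_* variables, and `review_italian` plays the role of the user's "ReviewHumaniseOutput" script.

```python
def humanise(number, unit=None, review=None):
    """Stand-in for Humanise that exposes its pieces to an optional callback."""
    magnitude = 1_000_000
    round_value = number // magnitude
    parts = ["about", f"{round_value} million", "and a half"]
    if unit:
        parts.append(unit)
    context = {
        "parts": parts,                 # EDDI_Humanise_Parts
        "params": [number, unit],       # EDDI_Humanise_Param
        "round_value": round_value,     # EDDI_Humanise_RoundValue
        "magnitude": magnitude,         # EDDI_Humanise_Magnitude
        "output": " ".join(parts[:3]),  # EDDI_Humanise_Output
    }
    if review is not None:
        review(context)  # the callback may overwrite context["output"]
    return context["output"]


def review_italian(context):
    """Hypothetical user callback: fix the singular-million case for Italian."""
    unit = context["params"][1]
    if context["round_value"] == 1 and context["magnitude"] == 1_000_000:
        tail = f" di {unit}" if unit else ""
        context["output"] = f"circa un milione e mezzo{tail}"


print(humanise(1534752, "Credits"))                         # about 1 million and a half
print(humanise(1534752, "crediti", review=review_italian))  # circa un milione e mezzo di crediti
```

With no callback the function returns its default string unchanged, which is what would keep the change compatible for languages that do not need it.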

parduz commented 2 years ago

This is my current "Italianise" script. It's "too young", so it is in a "beta" stage, but perhaps it explains what I need to do better than my poor English:

{_ 1000 _}

{set RegexStr to "(.+ )*([0-9]+\,[0-9]+|[0-9]+)( *(mila|.+?lione|.+?liardo)) *(e mezzo)*"}
{set theNumber to PassedNumber }
{set theUnit   to PassedUnit   }

{set Humanized to Humanise(theNumber)}

{set Italianized to match( Humanized, RegexStr )}

{if len(Italianized)=0 :
    {if find(Humanized,"000.000") > -1:
        {set Beginning     to Humanized }
        {set Quantity      to ""        }
        {set Magnitude     to ""        }
        {set AndAHalf      to ""        }
        {set BeforetheUnit to " di"     }
    |else:
        {_ dump match(Humanized, RegexStr)}
        {_ what else to do? return Humanized}
        {set Beginning     to Humanized }
        {set Quantity      to ""        }
        {set Magnitude     to ""        }
        {set AndAHalf      to ""        }
        {set BeforetheUnit to ""        }
    }
|else:
    {_ dump match(Humanized, RegexStr)}
    {_ Found }
    {set Beginning     to Italianized[1] }
    {set Quantity      to Italianized[2] }
    {set Magnitude     to Italianized[4] }
    {set AndAHalf      to Italianized[5] }
    {set BeforetheUnit to ""             }

    {if Quantity = "1" :
        {_ manage singular pronunciation _}
        {if Magnitude = "mila" :
            {if AndAHalf = "e mezzo" :
                {set Quantity to cat(Quantity,"500") }
            |else:
                {set Quantity to "mille"}
            }
            {set Magnitude to ""}
            {set AndAHalf to ""}
            {set BeforetheUnit to ""}
        |else:
            {set Quantity to " un"}
            {set BeforetheUnit to " di"}
        }
    |else:
        {_ manage plural _}
        {if Magnitude = "mila" :
            {if AndAHalf = "e mezzo" :
                {set Quantity to cat(Quantity,"500") }
                {set Magnitude to ""}
                {set AndAHalf to ""}
                {set BeforetheUnit to ""}
            }
        |else:
            {set Magnitude to slice(Magnitude,0,len(Magnitude)-1) }
            {set Magnitude to cat(Magnitude,"i") }
            {set BeforetheUnit to " di"}
        }
    }
}
{Beginning}{Quantity} {Magnitude} {AndAHalf}{if theUnit: {BeforetheUnit} {theUnit}}.

The whole regex part returns what I would like to have already set by the new Humanise, before firing the "callback" script.

HTH :)

Tkael commented 2 years ago

Hmm. Variables in Cottle are immutable, meaning that it would not be possible for the user to set {event.EDDI_Humanise_Output }. We'd have to use SetState() to set a variable and EDDI would need to know to read a specific value from the SetState dictionary.

In terms of complexity, you may be better off sticking with your Italianise script and calculating your values from the original number.

Here's an example of how you could calculate some of the critical values for Italianise from the raw value:

{set originalNumber to 54741887}

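{set value to originalNumber}
{_ Start the count at zero so the loop below does not rely on an unset variable }
{set magnitude to 0}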
{while value >= 10:
    {set magnitude to magnitude + 1}
    {set value to value / 10}
}
Magnitude: {magnitude},

{set orderMultiplier to round(pow(10, floor(magnitude / 3) * 3))}
Order Multiplier: {orderMultiplier},

{set firstNumber to floor(value)}
First Number: {firstNumber},

{set secondNumber to floor((value - firstNumber) * 10)}
Second Number: {secondNumber},

{set thirdNumber to floor((value - firstNumber - (secondNumber / 10)) * 100)}
Third Number: {thirdNumber}.

Humanized: {Humanise(54741887)}

From these calculated numbers, we know:

The magnitude is 7 (so in the tens of millions range, we might want to use 2 significant figures)
The order multiplier is 1000000 (so our unit will be millions)
The first number in the value is 5
The second number in the value is 4
The third number in the value is 7 (more than halfway to the next significant digit)

With 2 significant figures in the millions order and our third digit more than halfway to the next significant figure, we get a humanized value of "Over 54 and a half million".
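For anyone who wants to check the arithmetic outside EDDI, here is a rough Python equivalent of the calculation above. It is a sketch under two assumptions: the input is a positive whole number, and reading the digits from the string form is an acceptable substitute for the repeated division (it avoids floating-point rounding). The function name `humanise_inputs` is made up for the example.

```python
def humanise_inputs(number):
    """Return (magnitude, order multiplier, first, second, third digit)."""
    digits = str(number)                           # assumes a positive integer
    magnitude = len(digits) - 1                    # power of ten of the leading digit
    order_multiplier = 10 ** (magnitude // 3 * 3)  # 1, 1000, 1000000, ...
    padded = digits.ljust(3, "0")                  # short numbers get trailing zeros
    first, second, third = (int(d) for d in padded[:3])
    return magnitude, order_multiplier, first, second, third


print(humanise_inputs(54741887))  # (7, 1000000, 5, 4, 7)
```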

Hope that helps.

Tkael commented 2 years ago

Hmm... after going through the exercise above I think we might also be able to treat Humanise() as a special case of {F("Humanise")}, where we automatically set helpful values calculated from the original number and the translator does the rest using a Humanise script.

It would be another major re-write / disruption for translators but should be possible. Much of the work that has gone into humanizing values via CrowdIn strings would become obsolete.

@richardbuckle your thoughts?

richardbuckle commented 2 years ago

I think it would be important to get feedback from the other translation teams before embarking on such a radical overhaul. There are bound to be further language-specific issues that we are unaware of.

Tkael commented 2 years ago

I've sent a message to our proofreaders on CrowdIn to request additional feedback before we implement any changes.

yucatan commented 2 years ago

I have to say that it's not too hard to make the adjustments in the scripts to get the proper pronunciation in Portuguese. But I am not against such changes.

Transcan commented 2 years ago

I'm Spanish and in my case I had to write my own "Humaniza" function. The Spanish language has plurals and gender, and so does the spelling of numbers.

For example, 21 can be spelled as:

- veintiún: male, singular
- veintiuno: also male and singular, but used in some cases
- veintiunos: male, plural
- veintiuna: female, singular
- veintiunas: female, plural

And there is no regular rule. I mean, it is difficult to code even before counting all the exceptions, and even harder if you have to adapt the code to other languages.

Also, the main issue I encountered is that some voices don't read the numbers as they should (the gender doesn't match, for example). So I coded my own way to spell the numbers inside the new humaniza function. It converts the numbers to words and uses a flag for gender. This way, the voice will read it as I want.

The tricky part is the invocation, because you can't give parameters to a script directly. Things like {F('Humaniza', 12345, female)} don't work...

I placed this at the beginning of each script that needs it:

{_ Funcion humaniza() _}
{set humaniza(n, g) to:
    {SetState("humaniza", n)}
    {SetState("humaniza_femenino", g=true)}
    {F("Humaniza")}
    {return state.humaniza_resultado}
}

And invoke it just like a normal function:

has comprado {humaniza(item.amount, true)} toneladas de {item.name}.

Three state variables are used:

- humaniza is the number
- humaniza_femenino is a boolean for gender
- humaniza_resultado is the result, a text string that the Humaniza script sets

"1500000" going through my script returns "un millón y medio" while humanise() returns "1000000 y medio".

So my final words are that making the internal humanise() function some kind of "function editable by the user via script" is a nice idea. This way each language can make or adapt its own script, or use the default if that is enough.
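The calling convention used by that wrapper (write the inputs to shared state, run the script, read the result back) can be shown in a short Python sketch. This is an analogy, not EDDI code: the `state` dictionary stands in for the values set with SetState, `humaniza_script` stands in for the "Humaniza" script, and its tiny lookup table holds only two of the forms of 21 listed earlier as a placeholder for real number spelling.

```python
state = {}  # stands in for the values stored with SetState()


def humaniza_script():
    """Plays the role of the "Humaniza" script: inputs and output go via state."""
    number = state["humaniza"]
    female = state["humaniza_femenino"]
    # Placeholder table: only the singular forms of 21 are included here.
    lookup = {21: ("veintiún", "veintiuna")}
    male_form, female_form = lookup.get(number, (str(number), str(number)))
    state["humaniza_resultado"] = female_form if female else male_form


def humaniza(n, g):
    """Wrapper that hides the state juggling behind a normal function call."""
    state["humaniza"] = n
    state["humaniza_femenino"] = g is True
    humaniza_script()
    return state["humaniza_resultado"]


print(humaniza(21, True))   # veintiuna
print(humaniza(21, False))  # veintiún
```

Numbers missing from the placeholder table fall back to their digits, which is roughly the "use the default if that is enough" behaviour described above.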

Tkael commented 2 years ago

Thank you @yucatan and @Transcan for your feedback. I'll keep thinking about this. Also happy to hear from any other translators who haven't weighed in yet!

EDCD / EDDI

Make the Humanise function better #2294