alan-if / alan

ALAN IF compilers and interpreters
https://alanif.se

Special Handling of Directions and their Aliases #21

Closed tajmone closed 3 years ago

tajmone commented 3 years ago

@Thoni56, I want to raise a problem I've encountered with the ALAN Italian project: ALAN doesn't allow the same word to be shared between aliases, verbs, directions and special classes of known words (e.g. noise words). In short, I propose the following:

  1. Handle directions and their aliases separately from verbs and other special words, allowing duplicate entries.
  2. When the parser encounters a single-word input, assume it's a direction before trying to match it to a global verb.
  3. If the single-word input matches both a direction and a global verb, throw a disambiguation request.

The Problem

The problem mostly affects direction shorthands, which clash with other common verbs or special words. To illustrate, here's a table with all the main directions in Italian and English.

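| direction | short | English   | clashes with         |
|-----------|-------|-----------|----------------------|
| nord      | n     | north     |                      |
| nordest   | ne    | northeast |                      |
| est       | e     | east      | "e", the AND_WORD    |
| sudest    | se    | southeast |                      |
| sud       | s     | south     |                      |
| sudovest  | so    | southwest |                      |
| ovest     | o     | west      |                      |
| nordovest | no    | northwest | the "no" reply       |
| su        |       | up        |                      |
| giù       |       | down      |                      |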

Besides the above-mentioned clashes, which affect ALAN Italian in practice, the direction shorthands "se" and "o" could also clash with other Italian constructs: "se" means if, and "o" means or. Although I haven't encountered the latter conflicts in real IF games, they do highlight the extent to which direction shorthands can collide with other useful and common Italian words.

I couldn't come up with any practical way to provide direction shorthands in Italian adventures: every possible solution seems to solve some conflicts but introduce new ones. E.g. if I decide to include an extra letter:

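| direction | short | English   | clashes with               |
|-----------|-------|-----------|----------------------------|
| nord      | no    | north     | the "no" reply             |
| nordest   | nes   | northeast |                            |
| est       | es    | east      |                            |
| sudest    | ses   | southeast |                            |
| sud       | su    | south     | "su", the up direction     |
| sudovest  | sov   | southwest |                            |
| ovest     | ov    | west      |                            |
| nordovest | nov   | northwest |                            |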

I've tried many other solutions, but it simply doesn't seem possible to avoid clashes while using a coherent system. Since in most languages short words are commonly used particles, adverbs, etc., similar problems may well affect other locales too, and it might be a lucky exception that in English the direction shorthands don't clash with other words that are important to IF gameplay.

The Solution

From an IF point of view, directional commands always contain just the direction, so I was wondering if it would be possible to change how ALAN treats directions and their aliases, compared to other game objects.

If ALAN were to handle directions and their aliases as a separate category of words, when faced with a single word input the parser could first assume that it's a direction command, or an alias thereof, before attempting to match it with a verb.

This wouldn't solve the conflict between "no" the direction (northwest) and "no" the reply (to a yes/no question), but that conflict only arises if the "no" reply is implemented in an adventure as a bare reply. If it is implemented as reply no or reply yes, there is no conflict, since the reply is no longer a single-word input.

As for "e" (short for east) conflicting with "e" the AND_WORD, if the parser were to first consider the possibility of it being a direction alias, the conflict would be solved. Besides, it wouldn't make sense to use an AND word in an input sentence with fewer than three words.

I'm not sure how complex these changes would be, or whether they might have an unexpected impact on backward compatibility, but having a separate list for direction words and their aliases, and the parser attempting to first match a directional command for single-word inputs, seems a reasonable change; also, the parser could always check if there's also a same-worded verb and throw a disambiguation request if this was the case — but as mentioned above, authors could easily avoid these edge cases by formulating fuller verbs (e.g. reply no, answer no).

As for other direction commands, like go no, the parser already strips the NOISE_WORD "go" from the input, and the same can be achieved in any locale — also, I believe most players will type just the direction anyhow.
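The direction-first lookup proposed above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the word tables, the `InputKind` enum and `classify_single_word` are hypothetical names, not part of the ALAN interpreter, and a real implementation would consult the game's dictionary instead of fixed arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical word tables; a real game would supply these from its dictionary. */
static const char *const DIRECTIONS[] = { "n", "ne", "e", "se", "s", "so", "o", "no", "su" };
static const char *const VERBS[] = { "guarda", "inventario", "no" };

typedef enum { INPUT_UNKNOWN, INPUT_DIRECTION, INPUT_VERB, INPUT_AMBIGUOUS } InputKind;

/* Linear search of a word table; returns 1 if `word` is present. */
static int contains(const char *const *table, size_t count, const char *word) {
    for (size_t i = 0; i < count; i++)
        if (strcmp(table[i], word) == 0)
            return 1;
    return 0;
}

/* Classify a single-word input: directions are looked up in their own table,
   and a word that is also a global verb is reported as ambiguous. */
InputKind classify_single_word(const char *word) {
    assert(word != NULL);
    int is_direction = contains(DIRECTIONS, sizeof DIRECTIONS / sizeof DIRECTIONS[0], word);
    int is_verb = contains(VERBS, sizeof VERBS / sizeof VERBS[0], word);
    if (is_direction && is_verb) return INPUT_AMBIGUOUS;
    if (is_direction) return INPUT_DIRECTION;
    if (is_verb) return INPUT_VERB;
    return INPUT_UNKNOWN;
}
```

With these tables, "e" is classified as a direction, while "no" (both northwest and a bare reply verb) triggers the disambiguation case.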

thoni56 commented 3 years ago

Thanks, @tajmone, for a very detailed description. I know you often think in broad and general terms, but I wish you had started with a few very concrete examples of the problem ;-) It took me some time to examine and understand the tables, but some guesswork and induction later I think I have gotten to the core of your issue, so let me know if I'm completely off track, right?

Command Parser

The Alan command parser is not a complete natural language parser, so there are of course some shortcuts taken that prevent some constructs from being parsed correctly or even constructed. Allowing some combinations of word usages but not others is one such shortcut.

Word Classes and Clashes

So I'm guessing these are the things that you are referring to that prevent you from doing what you wanted (again, would really like concrete examples):

Some word types are possible to combine, some are not. A quick look in the compiler sources shows these are the ones it prohibits (this should be a valuable addition to the manual, too):

Other combinations are (presumably) allowed. E.g. a quick check shows that a directional word works perfectly as an adjective. So again, I would appreciate some more concrete examples of what does not work the way you want, or is not possible.

There are also errors for redefinition of identifiers in the Alan game source. But I think those are not a part of the problem here as they are "author symbols", not "player words". Note that there is a small overlap here though, since directions automatically define themselves as directional player words. (Actually I don't think they really need to be "author symbols" since the author cannot use them for anything. Or am I forgetting something here...)

Directional Command Precedence

The core of the parsing algorithm is:

    /* Verb words are tested first... */
    if (isVerbWord(currentWordIndex)) {
        parseVerbCommand(parameters, multipleParameters);
        action(current.verb, parameters, multipleParameters);
    /* ...and direction words only if the current word is not a verb */
    } else if (isDirectionWord(currentWordIndex)) {
        clearParameterArray(previousMultipleParameters);
        clearPronounList(pronouns);
        handleDirectionalCommand();
    }

If you think directional commands should have precedence it would (probably) be a small change. I have not tried it yet so I don't know the impact of such a change w.r.t. game play compatibility. I could easily do that with the games that are in the test suite, but first I'd like to see the reasoning behind your preference.

Response

Is this a problem with synonyms only? Assuming you could not use any shorthands, synonyms, at all (and all players were content with typing full words all the time), would there still be a problem?

I think this is an important question to answer, because if not, then we can focus on the problem with not being able to define synonyms in a good way, and not be derailed by the red herring of directions ;-)

tajmone commented 3 years ago

I know you often think in broad and general terms, but I wish you had started with a few very concrete examples of the problem ;-) It took me some time to examine and understand the tables, but some guesswork and induction later I think I have gotten to the core of your issue, so let me know if I'm completely off track, right?

Sorry for that, it was because I wasn't quite sure of the underlying mechanisms, on the one hand, and the correct terminology to adopt on the other.

A practical example would be trying to define the Italian directions as in the above table (first one, first column), and then creating synonyms (as in 2nd column) — here is how ALAN Italian defines directions and their synonyms:

THE limbo ISA LOCATION
  EXIT
    nord,
    sud,
    est,
    ovest,
    nordest,
    sudest,
    nordovest,
    sudovest,
    su,
    giù,
    'in', --> for "dentro" (synonym)
    fuori

    TO limbo.
END THE limbo.

SYNONYMS
    nest    = nordest.
    sest    = sudest.
    novest  = nordovest.
    sovest  = sudovest.
    sopra   = su.
    giu     = giù.

and this is the definition of the Italian AND_WORDs:

SYNONYMS e, poi = 'and'.

If I were to follow the direction shorthands proposed in the table, it wouldn't compile due to the reported conflicts (last column).

You can't have words that are directional words that are also some other types of words (which type were you specifically thinking about?)

Well, various types: AND_WORD (which currently can only be defined via Synonyms, since ALAN doesn't expose their direct definitions), verb IDs and object names, and verb parameters.

Other combinations are (presumably) allowed. E.g. a quick check shows that a directional word works perfectly as an adjective.

If we can't define a direction shorthand via Synonyms, then it means that every time an Exit is defined the author needs to insert the dual definition (full name, plus shorthand), which is not practical and could easily lead to authors forgetting to add the shorthand, which would result in inconsistent directional commands during game-play.

Is this a problem with synonyms only? Assuming you could not use any shorthands, synonyms, at all (and all players were content with typing full words all the time), would there still be a problem?

Well, the problem would be that players won't be willing to type the full words all the time, because directions are quite verbose in Italian, and playing would become quite tiresome. Also, IF players have strong expectations that all adventures should share very similar command conventions, regardless of the system they were created with, so chances are that they will automatically keep using the shorthands they are accustomed to (over and over again).

I think this is an important question to answer, because if not, then we can focus on the problem with not being able to define synonyms in a good way, and not be derailed by the red herring of directions ;-)

Honestly, I've never fully understood the logic behind why ALAN sometimes lets you use the same words in some constructs but not in others (I didn't find much explanation of this in the Manual).

Another practical example...

The preposition from in Italian can take many forms, depending on number and gender, but to simplify syntax definitions only the simple form is used ("da") and all other variants are just synonyms:

SYNONYMS
  dal, dallo, dalla, 'dall''', dall, dagli, dalle  = da.

but in the above definition I had to omit one specific form, "dai", because it would clash with the common syntax for the verb "dai" (give obj to npc):

SYNTAX dai_a = dai (ogg) a (png)

which, in practical terms, means that every time authors define a syntax which involves the from preposition, a special variant needs to be added to allow usage of "dai":

--- climb down from (surface):

SYNTAX scendi_da = scendi da  (superficie)
       scendi_da = scendi dai (superficie).

Again, all this is because if I defined preposition "dai" as a synonym of "da" (as it should be) then I wouldn't be able to implement the verb "dai" (give) because it wouldn't compile.

So, the problem with synonyms not allowing other types of same-named IDs or words spans the whole system and is not specific to directions. But for directions I simply couldn't find a solution, whereas for prepositions it's possible to work around the problem by burdening authors with manually writing syntax alternatives in each verb (not elegant, but not a tragedy either).

I hope this might help clarify the problem to you — since it's not fully clear to me I struggle to define it well (it's like a black-box to me, right now).

thoni56 commented 3 years ago

Thank you, Tristano! Those are good examples and show that my feeling of this being primarily a synonyms problem was (probably) right.

Is this a problem with synonyms only? Assuming you could not use any shorthands, synonyms, at all (and all players were content with typing full words all the time), would there still be a problem?

Well, the problem would be that players won't be willing to type the full words all the time, because directions are quite verbose in Italian, and playing would become quite tiresome. Also, IF players have strong expectations that all adventures should share very similar command conventions, regardless of the system they were created with, so chances are that they will automatically keep using the shorthands they are accustomed to (over and over again).

This was not meant as a suggestion, but as a way to get to the core of the problem by stating a hypothetical situation. I certainly know that not using shorthands is not viable. I was hoping that you could imagine that it was possible... But the rest of what you have answered has given me the information I hoped for anyway...

There are a couple of nuances here that cloud the vision, but I think we need to focus first on the problem that you cannot define words that are synonyms and another type of word.

I don't know how versed you are in parsing techniques, but the simple answer to why synonyms become a problem, is, as it is expressed in the manual, "synonyms are always interchangeable". The interpreter knows about synonyms, and if one is encountered in the player input by the interpreter, it always replaces it with the "original" word. In short, this is done already in the "scanning" phase in the interpreter. Then the actual parsing of the command starts.
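To make the consequence of that scanning-phase substitution concrete, here is a minimal sketch in C. It is not the interpreter's actual code: the `Synonym` struct, the table contents and `scan_word` are hypothetical, and the table deliberately includes the problematic entry "dai" = "da" from the examples above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One substitution rule: every occurrence of `synonym` becomes `original`. */
typedef struct { const char *synonym; const char *original; } Synonym;

/* Hypothetical table, including the entry that the compiler currently rejects. */
static const Synonym SYNONYMS[] = {
    { "dal", "da" }, { "dalla", "da" }, { "dai", "da" },
};

/* Scanning phase: return the word the parser will actually see.
   No parsing context is available here, so the replacement is unconditional. */
const char *scan_word(const char *word) {
    assert(word != NULL);
    for (size_t i = 0; i < sizeof SYNONYMS / sizeof SYNONYMS[0]; i++)
        if (strcmp(SYNONYMS[i].synonym, word) == 0)
            return SYNONYMS[i].original;
    return word;
}
```

Under these assumptions, the player input "dai mela a mario" would reach the parser as "da mela a mario", so a verb named "dai" could never be matched.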

By re-imagining the handling of synonyms it might be possible to at least alleviate the problem. I think there are alternatives to the current approach where synonyms are propagated to the interpreter. E.g. it might be possible to make synonyms strictly a concern for the compiler, not propagating synonyms to the interpreter. This is just an early idea, so no commitments yet. If this is possible it would mean that the compiler just creates extra words of the correct types (which would in principle be the same as expanding the original word with all its synonyms in the syntax, exit or wherever the definition was made).

I'll try to construct an example using the problematic da/dai case, so that we have something concrete to discuss, and see where that leads us. I'll get back to you.

thoni56 commented 3 years ago

I've created a gist that tries to mimic an Italian game with the example of "dai" not being possible to use as a synonym for "da".

My knowledge of Italian grammar is not good enough to figure out an example noun that would warrant the use of "dai" in the guarda_fuori_da verb. Could you help with that, please?

Once we have that I can experiment with various solutions to the problem that it is not allowed to add "dai" to the synonyms for "da".

(For my curiosity, are most of those synonyms also contractions? I was thinking that e.g. "dalla" would be a contraction for "da la" when the noun is feminine, because you don't use the definite article in that context, right? If so, would "dai" be used when the noun is plural? Kind of guessing here, but interested since I have a long term plan to learn Italian...)

tajmone commented 3 years ago

I don't know how versed you are in parsing techniques, but the simple answer to why synonyms become a problem, is, as it is expressed in the manual, "synonyms are always interchangeable".

I have a general understanding of parsing techniques, but fail to understand the details of everything that deals with manipulating the generated AST (still studying the topic, and finding it hard).

The interpreter knows about synonyms, and if one is encountered in the player input by the interpreter, it always replaces it with the "original" word. In short, this is done already in the "scanning" phase in the interpreter. Then the actual parsing of the command starts.

I was aware of that, but I still don't know the details of how ALAN stores the different types of words (object names, verb IDs, <PLAYER>_WORDS, directions, etc.). I'm assuming that these are stored in separate lists or maps (and I vaguely remember that this is the case, from peeking at the structure of story files).

What I fail to understand is why some duplicate words are allowed, while other types are not. Also, most IF parsers allow directions with the same names as other object types, so I believe that usually directions are handled as a special case of one-word player inputs in most IF systems.

I think there are alternatives to the current approach where synonyms are propagated to the interpreter. E.g. it might be possible to make synonyms strictly a concern for the compiler, not propagating synonyms to the interpreter. [...] it would mean that the compiler just creates extra words of the correct types

This sounds like a significant bloat in the generated story files. Keep in mind how many prepositions there are in languages like Italian, which have variations based on gender, noun and number (GNA).

I originally thought that the interpreter manages all these synonyms as independent lists/maps, one for each different word category/type, and that the parser would handle alternative interpretation by some disambiguating mechanism, with some scores concerning the likelihood of each input being the pertinent one.

Also, I was wondering if The Inform Designer’s Manual (aka DM4) might contain some clues in this respect, for it usually contains very detailed accounts of how the various parts of IF systems have to deal with the challenges of the variety of different languages, and usually does so with very practical examples. It's thanks to DM4 that I learned how the GNA system works, which guided me in laying the foundations of the ALAN Italian module. So, it might be worth looking into it for inspiration, especially regarding the potential pitfalls of languages which we don't know but for which end users might wish to create a new i18n ALAN library in the future.

thoni56 commented 3 years ago

What I fail to understand is why some duplicate words are allowed, while other types are not. Also, most IF parsers allow directions with the same names as other object types, so I believe that usually directions are handled as a special case of one-word player inputs in most IF systems.

Basically the interpreter dictionary structure for "real words" combines information for multiple types in one structure which also indicates which types of word it can be. This e.g. allows a noun and an adjective to occupy "the same data space". Obviously we also need to store both the strings otherwise we can't recognize them in the player input.

As I mentioned above, synonyms are a completely different breed of words. A synonym is not a "word class" but a simple-minded substitution that is carried out before command parsing begins. And since synonym replacement is a completely separate phase with no parsing information available, words that are synonyms cannot be any other type of word, because they would always be replaced by their "original" word. So the compiler currently prohibits this.

This is why I currently think that getting rid of synonyms substitution in the interpreter would be a good idea.
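As a rough sketch of that alternative, the C fragment below models a dictionary entry that carries a bit mask of word classes, and a compile-time synonym that simply inherits the classes of its original instead of being substituted at run time. Everything here is an assumption made for illustration: the `WC_*` flags, `DictEntry`, `add_word` and `add_synonym` are hypothetical names and do not reflect the real story-file layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Word classes as bit flags, so one dictionary entry can belong to several. */
enum { WC_VERB = 1 << 0, WC_PREPOSITION = 1 << 1, WC_DIRECTION = 1 << 2,
       WC_NOUN = 1 << 3, WC_ADJECTIVE = 1 << 4 };

typedef struct { const char *text; unsigned classes; } DictEntry;

#define MAX_WORDS 32
static DictEntry dictionary[MAX_WORDS];
static size_t word_count = 0;

/* Find an entry by its text, or return NULL if it is not in the dictionary. */
static DictEntry *find_word(const char *text) {
    for (size_t i = 0; i < word_count; i++)
        if (strcmp(dictionary[i].text, text) == 0)
            return &dictionary[i];
    return NULL;
}

/* Add `classes` to the entry for `text`, creating the entry if needed. */
DictEntry *add_word(const char *text, unsigned classes) {
    DictEntry *entry = find_word(text);
    if (entry == NULL) {
        assert(word_count < MAX_WORDS);
        entry = &dictionary[word_count++];
        entry->text = text;
        entry->classes = 0;
    }
    entry->classes |= classes;
    return entry;
}

/* Compile-time synonym: the new word inherits the classes of the original,
   on top of any classes it already has. */
DictEntry *add_synonym(const char *synonym, const char *original) {
    DictEntry *target = find_word(original);
    assert(target != NULL);
    return add_word(synonym, target->classes);
}
```

In this model, registering "dai" as a verb and then as a synonym of the preposition "da" leaves a single entry flagged as both verb and preposition, and the parser can pick the reading that fits the position in the command.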

This sounds like a significant bloat in the generated story files. Keep in mind how many prepositions there are in languages like Italian, which have variations based on gender, noun and number (GNA).

It would not. Remember that the bulk of the information in a game is the texts, including the strings in the dictionary. The number, and size, of such strings would actually be exactly the same in both cases since the same strings would have to be stored. Also in some cases having a synonym expand into two words would avoid a "syntax synonym" which is more expensive.

And we are talking about a few bytes per word here, so even if we did just add the dictionary information for them (disregarding the strings) it would add a few kB in a game that is probably at least 1MB (Wyldkynd) in size. I'm not sure how big the current synonym table that is a part of the interpreter data is, but that would also not be needed anymore.

Also, I was wondering if The Inform Designer’s Manual (aka DM4) might contain some clues in this respect, for it usually contains very detailed accounts of how the various parts of IF systems have to deal with the challenges of the variety of different languages, and usually does so with very practical examples.

That's a good pointer. I have "always" wanted to read the DM, but never got around to it. Maybe it is time.

That said, I don't want to inflate this issue to "let's make Alan's command parsing fit any language". If there is a problem that can be fixed with reasonable effort, I'm all for it. If we don't see an example of a potential problem in the wild (yet), let's leave it for now. That's why I'm more interested in the concrete cases.

tajmone commented 3 years ago

My knowledge of Italian grammar is not good enough to figure out an example noun that would warrant the use of "dai" in the guarda_fuori_da verb. Could you help with that, please?

Here are some additions that can fit into the original example:

Syntax scendi_da = scendi da (ogg).
       scendi_da = scendi dai (ogg).

Add To Every object
  -- 'scendi da' -> 'climb down from'
  Verb scendi_da
    Does
      "Scendi dai" Say ogg. "."
  End Verb.
End Add.

-- 'tubi' -> 'pipes' [masc. plur. -> "i tubi", "dei tubi" (some pipes)]
The tubi Isa object At l
  Definite Article "i"

It's worth noting here that "dei tubi" (some pipes, when described with indefinite article, e.g. "You can see some pipes") also conflicts with "dei" as in belonging to, e.g. "Il libro dei maestri" (the teachers' book).

In this case "dei" is one of the Synonyms of the di preposition, but I didn't encounter any conflicts with this so far, except if the author defines a dei entity (gods, plural form of "dio").

(For my curiosity, are most of those synonyms also contractions? I was thinking that e.g. "dalla" would be a contraction for "da la" when the noun is feminine, because you don't use the definite article in that context, right? If so, would "dai" be used when the noun is plural?

They indeed are, but in modern Italian you can't use their uncontracted form anymore, unless you're referring to a name that starts with a definite article — e.g. "La Stampa" (a famous Italian newspaper), you don't say "della La Stampa" but "de La Stampa", which sounds very odd and archaic, in fact many people today would just say/write "della Stampa" instead (but it's incorrect).

The complex part here is that some nouns that have the same GNA can take different articles, based on their initial syllable. E.g. "giganti" and "studenti" (giants and students, both masc. plur.):

| art/prep    | giganti     | studenti       |
|-------------|-------------|----------------|
| the         | i giganti   | gli studenti   |
| of the/some | dei giganti | degli studenti |
| from the    | dai giganti | dagli studenti |
| to the      | ai giganti  | agli studenti  |
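The pattern in the table can be reproduced with a small C sketch. It covers only masculine plural nouns written in lowercase ASCII and uses a simplified version of the usual rule ("gli" before a vowel, z, s + consonant, gn or ps; "i" otherwise). The function names are hypothetical, and the `stem` argument is the part of the contracted preposition that precedes the article ("de" for di, "da" for da, "a" for a, or an empty string for the bare article).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Does a masculine plural noun take "gli" rather than "i"? (simplified rule) */
int takes_gli(const char *noun) {
    assert(noun != NULL && noun[0] != '\0');
    char first = noun[0], second = noun[1];
    if (strchr("aeiou", first)) return 1;                 /* vowel */
    if (first == 'z') return 1;                           /* z */
    if (first == 's' && second != '\0' && !strchr("aeiou", second))
        return 1;                                         /* s + consonant */
    if (strncmp(noun, "gn", 2) == 0 || strncmp(noun, "ps", 2) == 0)
        return 1;                                         /* gn, ps */
    return 0;
}

/* Write e.g. "dai giganti" or "dagli studenti" into `out`. */
void articulated(const char *stem, const char *noun, char *out, size_t size) {
    snprintf(out, size, "%s%s %s", stem, takes_gli(noun) ? "gli" : "i", noun);
}
```

For example, `articulated("da", "studenti", buf, sizeof buf)` yields "dagli studenti", while the same call with "giganti" yields "dai giganti", which is exactly the "dai" form that clashes with the verb.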

Kind of guessing here, but interested since I have a long term plan to learn Italian...)

Well don't hesitate to ask if you need help. Also, if you like I could send you some Italian magazines, comics and books, which are helpful tools to learn the language (especially magazines, which have photos, captions and titles, which help grasping the context intuitively).

This excellent article by Max Bianchi (aka torredifuoco) covers the topic from an IF point of view, and in an excellent manner:

Note that I had written some Wiki pages on the various problems and techniques in porting ALAN to Italian, a long time ago, all of which are thoroughly commented, with source examples and external links:

Even if they are a bit old, and might not match 100% the current state of the Italian module, the key concepts are still the same and valid.

tajmone commented 3 years ago

And we are talking about a few bytes per word here, so even if we did just add the dictionary information for them (disregarding the strings) it would add a few kB in a game that is probably at least 1MB (Wyldkynd) in size.

I thought that in Syntax definitions only the initial word would be thus stored (for lookup), and that the other interspersed words would be stored as strings. E.g. in:

syntax climb_out_of = climb out of (obj).

Does every word get stored, i.e. "climb", "out" and "of"?

This is why I currently think that getting rid of synonyms substitution in the interpreter would be a good idea.

How would the new system work on the authors' side? Would it be just as before, except for the fact that synonyms would now be handled by the compiler instead of the interpreter?

I often wondered whether having a special notation for inline text-alternatives might mitigate the problem on the author's side. E.g.:

Syntax scendi_da = scendi [da|dai] (ogg).

This could be just syntactic sugar to avoid having to define a synonym separately, or work in some other way depending on its occurring context. Inform7 uses a similar notation for accepted text alternatives, which is very intuitive.
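If the bracket notation were pure syntactic sugar, the compiler could expand it into ordinary syntax lines before any further processing. The sketch below shows one way to do that for a single group; the `[a|b]` notation is only the proposal made here and not existing ALAN syntax, and `expand_alternatives` and `LINE_LEN` are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LINE_LEN 128

/* Expand the first "[a|b|...]" group in `pattern` into one line per alternative.
   Writes at most `max` lines into `out` and returns how many were written. */
size_t expand_alternatives(const char *pattern, char out[][LINE_LEN], size_t max) {
    assert(pattern != NULL && max > 0);
    const char *open = strchr(pattern, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    if (close == NULL) {            /* no group: the pattern is its own expansion */
        snprintf(out[0], LINE_LEN, "%s", pattern);
        return 1;
    }
    size_t count = 0;
    const char *alt = open + 1;     /* start of the current alternative */
    while (alt <= close && count < max) {
        const char *end = (const char *)memchr(alt, '|', (size_t)(close - alt));
        if (end == NULL) end = close;
        /* prefix before '[' + current alternative + suffix after ']' */
        snprintf(out[count++], LINE_LEN, "%.*s%.*s%s",
                 (int)(open - pattern), pattern,
                 (int)(end - alt), alt,
                 close + 1);
        alt = end + 1;
    }
    return count;
}
```

Given "scendi [da|dai] (ogg)", this produces "scendi da (ogg)" and "scendi dai (ogg)", the same two lines an author currently has to write by hand.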

That's a good pointer. I have "always" wanted to read the DM, but never got around to it. Maybe it is time.

It's really worth it, because it gathers real-case examples from the various languages for which there are actually many IF games (English, French, German, Italian, Spanish, etc.), contributed by the IF community over decades of Inform development. So it does contain some rare i18n gems which are based on practical examples and development strategies. Also, Nelson is very good at explaining things, both from a linguistic and grammatical point of view and from the implementation perspective.

Enable Discussions and Move This Issue There

PS: I think you should enable Discussions on this repository, so we can move Issues which have grown too long there and keep Issues uncluttered, using them only for practical maintenance/dev tasks. Discussions are a fairly new feature, but you can now freely convert any Issue into a Discussion, and you can manage different categories too, which makes Discussions easier to navigate. An added benefit of Discussions is that they are more forum-like, allowing structured replies, whereas Issues are linear, making it hard to keep related posts together.