DavidKinder / Inform6

The latest version of the Inform 6 compiler, used for generating interactive fiction games.
http://inform-fiction.org/

Grammar version 3 #277

Closed heasm66 closed 5 months ago

heasm66 commented 6 months ago

Work on this has already started, but I'm including the definition here for later reference from the pull request.

GV3 is a variant of GV2 with a more compact data structure.  GV3 uses only
2 bytes for each token and removes the need for the ENDIT marker.  In GV3
an individual grammar table has the format:

    <number of grammar lines>          1 byte

followed by that many grammar lines.  A grammar line has the form:

    <action number>  <token 1> ... <token N>
    ----2 bytes----  -2 bytes-     -2 bytes-

The action number is actually contained in the bottom 10 bits of the word
given first: the top five bits contain the number of tokens in this grammar
line, which leaves

    action_number & $400

as a bit meaning "reverse the order of the first and second parameters
if this action is to be chosen".
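
The split of the first word can be sketched in Python as below. This is only an illustration of the layout just described, not the compiler's code: the name `decode_line_header` is made up for the example, and the word is assumed to be stored high byte first.

```python
from typing import Tuple


def decode_line_header(hi: int, lo: int) -> Tuple[int, bool, int]:
    """Split the first word of a GV3 grammar line into its three fields.

    hi and lo are the two header bytes, high byte first (an assumption).
    Returns (action_number, reversed_flag, token_count).
    """
    word = (hi << 8) | lo
    action_number = word & 0x03FF          # bottom 10 bits
    reversed_flag = (word & 0x0400) != 0   # bit 10: swap first and second parameters
    token_count = (word >> 11) & 0x1F      # top 5 bits: 0..31 tokens
    return action_number, reversed_flag, token_count


# 3 tokens, reversed, action 5: (3 << 11) | 0x400 | 5 == 0x1C05
print(decode_line_header(0x1C, 0x05))  # (5, True, 3)
```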

There can be anything from 0 to 31 tokens, and each occupies two bytes, 
arranged as:

    <token type>   <token data>
    -- byte ----   --- byte ---

Token type bytes are divided into the top two bits, the next two and the
bottom four.

The "next two bits" are used to indicate alternatives.  In a sequence of
tokens

    T1 / T2 / T3 / ... / Tn

then T1 will have $$10 in its "next two bits", and each of T2 to Tn will
have $$01.  Tokens not inside lists of alternatives always have $$00.  (Note
that at present only prepositions are allowed as alternatives, but the
format is designed to open the possibility of extending this to all tokens.)

The bottom four are the "type" of the token.  The top two indicate what kind
of data is contained in the token data.  Strictly speaking this could be
deduced from the bottom six bits, but it's convenient for making backpatching
GV3 tables a simple matter within the compiler.

    Type  Means                       Data contains              Top bits
    0     illegal (never compiled)
    1     elementary token            0   "noun"                 00
                                      1   "held"
                                      2   "multi"
                                      3   "multiheld"
                                      4   "multiexcept"
                                      5   "multiinside"
                                      6   "creature"
                                      7   "special"
                                      8   "number"
                                      9   "topic"
    2     'preposition'               adjective number           01
    3     noun = Routine              parsing-routine-number     10
    4     attribute                   attribute number           00
    5     scope = Routine             parsing-routine-number     10
    6     Routine                     parsing-routine-number     10
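
As a rough illustration of the table above, a token's two bytes could be decoded like this in Python. The function name `decode_token` and the description strings are invented for the example; only the bit positions and the type numbers come from the text.

```python
from typing import Tuple

ELEMENTARY = ["noun", "held", "multi", "multiheld", "multiexcept",
              "multiinside", "creature", "special", "number", "topic"]


def decode_token(type_byte: int, data_byte: int) -> Tuple[int, int, int, str]:
    """Return (top_bits, alternative_bits, token_type, description)."""
    top_bits = (type_byte >> 6) & 0b11   # kind of data held in the data byte
    alt_bits = (type_byte >> 4) & 0b11   # $$10 first alternative, $$01 later ones
    token_type = type_byte & 0b1111      # bottom four bits: the token type

    if token_type == 1 and data_byte < len(ELEMENTARY):
        description = ELEMENTARY[data_byte]
    elif token_type == 2:
        description = "preposition #%d" % data_byte
    elif token_type == 3:
        description = "noun=Routine #%d" % data_byte
    elif token_type == 4:
        description = "attribute %d" % data_byte
    elif token_type == 5:
        description = "scope=Routine #%d" % data_byte
    elif token_type == 6:
        description = "Routine #%d" % data_byte
    else:
        description = "illegal"
    return top_bits, alt_bits, token_type, description


# A preposition that starts a list of alternatives: $$01 10 0010 == 0x62
print(decode_token(0x62, 4))  # (1, 2, 2, 'preposition #4')
```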

GV3 identifies a particular preposition or parsing-routine using a numbering system.  
GV3 numbers parsing-routines upwards from 0 to 255, in order of first use.  
A separate table translates these into routine packed addresses: the 
"preactions" table.  The preactions table is a simple --> array.

Prepositions are also identified by their "adjective number".  Adjective
numbers count upwards from 0 to 255, in order of first use.  They are 
translated back into dictionary words using the "adjectives table".

The adjectives table starts with two bytes containing the number of
"adjectives" in the table. Each "adjective" entry is then two bytes:

    <dictionary address of word>
    ----2 bytes-----------------

The constant #adjectives_table refers to this table.
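
Reading the adjectives table from a story-file image might be sketched as follows in Python. The name `read_adjectives_table` is hypothetical and the words are assumed to be stored high byte first. How an adjective number selects an entry is not spelled out in this description, so the sketch only extracts the entries in table order.

```python
from typing import List


def read_adjectives_table(data: bytes, offset: int = 0) -> List[int]:
    """Return the dictionary addresses stored in a GV3 adjectives table."""
    # First two bytes: number of entries (assumed big-endian).
    count = int.from_bytes(data[offset:offset + 2], "big")
    entries = []
    for k in range(count):
        start = offset + 2 + 2 * k
        # Each entry is the two-byte dictionary address of a word.
        entries.append(int.from_bytes(data[start:start + 2], "big"))
    return entries


# A table with two entries, at dictionary addresses $1234 and $5678.
table = bytes([0x00, 0x02, 0x12, 0x34, 0x56, 0x78])
print([hex(address) for address in read_adjectives_table(table)])
```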

As in GV2, fake actions in GV3 are numbered from 4096 upwards.

Note that although GV3 reintroduces the preactions and adjectives tables,
the omission of the ENDIT marker and the use of two-byte tokens instead of
three-byte ones should produce a more economical grammar table.
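
A back-of-the-envelope way to check this, sketched in Python with made-up function names. The GV2 figures assume a one-byte ENDIT marker, which comes from the GV2 format rather than from the text above; the table overhead follows the layouts described earlier (two bytes per preactions entry, a two-byte count plus two bytes per adjective).

```python
def gv2_line_size(tokens: int) -> int:
    """GV2 line: 2-byte action word, 3 bytes per token, 1-byte ENDIT (assumed)."""
    return 2 + 3 * tokens + 1


def gv3_line_size(tokens: int) -> int:
    """GV3 line: 2-byte action word, 2 bytes per token, no ENDIT."""
    return 2 + 2 * tokens


def gv3_table_overhead(prepositions: int, routines: int) -> int:
    """Extra bytes for the adjectives table (with its count) and preactions table."""
    return (2 + 2 * prepositions) + 2 * routines


# A line with 3 tokens shrinks from 12 bytes to 8, a saving of tokens + 1.
print(gv2_line_size(3), gv3_line_size(3))  # 12 8
```

Under these assumptions each grammar line saves one byte per token plus one, so GV3 comes out ahead once the per-line savings exceed the one-off cost of the two tables.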

Comparison table between the different grammar versions:

                                        Limit in: 
                                        GV1    GV2          GV3

    Prepositions per game               76     unlimited    256
    Parsing routines (general ones,
       noun= filters, scope= routines
       all put together) per game       32     unlimited    256
    Tokens per grammar line             6      unlimited    31
    Actions per game                    256    1024         1024
    Inform verbs per game               256    256          256

EDIT: Fixed the text for action-number as four bits for token-count, 2 bits for flag and 10 bits for action-number. EDIT 2: Removed text about the meta-flag. It has been moved to a separate issue.

erkyrath commented 6 months ago

Discussion thread: https://intfiction.org/t/is-the-world-ready-for-a-new-inform6-grammar-format/67391

I am minded to split out the "meta flag in grammar line" feature to an independent compiler setting.

heasm66 commented 6 months ago

I've changed the specification in the first post, removing the meta-flag because it has been moved to a separate issue.

To test GV3 with the Standard Library 6.12.6, AnalyseToken and UnpackGrammarLine need to be changed to something like:

#Iftrue (Grammar__Version == 3);

[ AnalyseToken token;
    found_ttype = (token->0) & $$1111;
    found_tdata = (token->1);
    if (found_ttype == PREPOSITION_TT)
        found_tdata = #adjectives_table-->found_tdata;
    if (found_ttype == ROUTINE_FILTER_TT or GPR_TT or SCOPE_TT)
        found_tdata = #preactions_table-->found_tdata;
];

[ UnpackGrammarLine line_address i tokens;
    for (i=0 : i<32 : i++) {
        line_token-->i = ENDIT_TOKEN;
        line_ttype-->i = ELEMENTARY_TT;
        line_tdata-->i = ENDIT_TOKEN;
    }
    action_to_be = 256*(line_address->0) + line_address->1;
    action_reversed = ((action_to_be & $400) ~= 0);
    tokens = (line_address->0) / 8;  ! top five bits, read from the high byte to avoid signed 16-bit division
    action_to_be = action_to_be & $3ff;
    params_wanted = 0;
    for (i=0 : i<tokens : i++) {
        line_address = line_address + 2;
        line_token-->i = line_address;
        AnalyseToken(line_address);
        if (found_ttype ~= PREPOSITION_TT) params_wanted++;
        line_ttype-->i = found_ttype;
        line_tdata-->i = found_tdata;
    }
    return line_address + 2;
];

#Endif; ! Grammar__Version 3