alan-if / alan-i18n

ALAN Internationalization Project

Wrapping Bug for Spanish GNA #42

Open tajmone opened 2 years ago

tajmone commented 2 years ago

@thoni56, I've noticed an issue with how ALAN wraps transcripts.

For example, in ponibles_test.a3t the 's' of "puestos" is split onto the next line:

> x gemelos
Unos gemelos. Ambos llevan la misma ropa puesta. Los gemelos llevan puesto
s unos pantalones y unas botas.

This is the library code that prints "puestos":

  "llevas puest$$"
  If obj is femenina
    then "a"
    else "o"
  End If.
  If obj is plural then "$$s" End If.

The above is a typical example of how the Spanish library handles gender and number in various language constructs (adjectives, articles, verbs, etc.), by adding the 'a' or 'o' suffix depending on gender, and a final 's' if plural.
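As a rough illustration of that composition, here is a small Python sketch (not ALAN code). It mimics the three output pieces of the snippet above; the helper names `puesto_fragments` and `glue` are hypothetical, and `$$` is simply treated as "join with no space".

```python
def puesto_fragments(feminine, plural):
    """Return the output pieces the library snippet would emit (toy model)."""
    frags = ["puest$$", "a" if feminine else "o"]
    if plural:
        frags.append("$$s")  # the plural 's' is only emitted when needed
    return frags


def glue(frags):
    """Join the pieces; every boundary here carries a $$, so no spaces."""
    return "".join(f.replace("$$", "") for f in frags)


for feminine in (False, True):
    for plural in (False, True):
        print(glue(puesto_fragments(feminine, plural)))
```

Note that in the singular case the vowel is the last piece and carries no `$$`, which is exactly the situation in which the interpreter cannot tell whether more text will be glued on.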

The problem seems to be that when the sentence reaches "puesto" (column 75) ALAN decides it's time to wrap without checking whether the upcoming text contains a $$ (or punctuation) which might need to be joined with the current word (i.e. the one being parsed when ALAN decides to wrap).
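To make the suspected behaviour concrete, here is a toy Python model (not the actual ARun code, which is written in C). It assumes that output arrives one fragment at a time, that a leading `$$` means "no space before this fragment", and that the wrap decision is taken per fragment with no lookahead. The function name `naive_wrap` and the 25-column width are hypothetical, chosen only to keep the example short.

```python
def naive_wrap(fragments, width):
    """Wrap fragment by fragment, deciding immediately (toy model)."""
    lines, current = [], ""
    for frag in fragments:
        glued = frag.startswith("$$")
        text = frag[2:] if glued else frag
        sep = "" if glued or not current else " "
        if current and len(current) + len(sep) + len(text) > width:
            # No check whether 'text' must stay attached to the previous word.
            lines.append(current)
            current = text
        else:
            current += sep + text
    if current:
        lines.append(current)
    return lines


fragments = ["Los", "gemelos", "llevan", "puest", "$$o", "$$s",
             "unos", "pantalones"]
for line in naive_wrap(fragments, 25):
    print(line)
```

In this model the first line fills up exactly at "puesto", so the glued 's' is pushed onto the next line, which matches the shape of the broken transcript.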

If I were to replace the above code with:

  "llevas puest$$"
  If obj is femenina
    then
      If obj is plural
        then "as"
        else "a"
      End If.
    else
      If obj is plural
        then "os"
        else "o"
      End If.
  End If.

the output wouldn't be truncated prematurely. Apparently, ALAN sees the $$ and waits before wrapping. The problem is that this variation is more verbose than the code currently in use, where the final 's' is only added if the noun is plural (so there's no $$ after the preceding vowel, since a plural 's' might not be needed).

Probably I should add a proper minimum viable ad hoc test in the alan-bugs-testbed, but I wanted to mention it right away when I discovered it, and begin by posting here on ALAN i18n, since this affects the Spanish library and we all need to be aware of the issue and decide if it's worth using the longer code to prevent breaking the word.

Also, I'm not sure why ALAN is wrapping at 75, since I believe the default is 80 columns. I think this issue of incorrect wrapping came up before, and was due to miscounting the various special $ symbols in a way that affected the column bookkeeping used to decide when to wrap. But I thought that the problem had been solved already.

In any case, this problem also affects punctuation, for I noticed in various transcripts that ALAN wraps lines just before a ., , or ) (or other punctuation marks), which doesn't look nice either. I'm not sure if this is due to the presence of a $$ in the previous token or preceding the punctuation mark, but definitely ALAN should do some lookahead scrutiny before wrapping, to check that the next string "token" is not something that needs to be adjoined with the current one.

From what I remember from peeking at the ALAN sources, the way output strings work in ALAN is a bit intricate, since some strings are retrieved from disk (those that are within quotes in the source) while others are taken from memory (those stored as attributes), and the way these are handled is a bit complex due to Huffman compression. So the whole process is a very fragmented series of long jumps in C, where the various snippets that will form a string are retrieved as the AMachine munches code in real time.

I'm not sure where the part that handles wrapping falls in the process, but it looks like strings are truncated as they are being "stitched together", i.e. there's no "paragraph buffer" where they are stored for later inspection-&-wrapping. I guess that adding some lookahead functionality to prevent cases like the above would probably require lots of code changes.
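For comparison, the "paragraph buffer" idea can be sketched in Python as follows. This is only a toy model under the same assumptions as the issue text (a leading `$$` glues a fragment to the previous one), not a proposal for the actual C implementation; `join_glued` and `buffered_wrap` are hypothetical names.

```python
def join_glued(fragments):
    """Merge $$-glued fragments into whole words before any wrapping."""
    words = []
    for frag in fragments:
        if frag.startswith("$$"):
            if words:
                words[-1] += frag[2:]   # attach suffix or punctuation
            else:
                words.append(frag[2:])
        else:
            words.append(frag)
    return words


def buffered_wrap(fragments, width):
    """Wrap only at boundaries between complete words."""
    lines, current = [], ""
    for word in join_glued(fragments):
        if current and len(current) + 1 + len(word) > width:
            lines.append(current)
            current = word
        else:
            current = word if not current else current + " " + word
    if current:
        lines.append(current)
    return lines


fragments = ["Los", "gemelos", "llevan", "puest", "$$o", "$$s",
             "unos", "pantalones", "$$."]
for line in buffered_wrap(fragments, 25):
    print(line)
```

Because the wrap decision is deferred until a word is known to be complete, both the plural 's' and the trailing period stay attached to their words. The cost is that the text has to be held back until the next unglued fragment arrives.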

thoni56 commented 2 years ago

I will have a look at this. The 75 is mysterious, so at least I'm curious why that is.

This only affects the command line interpreter, of course, since all GLK-terps have their own wrapping logic. ("Only" here only means "just" since we all use the command line interpreter for automated testing, so it is important.)

Or, do you see this in WinArun or Gargoyle too?

tajmone commented 2 years ago

> Or, do you see this in WinArun or Gargoyle too?

No, only ARun for automated testing. I might use the GUI terps for visual inspection of text styles and multimedia, since these can't be verified via the command line, but they are not part of the test suite, strictly speaking.

> I will have a look at this. The 75 is mysterious, so at least I'm curious why that is.

From the way text wraps before $$ "gluing" of segments, I had the general impression the problem might have to do with when the wrapping function takes action — i.e. some sort of premature decision, not accounting for extra text along the line that might be annexed to the current lexeme.

The 75 mystery keeps cropping up for some reason. I remember that there was already a fix for this in the past, but it seems that the transition to UTF-8 has brought the problem back. So I wonder if encoding might be the culprit here, i.e. the counter not accounting for multi-byte characters in the source that then become single bytes in ISO.
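The encoding hypothesis can be illustrated with a short Python sketch. It only shows how a byte-based counter and a character-based counter diverge on Spanish text; it does not confirm that this is what ARun does, and the sample string is made up for the example.

```python
sample = "¿Qué llevan los niños?"

chars = len(sample)                       # what the reader sees as columns
utf8_bytes = len(sample.encode("utf-8"))  # what a byte counter would see
iso_bytes = len(sample.encode("latin-1")) # single byte per character in ISO-8859-1

print(chars, utf8_bytes, iso_bytes)
print("drift:", utf8_bytes - chars, "columns")
```

Each accented letter or inverted punctuation mark takes two bytes in UTF-8 but one in ISO-8859-1, so a counter that measures the UTF-8 source would overestimate the column by one for every such character on the line.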