API: unrecognized vs zero

Choose the value that a recognizer returns when a lexeme is not recognized.

According to the Recognizer v4 API, a recognizer should return a special system-dependent value when recognition fails. The word rectype-null (or unrecognized in the 2020 rephrasing) returns this value.

In v4 this choice is justified only by the (rather unfair) implementations of POSTPONE and INTERPRET. That is, of the two sides, implementing a system and using a system, only one was considered: the system-internal implementation.

But using recognizers is always simpler if a recognizer returns zero when recognition fails.

1. Using simple recognizers is easier if they return 0 on failure. In any case there is no need to check for a particular descriptor (since only a single descriptor can be returned), so the code simply becomes shorter when 0 means failure. Compare the zero-returning variant (first) with the variant that tests against unrecognized (second):

"123" recognize-snum if ." it is a number" . else ." it is not a number" then cr

"123" recognize-snum unrecognized <> if ." it is a number" . else ." it is not a number" then cr

2. Creating a new recognizer as a colon definition on top of other recognizers is simpler if they return 0 on failure. Checking for the special unrecognized value is just redundant boilerplate with no benefit. Compare the zero-returning variant (first) with the variant that uses unrecognized (second):

: recognize-quoted ( c-addr u -- xt td-lit | 0 )
  [char] ' match-char-head 0= if 2drop 0 exit then
  recognize-nt if name> dup if td-xt then exit then 0
;

: recognize-quoted ( c-addr u -- xt td-lit | td-unrecognized )
  [char] ' match-char-head 0= if 2drop unrecognized exit then
  recognize-nt unrecognized <> if name> dup if td-xt exit then drop then unrecognized
;

3. Using recognizers together with other common libraries is easier if they return 0 on failure, since distinguishing zero from nonzero values is a very common and convenient approach that is used everywhere.

4. If a recognizer returns a special value on failure, the specification has to forbid zero as a descriptor to avoid error-prone code (see an example: "The code assumes that the numeric value of any rectype-data item is never zero"). This restricts implementations that can and want to use zero as the special value for failure. As a result we end up with two special values instead of one, and one of them is even forbidden to be used.
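Points 1 and 2 can be tried out with a toy recognizer. The following is only a minimal sketch in standard Forth, not code from any recognizer proposal: td-digit, recognize-digit, recognize-#digit and .digit? are hypothetical names, and the descriptor is just an arbitrary non-zero constant.

```forth
\ Toy sketch; all names defined here are hypothetical.
1 constant td-digit   \ any non-zero cell works as a descriptor in this sketch

: recognize-digit ( c-addr u -- n td-digit | 0 )
  1 <> if drop 0 exit then          \ accept one-character lexemes only
  c@ [char] 0 -                     \ character -> candidate digit value
  dup 10 u< if td-digit exit then   \ 0..9: recognized
  drop 0 ;                          \ anything else: not recognized

\ Point 2: a recognizer built on another one passes its result through as is.
: recognize-#digit ( c-addr u -- n td-digit | 0 )
  dup 0= if 2drop 0 exit then                 \ empty lexeme
  over c@ [char] # <> if 2drop 0 exit then    \ lexeme must start with '#'
  1 /string recognize-digit ;                 \ 0 or ( n td-digit ) propagates

\ Point 1: the caller tests the result directly with IF.
: .digit? ( c-addr u -- )
  recognize-#digit if ." digit " . else ." not a digit" then cr ;

s" #7" .digit?   \ prints: digit 7
s" #x" .digit?   \ prints: not a digit
```

With a special failure value, both recognize-#digit and .digit? would need an extra comparison against that value before they could branch.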

One argument in favour of making recognizers return a special (formally valid and non-zero) descriptor on failure is that a further action can be performed without any checks, which makes that code shorter.

But a strong counter-argument is that if we change the special non-zero value to zero, the total lexical size of the overall code decreases.

Actually, in any Forth system there are at most three places that become longer with this change. In API v4 they are the following definitions: rectype>int, rectype>comp, rectype>post. An additional check has to be added to these definitions to throw an exception.
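The additional check could look roughly like this. It is a sketch under an assumption that is not stated in this post: that a v4-style rectype is the address of three consecutive cells holding the interpret, compile and postpone xts. ?rectype and rectype-demo are hypothetical names, and -13 is the same throw code used for ?found later in this thread.

```forth
\ Sketch only; assumes a rectype is the address of three cells of xts
\ and that 0 now stands for "not recognized".
: ?rectype ( rectype | 0 -- rectype ) dup 0= -13 and throw ;

: rectype>int  ( rectype | 0 -- xt-int  ) ?rectype @ ;
: rectype>comp ( rectype | 0 -- xt-comp ) ?rectype cell+ @ ;
: rectype>post ( rectype | 0 -- xt-post ) ?rectype cell+ cell+ @ ;

\ A dummy rectype for demonstration: three arbitrary xts.
create rectype-demo  ' dup ,  ' drop ,  ' swap ,

rectype-demo rectype>comp ' drop = .   \ prints -1 (true)
0 ' rectype>int catch . drop           \ prints -13
```

The check lives in one shared helper, so the cost is one extra lexeme in each of the three definitions plus the helper itself.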

But every place where unrecognized (or rectype-null in v4) is used in a comparison becomes shorter. How often does that occur? In the Gforth sources: 6 times.

In my set of integer number recognizers: 3 times. In ]], postpone, locate, recognize-n-0x, recognize-frac, recognize-word-parsing, recognize-word-parsing-inline, recognize-quoted, recognize-pqname: 10 times (more than once in some recognizers).

As we can see, almost every non-trivial recognizer becomes lexically shorter. And all other recognizers become literally shorter, since 0 (or false) is shorter than unrecognized (or rectype-null).

So we already have 19 examples. They outweigh by a wide margin the three places where the code becomes longer.

The command used to search for comparisons against rectype-null:

ag -G '.*\.f' --color -H -i 'rectype-null\s+(=|<>)'  | less -SnR

Here is another argument against a special xt on failure (called notfound in proposal [160], "minimalistic core API for recognizers").

Actually, this is a question of choosing between a special object and the common object (zero) for the failure case.

Many other words return an object id on success or zero on failure, not a special object on failure.

For example:

name>interpret ( nt -- xt | 0 )
find-name ( sd.name -- nt | 0 )
find-name-in ( sd.name wid -- nt | 0 )
find ( c-addr -- xt n | c-addr 0 )
search-wordlist ( sd.name wid -- xt n | 0 )
source-id ( -- fileid | -1 | 0 ) (not a failure case, but another example where zero was chosen instead of a special object)

Why should recognizers break with this practice and return a special id on failure instead of zero?

Yes, choosing a special object on failure makes the code smaller in some places where it is used, but it makes the code longer in more places. So it just increases the overall lexical size.

I checked the source code of Gforth (as of 2023-09-17), which includes both the implementation and the usage of this API:

['] notfound is used with = or <> 10 times, and without a check 32 times.
forth-recognize execute is used 3 times.

If we use 0 (zero) instead of notfound xt (aka unrecognized), then:

['] notfound <> is removed 5 times (-15 lexemes)
['] notfound = is replaced by 0= 5 times (-10 lexemes)
['] notfound is replaced by 0 32 times (-32 lexemes)
remove the definition of notfound and add : ?found dup 0= -13 and throw ; (at most +3 lexemes)
replace forth-recognize execute by forth-recognize ?found execute 3 times (+3 lexemes)
?found can also be used after find, search-wordlist, find-name, find-name-in when the user needs to execute their result at once.

Thus, replacing notfound by zero reduces the overall lexical size in Gforth by at least 51 lexemes (-15 - 10 - 32 + 3 + 3 = -51), which is more than 0.4KiB in absolute size.

An excerpt from the updated post above:

replace forth-recognize execute by forth-recognize ?found execute

One more argument: the phrase forth-recognize execute reads badly (it gives the wrong impression).

The name forth-recognize is similar to forth-wordlist, which gives the impression that a recognizer is returned. The phrase forth-recognize execute then gives the impression that we execute a recognizer.

Actually, in these use cases we want to translate a lexeme. Thus, it is better to introduce a word like translate-lexeme ( i*x sd.lexeme -- j*x ) for that.

I would define this word as follows.

: ?found ( xt.translator -- xt.translator | 0 -- never ) dup if exit then -13 throw ;
: translate-lexeme ( i*x sd.lexeme -- j*x ) perceive ?found execute ;

Where

: perceive ( sd.lexeme -- token translator | 0 ) perceptor execute ;
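Since perceptor is not defined in this thread, the same pattern can be tried with a standard word that already returns zero on failure. The following is only a sketch: run-word is a hypothetical name, and ?found is repeated from above so that the snippet stands alone.

```forth
\ ?found as defined above: pass a non-zero value through, throw -13 on zero.
: ?found ( x -- x | 0 -- never ) dup if exit then -13 throw ;

\ Hypothetical helper: look up a name in the Forth word list and execute it.
\ search-wordlist returns ( xt 1 | xt -1 | 0 ), so ?found checks the flag
\ and drop then removes it, leaving the xt for execute.
: run-word ( i*x c-addr u -- j*x )
  forth-wordlist search-wordlist ?found drop execute ;

2 3 s" +" run-word .                          \ prints 5
s" no-such-word" ' run-word catch . 2drop     \ prints -13
```

The 2drop on the last line discards the two string cells that catch leaves on the stack after the throw.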

