Ragel v7 scanner does not set/reset ts/te after the loop ending in a full token

gh-andre commented 3 years ago

This is a regression from Ragel 6.10, which set ts and te pointers as described in the manual, which says this:

This variable may be used in action code for retrieving the text of the current match. Ragel ensures that in between tokens and outside of the longest-match machines that this pointer is set to null. In between calls to the execute code the user must check if ts is set and if so, ensure that the data it points to is preserved ahead of the next buffer block.

Here's the program to show this behavior:

#include <cstdio>
#include <cstdlib>
#include <string>

%%{
machine part_token;

main := |*
     ' '+;
     ',';
     'abcd' => { printf("token: %s\n", std::string(ts, te).c_str()); };
  *|;
}%%

%% write data;

int main(int argc, char *argv[])
{
  if(argc != 2) {
    printf("Usage: scan token\n");
    return EXIT_FAILURE;
  }

  std::string in(argv[1]);

  int cs = 0, act = 0;
  const char *p = in.c_str();
  const char *pe = in.c_str() + in.length();
  const char *ts = nullptr, *te = nullptr;
  const char *eof = pe;

  %% write init;
  %% write exec;

  printf("done\n");

  if(cs == part_token_error)
     printf("Error near %zd\n", p-in.data());
  else if(ts)
    printf("offsets: ts %zd te: %zd pe: %zd\n", ts-in.data(), te-in.data(), pe-in.data());
  else
    printf("ts is null\n");

  return EXIT_SUCCESS;
}

Compiled on CentOS 8 Stream with Ragel 7.0.0.12, which is available as a package, and with Ragel 6.10, built on the same machine from the source at http://www.colm.net/files/ragel/ragel-6.10.tar.gz.

ragel -o scan.cpp scan.rl
g++ -std=c++17 -o scan scan.cpp

Both scanners ran this way:

./scan abcd
./scan abcd,ab

Ragel 6.10 produces this output:

./scan abcd
token: abcd
done
ts is null

./scan abcd,ab
token: abcd
done
offsets: ts 5 te: 5 pe: 7

Ragel 7 produces this output:

./scan abcd
token: abcd
done
offsets: ts 4 te: 4 pe: 4

./scan abcd,ab
token: abcd
done
offsets: ts 5 te: 5 pe: 7

ts is no longer reset to a null pointer after the execution loop if the input ends in a full token.

adrian-thurston commented 2 years ago

Hi @gh-andre thanks for reporting this.

This is a consequence of

Ragel using from-state actions to set ts = p on the start of a token (always the case).
In ragel 7 from-state actions getting executed on EOF. This was done to better align with leaving-state actions getting executed on EOF. It helps with zero-length matches.

I'm not sure what to do about this at the moment, but thinking about it.

adrian-thurston commented 2 years ago

Note that this is only the case after EOF. It means you can check that EOF has not been reached yet. It's still not ideal though. It should stay at zero.

adrian-thurston commented 2 years ago

Proposed solution in df0a63a7e779fa8e25c92099e65e9fd10a053241.

gh-andre commented 2 years ago

@adrian-thurston Thank you for looking into this. I worked it around in my code by checking whether ts and pe are the same, in addition to checking whether ts is a null pointer or not. I'm not familiar with Ragel source and cannot comment on the proposed fix, but will be able to test it once I'm back in the office next week, if you don't mind the delay.

gh-andre commented 2 years ago

@adrian-thurston Sorry for the delay - colm was conflicting with the one installed via dnf and I needed how to build both from the commit.

I tested various scenarios and the new Ragel behaves the same as v6.10. I also tested the scenario I forgot to mention, when the first token is incomplete (i.e. ./scan abc) and verified that ts is set at the start of input, while te is not set and in this example remains nullptr, while pe in each case is advanced past the end of the last character (i.e. points to the null character).

Seems to be working well for the scenarios I can think of. Thanks again for looking into this.

adrian-thurston commented 2 years ago

@gh-andre great, thanks for trying it out!

adrian-thurston / ragel

Ragel v7 scanner does not set/reset ts/te after the loop ending in a full token #76