Genivia / RE-flex

A high-performance C++ regex library and lexical analyzer generator with Unicode support. Extends Flex++ with Unicode support, indent/dedent anchors, lazy quantifiers, functions for lex and syntax error reporting and more. Seamlessly integrates with Bison and other parsers.
https://www.genivia.com/doc/reflex/html
BSD 3-Clause "New" or "Revised" License
507 stars 85 forks source link

Please explicitly document the content ownership semantics for `reflex::Input` #135

Closed tlemo closed 2 years ago

tlemo commented 2 years ago

I just discovered, the hard way, that reflex::Input doesn't make a copy of the source text. This, by itself, may be a valid design choice, but it doesn't seem to be documented and can lead to pain and suffering in seemingly simple use cases, for example:

std::string foo() { ... }

// The temporary string is implicitly converted to `reflex::Input`
Lexer(foo()).lex(); 

A conspicuous note about the content lifetime constraints would help new users avoid this pitfall.

Also, a Input(std::string&&) = delete; constructor might catch some of the common, although not all, cases.

genivia-inc commented 2 years ago

Good point. Love the feedback to improve the project.

Lexer input is incrementally copied to a mutable buffer with a fixed length. The buffer must be mutable, because text() (and yytext) places a \0 in the buffer. The source input (e.g. a string) is just specified a (persistent) source of input that is never modified by the scanner (or copied). If it is a temporary string, then the string will be lost before it can be scanned, which is the problem here.

I like the suggestion to use Input(std::string&&) = delete, but which is not portable to classic C++, which made me reluctant to use it in the past for this project, although we can add a #if __cplusplus >= 201103L check I suppose.

FYI. A mutable buffer can be specified for the lexer to scan with the buffer(source, length + 1) method, same as Flex yy_scan_buffer(source, length + 2).

genivia-inc commented 2 years ago

The generated lexer source will have additional comments to clarify the input/output, for example:

class yyFlexLexer : public FlexLexer {
 public:
  yyFlexLexer(
      // a persistent source of input, empty by default
      const reflex::Input& input = reflex::Input(),
      // optional output stream, NULL means std::cout by default
      std::ostream *os = NULL)
    :
      FlexLexer(input, os)
  {
  }

I've already updated the online documentation section that describes sources of input to the lexer, by adding a note: https://www.genivia.com/doc/reflex/html/index.html#reflex-input

tlemo commented 2 years ago

I've already updated the online documentation section that describes sources of input to the lexer, by adding a note: https://www.genivia.com/doc/reflex/html/index.html#reflex-input

The note looks good to me, thanks for the quick response!