Genivia / RE-flex

A high-performance C++ regex library and lexical analyzer generator with Unicode support. Extends Flex++ with Unicode support, indent/dedent anchors, lazy quantifiers, functions for lex and syntax error reporting and more. Seamlessly integrates with Bison and other parsers.
https://www.genivia.com/doc/reflex/html
BSD 3-Clause "New" or "Revised" License
529 stars 86 forks source link

Multiple definitions of reflex_code_INITIAL when using multiple lexers #173

Closed alastairpatrick closed 1 year ago

alastairpatrick commented 1 year ago

My program has two lexers (source attached). The generated source code contains multiple functions with the same signature and external linkage, resulting in this build warning from MSVC linker:

D:\src\yycc\out\build\x64-Debug\PPTokenLexer.yy.cpp.obj : warning LNK4006: "void __cdecl reflex_code_INITIAL(class reflex::Matcher &)" (?reflex_code_INITIAL@@YAXAEAVMatcher@reflex@@@Z) already defined in IdentifierLexer.yy.cpp.obj; second definition ignored

This is the declaration of the generated function causing the warning:

extern void reflex_code_INITIAL(reflex::Matcher&);

Based on what the RE-flex user manual says in section "Combining multiple lexers", I think the intent is that this should work and that it should not be necessary to put the lexers in separate namespaces.

I modified RE-flex to instead give the function internal linkage, which worked. I'm not sure if it is the right approach though, hence bug report rather than pull request.

void Reflex::write_lexer()
{
  // ...

  if (options["matcher"].empty() && !options["fast"].empty())
  {
    for (Start start = 0; start < conditions.size(); ++start)
    {
      if (!options["namespace"].empty())
        write_namespace_open();

      // I changed "extern" to "static" on this line
      *out << "static void reflex_code_" << prefix_opt << conditions[start] << "(reflex::Matcher&);\n";

      if (!options["find"].empty())
        *out << "extern const reflex::Pattern::Pred reflex_pred_" << prefix_opt << conditions[start] << "[];\n";
      if (!options["namespace"].empty())
        write_namespace_close();
    }
    *out << '\n';
  }
  //...

RE-flex version is 3.3.0.

PPTokenLexer.l.txt IdentifierLexer.l.txt

genivia-inc commented 1 year ago

Thanks for the feedback. I will take a look at this soon to decide if static is permissible with respect to the intended use cases.

genivia-inc commented 1 year ago

Changing to static is permissible if option --tables-file is not used. This option produces a separate source code file with the DFA code and/or tables that are compiled and linked with the generated scanner code. However, changing to static also causes problems with --fast not compiling any longer. So this needs more work. The first step in that direction is:

      if (!options["tables_file"].empty())
      {
        *out << "extern void reflex_code_" << prefix_opt << conditions[start] << "(reflex::Matcher&);\n";
        if (!options["find"].empty())
          *out << "extern const reflex::Pattern::Pred reflex_pred_" << prefix_opt << conditions[start] << "[];\n";
      }
      else
      {
        *out << "static void reflex_code_" << prefix_opt << conditions[start] << "(reflex::Matcher&);\n";
        if (!options["find"].empty())
          *out << "static const reflex::Pattern::Pred reflex_pred_" << prefix_opt << conditions[start] << "[];\n";
      }
alastairpatrick commented 1 year ago

Perhaps instead of replacing extern with static, it would help to incorporate the lexer class name in the function name. Currently, it generates:

extern void reflex_code_INITIAL(reflex::Matcher&);

Suppose I have two lexer classes named "PPTokenLexer" and "IdentifierLexer", it could instead generate these for their INITIAL states:

extern void reflex_code_PPTokenLexer_INITIAL(reflex::Matcher&);
extern void reflex_code_IdentifierLexer_INITIAL(reflex::Matcher&);

It seems to already do something along these lines when --flex mode is in use. For example, if two flex lexers were prefixed "pptok" and "id", the current code would (I think) generate these:

extern void reflex_code_pptokINITIAL(reflex::Matcher&);
extern void reflex_code_idINITIAL(reflex::Matcher&);

So perhaps everything could be made consistent by always using the lexer class name and not including the prefix (since the prefix is incorporated in the lexer class name). For example, in my --flex mode example above, these might be generated:

extern void reflex_code_pptokFlexLexer_INITIAL(reflex::Matcher&);
extern void reflex_code_idFlexLexer_INITIAL(reflex::Matcher&);
genivia-inc commented 1 year ago

Renaming should work. But actually, C++ namespaces work fine too. Just add a namespace directive to the lexer file:

%option namespace=pptok
genivia-inc commented 1 year ago

Or you can use

%option prefix=pptok_

to prefix the definitions to get reflex_code_pptok_INITIAL.

alastairpatrick commented 1 year ago

Setting a different namespace for each lexer works as expected.

Setting a prefix leads to the lex() member function of the lexer being renamed. For example, if my lexer contained:

%option lexer=PPTokenLexer
%option prefix=pptok_
%option fast

Above generates:

class PPTokenLexer : public reflex::AbstractLexer<reflex::Matcher> {
 public:
...
  int pptok_lex(const reflex::Input& input)

Naming the function pptok_lex() makes sense for a flex lexer callable from C but for a C++ member function of the lexer class, simply lex() seems preferable. So I think namespaces are the way to go.

Regardless, using distinct prefixes for each lexer does not actually fix the linker warnings; I found another related issue with the generated code while experimenting with prefixes. More code generated for the lexer above follows:

extern void reflex_code_pptok_INITIAL(reflex::Matcher&); // <-- fumction name contains lexer prefix here
...

void reflex_code_INITIAL(reflex::Matcher& m) // <-- but it's missing here later in the same file
{
  int c0 = 0, c1 = 0;
  m.FSM_INIT(c1);

S0:
  m.FSM_FIND();
  c1 = m.FSM_CHAR();
  if (c1 == '/') goto S6;
  if (c1 == '#') goto S12;
...

Even with a distinct prefix given to each lexer, function definitions generated for the INITIAL states all have the same name. Further, the forward declarations of these functions have different names.

genivia-inc commented 1 year ago

Note that the --prefix option is a Flex option. This is a Flex-compatible option that emulates the same behavior as the Flex++ C++ output. This affects a slew of things, including the generated source code filenames.

Strange that in your case the DFA table name lacks the prefix. I'll check it out.

C++ namespaces are preferable. See also https://www.genivia.com/doc/reflex/html/index.html#reflex-multiple

genivia-inc commented 1 year ago

I've committed an update v3.3.1 with a minor change to the reflex tool output for --prefix which apparently in some cases was missing the specified prefix is in the file names and the FSM table name. I don't believe many people use the old Flex-compatible --prefix option with RE/flex ;-)