Closed ivanpedruzzi closed 2 years ago
Thank you for your kind comments.
I was wondering if you have a suggestion for how to save the scanner state and position and how to restart the scanner at specific position and state.
Saving the scanner state with push_state() (Flex yy_push_state()) and restoring it with pop_state() (Flex yy_pop_state()) does not save and reposition the location of the input to scan. Scanning will simply continue.
Saving the scanner buffer (input state) with push_matcher() (Flex yypush_buffer_state()) can be used to temporarily switch to a different input source to scan, then restore the original input with pop_matcher(). You could use this to scan the same file from the current location, but you will have to open the current file as a new file and move the cursor to the location with seek(), scan the input, and close it before pop_matcher().
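The "reopen and seek" step described above can be sketched with plain C stdio. This is only an illustration under stated assumptions: read_from_offset is a hypothetical helper, not part of RE/flex, and it reads bytes where a real scanner would instead scan the reopened file between push_matcher() and pop_matcher().

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not a RE/flex API): reopen `path` as a new file,
// move to the saved byte `offset`, and read up to `count` bytes from there.
// The original file handle and its position are left untouched.
std::string read_from_offset(const char *path, long offset, std::size_t count)
{
  std::string out;
  std::FILE *fp = std::fopen(path, "rb");
  if (fp == nullptr)
    return out; // could not reopen the file: return nothing
  if (std::fseek(fp, offset, SEEK_SET) == 0)
  {
    out.resize(count);
    std::size_t got = std::fread(&out[0], 1, count, fp);
    out.resize(got); // a short read near EOF yields fewer bytes
  }
  std::fclose(fp); // close the temporary handle before resuming
  return out;
}
```

The saved offset has to be a byte offset into the file, so it should be recorded in bytes rather than in characters when the input is multibyte.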
Does the scanner need to consume the complete input, or can it operate on demand, processing chunks?
Normally the scanner reads the input in 64K chunks and does not keep the whole file in memory, unless you use in(str) to scan a string str or, more efficiently, buffer(buf, n+1) (Flex yy_scan_buffer(buf, n+1)) to scan buf as input with n bytes (one or two extra bytes are needed in buf to store a terminating \0).
Robert, I am failing to replicate what I had working with Flex. In my original implementation I simply overrode the scanner's LexerInput to wire in my own data stream.
With reflex, even if I override LexerInput, the Matcher checks its own Input object.
yylex calls Matcher.buffer(). This is the basic loop in method "buffer", where method "get" calls LexerInput:
while (in.good()) { (void)grow(); end_ += get(buf_ + end_, max_ - end_); }
Should the while loop simply check whether the number of bytes returned from "get" is less than what was requested?
My second attempt was to create my own implementation of reflex::Input, but the Matcher uses an instance instead of a pointer, so there seems to be no way to make the Matcher use an Input subclass.
Clearly, I am missing something obvious.
The get() method can return fewer than max_ - end_ bytes, but zero returned means EOF. I see that the loop is not checking that, so we need to change it to:
while (in.good()) // there is more to get while good(), e.g. via wrap()
{
  (void)grow();
  size_t len = get(buf_ + end_, max_ - end_);
  if (len == 0)
    break;
  end_ += len;
}
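To see why the zero check matters, here is a minimal, self-contained sketch of the same fill pattern. It is not RE/flex's code: ChunkSource and fill are hypothetical stand-ins, with ChunkSource::get() playing the role of a get() that may return short reads and returns 0 only at end of input.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical input source: hands out at most `chunk` bytes per call and
// returns 0 only when no input remains.
struct ChunkSource
{
  std::string data;  // remaining input
  std::size_t chunk; // maximum number of bytes returned per call

  ChunkSource(const std::string& d, std::size_t c) : data(d), chunk(c) {}

  std::size_t get(char *dst, std::size_t max)
  {
    std::size_t n = std::min({max, chunk, data.size()});
    std::memcpy(dst, data.data(), n);
    data.erase(0, n);
    return n;
  }
};

// Fill a growable buffer until get() reports end of input by returning 0.
std::vector<char> fill(ChunkSource& src)
{
  std::vector<char> buf(8);
  std::size_t end = 0;
  for (;;)
  {
    if (end == buf.size())
      buf.resize(2 * buf.size()); // plays the role of grow()
    std::size_t len = src.get(buf.data() + end, buf.size() - end);
    if (len == 0)
      break; // zero means EOF; a short read does not
    end += len;
  }
  buf.resize(end);
  return buf;
}
```

Comparing the returned length against the requested length would stop too early here, because every call except possibly the last returns fewer bytes than requested.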
Does that help?
In addition to your astute suggestion, I had to convert from Unicode to UTF-8 before copying data into the buffer.
size_t CScannerBase::LexerInput(char* buf, size_t char_max_size)
{
m_wchar_buffer = ...
Why rely on a null terminator? The lexer could use the length returned.
Next, I am working on how to restore the Matcher's state and resume from a given offset.
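The conversion step mentioned above can be sketched as follows. This is an illustration only, not the code from this thread: Utf8Feeder and append_utf8 are hypothetical names, the input is assumed to be a sequence of valid Unicode code points (char32_t) rather than UTF-16 wide characters, and read() merely mimics a LexerInput-style contract of returning a byte count and 0 at end of input.

```cpp
#include <cstddef>
#include <string>

// Append the UTF-8 encoding of code point `cp` to `out`.
// Assumes `cp` is a valid Unicode scalar value (no surrogates, <= 0x10FFFF).
void append_utf8(std::string& out, char32_t cp)
{
  if (cp < 0x80)
  {
    out += static_cast<char>(cp);
  }
  else if (cp < 0x800)
  {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  else if (cp < 0x10000)
  {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  else
  {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Hypothetical feeder: converts code points to UTF-8 on demand and hands the
// bytes out in pieces of at most `max` bytes, returning 0 at end of input.
class Utf8Feeder
{
 public:
  explicit Utf8Feeder(const std::u32string& text) : text_(text), pos_(0) {}

  std::size_t read(char *buf, std::size_t max)
  {
    // Convert just enough code points to satisfy this request.
    while (pending_.size() < max && pos_ < text_.size())
      append_utf8(pending_, text_[pos_++]);
    std::size_t n = pending_.size() < max ? pending_.size() : max;
    pending_.copy(buf, n);
    pending_.erase(0, n); // leftover bytes are returned by the next call
    return n;
  }

 private:
  std::u32string text_;
  std::size_t pos_;
  std::string pending_; // converted bytes not yet handed out
};
```

The bytes of one character may straddle two calls in this sketch, which is harmless as long as the consumer concatenates the pieces in order. No terminating \0 is written, since the returned length delimits each piece.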
Why rely on a null terminator? The lexer could use the length returned.
The size returned is used by the scanner. LexerInput is called until it returns 0, as per the documentation:
The LexerInput method may be invoked multiple times by the matcher engine and should eventually return zero to indicate the end of input is reached (e.g. when at EOF).
The string returned piece by piece in the buffer buf does not need to have a final terminating \0.
I'm pretty sure this is the way Flex works, which RE/flex replicates.
Robert, is there a switch in the code generator that could change the following initial lines in the yylex function from local variables to virtual functions or even static data members, so they can be used in other methods?
int CJSONScanner::yylex(void)
{
  static const char *REGEX_INITIAL = "....";
  static const reflex::Pattern PATTERN_INITIAL(REGEX_INITIAL);
  ....
I would like to initialize the matcher in the constructor as shown below, but I do not want to copy the code every time REGEX_INITIAL is generated. Note that I want the matcher to buffer at most 1024 characters.
CJSONScanner::CJSONScanner(IEditBlockBuffer* pSmartBuf, bool bCacheTokens /*= true*/)
  : CScannerBase(pSmartBuf)
{
  static const char *REGEX_INITIAL = "..";
  static const reflex::Pattern PATTERN_INITIAL(REGEX_INITIAL);
  matcher(new Matcher(PATTERN_INITIAL, stdinit(), this));
  matcher().buffer(1024);
  YY_USER_INIT
Perhaps static data members would be best to be able to share their values. There is currently no option switch to do this in reflex. IMO moving REGEX_INITIAL and PATTERN_INITIAL to become static members is nice. I will have to check if that always works without affecting something.
Also, some way to insert code at YY_USER_INIT to initialize the matcher would be nice, e.g. to set the buffer size.
I'll take a look.
I propose to add the following directive %begin to inject code into the scanner before scanning starts:
%begin{
BEGIN
}
This outputs the BEGIN code right after the matcher is created and before scanning starts:
if (!has_matcher())
{
  matcher(new Matcher(PATTERN_INITIAL, stdinit(), this));
  YY_USER_INIT // inserted with flex option
  BEGIN
}
PS. I don't think that creating static members for REGEX_INITIAL and PATTERN_INITIAL is needed with this approach. If need be, these can be passed to some method calls specified in BEGIN.
I do not believe this solves the problem I reported. I would like to initialize the matcher elsewhere, but it seems that your change only covers the initialization in function yylex(). Having PATTERN_INITIAL generated as a local variable seems extremely rigid.
Making the regex and pattern objects static members of the lexer class does not make a lot of sense. The regex and pattern values are kept hidden, because their types and values depend on options, such as --fast and --matcher=boost. There is no regex string when --fast and --full are used. The pattern value is a different type, depending on the matcher engine. Making these accessible as members will cause compilation errors when options change. It is also possible that future reflex versions use different types and/or additions to these. That is why I placed them as locals in the lex function. Creating members for these does not solve anything that cannot be solved already by adding a BEGIN. For example, you want to add this line to initialize the matcher:
matcher().buffer(1024);
This only affects the matcher, not the regex or pattern. Those are different (hidden) entities. So you'll just have to declare:
%begin{
matcher().buffer(1024);
}
I don't understand your comment on why you really need the regex and patterns to be static members. Including user-definable BEGIN code solves your problem. It can also serve additional purposes. We already have YY_USER_INIT, which is inserted with the --flex option, but none is inserted otherwise. This plugs that hole.
I will also update option --batch to take an argument. So --batch=1024 generates code to initialize the matcher:
matcher().buffer(1024);
Hello, I love Re/flex so far and the support for Unicode is FANTASTIC!
I was wondering if you have a suggestion for how to save the scanner state and position and how to restart the scanner at specific position and state. Any hint will be greatly appreciated.
Does the scanner need to consume the complete input, or can it operate on demand, processing chunks? For example, if I provide an istream implementation with GBs of data, can the scanner load lazily without pulling the entire data into memory?
Thank you -Ivan