Closed mingodad closed 5 months ago
Here is the profile of parsing (cut down) sqlite3s.c
:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
61.34 0.60 0.60 4 148.75 205.00 parser_real_parse
8.25 0.68 0.08 8237500 0.00 0.00 parser_depthOfLastAccept
6.19 0.74 0.06 53452 0.00 0.00 TL_TLS_TLSIND
5.15 0.79 0.05 27193653 0.00 0.00 tree_setKidTree
4.64 0.83 0.05 456949 0.00 0.00 scanner_scanToken
4.12 0.87 0.04 9388504 0.00 0.00 tree_newKids
2.06 0.89 0.02 2544524 0.00 0.00 tree_newTreeClone
2.06 0.91 0.02 1 20.00 20.00 unparser_printLeavesTraversal
1.55 0.93 0.02 16454632 0.00 0.00 tree_setTreeCount
1.55 0.94 0.02 parser_installAsId
1.03 0.95 0.01 16454632 0.00 0.00 tree_setKidCount
1.03 0.96 0.01 1522973 0.00 0.00 tree_makeOneKid
0.52 0.97 0.01 129438 0.00 0.00 scanner_skipSeparators
0.52 0.97 0.01 1 5.00 5.00 options
Here is the profile of parsing (full) sqlite3.c
:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
69.15 190.02 190.02 76141046 0.00 0.00 parser_depthOfLastAccept
29.95 272.33 82.31 336 0.24 0.82 parser_real_parse
0.17 272.80 0.47 85482864 0.00 0.00 tree_newKids
0.17 273.27 0.47 3537875 0.00 0.00 scanner_scanToken
0.13 273.62 0.35 274171134 0.00 0.00 tree_setKidTree
0.08 273.84 0.22 142123713 0.00 0.00 tree_setKidCount
0.05 273.98 0.15 142123713 0.00 0.00 tree_setTreeCount
0.05 274.12 0.14 30448736 0.00 0.00 tree_newTree
0.03 274.20 0.08 10157535 0.00 0.00 tree_newTreeInit
0.03 274.27 0.07 11072607 0.00 0.00 tree_newKid
0.03 274.34 0.07 198268 0.00 0.00 TL_TLS_TLSIND
0.02 274.40 0.06 11072356 0.00 0.00 tree_makeOneKid
0.02 274.45 0.05 parser_installAsId
0.01 274.49 0.04 3218220 0.00 0.00 parser_backup
Would be nice to have txl
printing the line numbers while parsing when we specify -v -p
, something like every x lines (user define number of lines would be nice too):
Now parsing line 1000 accumulated time 0.2s
Now parsing line 2000 accumulated time 0.3s
Now parsing line 3000 accumulated time 0.4s
Now parsing line 3000 accumulated time 0.5s
Now parsing line 3000 accumulated time 0.6s
Now parsing line 3000 accumulated time 1.6s //showing that it got stuck around here ???
After commenting out sqlite3.c
in several places only to see txl
parsing till the end or failing, I've got txl
to parse till the end and failing with an error message in 9.6s that's a reasonable time using 360MB of memory using the same grammar attached before.
The point of this issue is to discover why txl
is not failing fast to allow more iterations till fix the grammar or anything else blocking the intended use.
`` /usr/bin/time txl -p -v -s 1000 sqlie3.c c2.txl [c2.txl] : TXL0911W - (Warning) Stack limit less than recommended for TXL size (probable cause: shell stack limit) [c2.txl] : TXL0913I - (Information) Recursion stack limit reduced from 51200000 to 7340032 to fit OpenTxl v11.3.2 (1.4.24) (c) 2024 James R. Cordy and others Bootstrapping TXL ... ... used 401 trees and 251 kids. Scanning the TXL program c2.txl Parsing the TXL program ... used 10975 trees and 16420 kids. Making the input language grammar tree ... used 1332 trees and 1563 kids. Making the rule table ... used 2 trees and 0 kids. Scanning the input file sqlite3.c Parsing the input file [sqlite3.c, c2.txl] : TXL0127E input file 'sqlite3.c' - Parse time limit (401000000 cycles) exceeded [sqlite3.c, c2.txl] : TXL0191E at end of sqlite3.c - Syntax error at end of: { return SQLITE_SOURCE_ID ; } Command exited with non-zero status 1 9.60user 0.13system 0:09.73elapsed 99%CPU (0avgtext+0avgdata 360696maxresident)k 0inputs+0outputs (0major+89829minor)pagefaults 0swaps
Attached `sqlite3.c` with several lines commented out and can be diffed with the previous one if desired:
[sqlite3-hacked.zip](https://github.com/CordyJ/OpenTxl/files/14918535/sqlite3-hacked.zip)
Somehow the attachment failed in thre previous message. sqlite3-hacked.zip
Looking through the source code https://github.com/CordyJ/OpenTxl/blob/c289dec190558bd085de3e6a7e94196d731f9d93/src/parse.ch#L452 and https://github.com/CordyJ/OpenTxl/blob/c289dec190558bd085de3e6a7e94196d731f9d93/src/options.i#L544 now I can see that there is more debugging options that are not documented in https://github.com/CordyJ/OpenTxl/blob/main/docs/TXL-ug-11.3.html .
And using -Dstack
I can see that the parser seem to get stuck at the end:
/usr/bin/time txl -p -v -s 1000 -Dstack sqlie3.c c2.txl
[c2.txl] : TXL0911W - (Warning) Stack limit less than recommended for TXL size (probable cause: shell stack limit)
[c2.txl] : TXL0913I - (Information) Recursion stack limit reduced from 51200000 to 7340032 to fit
OpenTxl v11.3.2 (1.4.24) (c) 2024 James R. Cordy and others
Bootstrapping TXL ...
... used 401 trees and 251 kids.
Scanning the TXL program c2.txl
Parsing the TXL program
... used 10975 trees and 16420 kids.
Making the input language grammar tree
... used 1332 trees and 1563 kids.
Making the rule table
... used 2 trees and 0 kids.
Scanning the input file sqlite3.c
Parsing the input file
[sqlite3.c, c2.txl] : TXL0127E input file 'sqlite3.c' - Parse time limit (401000000 cycles) exceeded
=== Parse Stack Dump ===
repeat_0_block_item 1064820 const
repeat_0_block_item 1064820 const ////<<<<< 1855 lines equal this one
...
repeat_0_block_item 1064820 const
opt__block_item_list 0
declaration_or_statement 709526 SQLITE_PRIVATE
repeat_0_block_item 709526 SQLITE_PRIVATE
repeat_0_block_item 707685 SQLITE_PRIVATE
repeat_0_block_item 707647 static
repeat_0_block_item 707599 static
...
repeat_0_function_definition_or_declaration 1 #define SQLITE_CORE 1
*undefined 1 #define SQLITE_CORE 1
=== ===
[sqlite3.c, c2.txl] : TXL0191E at end of sqlite3.c - Syntax error at end of:
{ return SQLITE_SOURCE_ID ; }
Command exited with non-zero status 1
9.69user 0.18system 0:09.87elapsed 99%CPU (0avgtext+0avgdata 360828maxresident)k
0inputs+0outputs (0major+89832minor)pagefaults 0swaps
Adding line number to the stack trace to help identify the problem:
@@ -406,7 +411,7 @@ body module parser
for decreasing i : parseDepth .. 0
put : 0, string@(ident.idents (parseStack (i))), " ", parseTokenIndex (i) ..
if parseTokenIndex (i) not= 0 then
- put : 0, " ", string@(ident.idents (inputTokens (parseTokenIndex (i)).token)) ..
+ put : 0, ":", intstr (inputTokens (parseTokenIndex (i)).linenum mod maxLines, 1), " ", string@(ident.idents (inputTokens (parseTokenIndex (i)).token)) ..
end if
put :0, ""
end for
Then the output now is like this:
/usr/bin/time txl -p -v -s 1000 -Dstack sqlie3.c c2.txl
[c2.txl] : TXL0911W - (Warning) Stack limit less than recommended for TXL size (probable cause: shell stack limit)
[c2.txl] : TXL0913I - (Information) Recursion stack limit reduced from 51200000 to 7340032 to fit
OpenTxl v11.3.2 (1.4.24) (c) 2024 James R. Cordy and others
Bootstrapping TXL ...
... used 401 trees and 251 kids.
Scanning the TXL program c2.txl
Parsing the TXL program
... used 10975 trees and 16420 kids.
Making the input language grammar tree
... used 1332 trees and 1563 kids.
Making the rule table
... used 2 trees and 0 kids.
Scanning the input file sqlite3.c
Parsing the input file
[sqlite3.c, c2.txl] : TXL0127E input file 'sqlite3.c' - Parse time limit (401000000 cycles) exceeded
=== Parse Stack Dump ===
repeat_0_block_item 1064820:257072 const
repeat_0_block_item 1064820:257072 const ////<<<<< 1855 lines equal this one
...
repeat_0_block_item 1064820:257072 const
opt__block_item_list 0
declaration_or_statement 709526:178574 SQLITE_PRIVATE
repeat_0_block_item 709526:178574 SQLITE_PRIVATE
repeat_0_block_item 707685:178248 SQLITE_PRIVATE
...
repeat_0_function_definition_or_declaration 1:25 #define SQLITE_CORE 1
*undefined 1:25 #define SQLITE_CORE 1
=== ===
[sqlite3.c, c2.txl] : TXL0191E at end of sqlite3.c - Syntax error at end of:
{ return SQLITE_SOURCE_ID ; }
Command exited with non-zero status 1
9.97user 0.19system 0:10.16elapsed 99%CPU (0avgtext+0avgdata 360784maxresident)k
0inputs+0outputs (0major+89831minor)pagefaults 0swaps
And this is the piece of code that make txl
with c2.grm
going crazy:
/*
** Run the parser on the given SQL string.
*/
SQLITE_PRIVATE int sqlite3RunParser(Parse *pParse, const char *zSql){
int nErr = 0; /* Number of errors encountered */
void *pEngine; /* The LEMON-generated LALR(1) parser */
int n = 0; /* Length of the next token token */
int tokenType; /* type of the next token */
int lastTokenParsed = -1; /* type of the previous token */
sqlite3 *db = pParse->db; /* The database connection */
int mxSqlLen; /* Max length of an SQL string */
Parse *pParentParse = 0; /* Outer parse context, if any */
#ifdef sqlite3Parser_ENGINEALWAYSONSTACK
yyParser sEngine; /* Space to hold the Lemon-generated Parser object */
#endif
VVA_ONLY( u8 startedWithOom = db->mallocFailed );
assert( zSql!=0 );
mxSqlLen = db->aLimit[SQLITE_LIMIT_SQL_LENGTH];
if( db->nVdbeActive==0 ){
AtomicStore(&db->u1.isInterrupted, 0);
}
pParse->rc = SQLITE_OK;
pParse->zTail = zSql;
#ifdef SQLITE_DEBUG
if( db->flags & SQLITE_ParserTrace ){
printf("parser: [[[%s]]]\n", zSql);
sqlite3ParserTrace(stdout, "parser: ");
}else{
sqlite3ParserTrace(0, 0);
}
#endif
#ifdef sqlite3Parser_ENGINEALWAYSONSTACK
pEngine = &sEngine;
sqlite3ParserInit(pEngine, pParse);
#else
pEngine = sqlite3ParserAlloc(sqlite3Malloc, pParse);
if( pEngine==0 ){
sqlite3OomFault(db);
return SQLITE_NOMEM_BKPT;
}
#endif
assert( pParse->pNewTable==0 );
assert( pParse->pNewTrigger==0 );
assert( pParse->nVar==0 );
assert( pParse->pVList==0 );
pParentParse = db->pParse;
db->pParse = pParse;
while( 1 ){
n = sqlite3GetToken((u8*)zSql, &tokenType);
mxSqlLen -= n;
if( mxSqlLen<0 ){
pParse->rc = SQLITE_TOOBIG;
pParse->nErr++;
break;
}
#ifndef SQLITE_OMIT_WINDOWFUNC
if( tokenType>=TK_WINDOW ){
assert( tokenType==TK_SPACE || tokenType==TK_OVER || tokenType==TK_FILTER
|| tokenType==TK_ILLEGAL || tokenType==TK_WINDOW
|| tokenType==TK_QNUMBER
);
#else
if( tokenType>=TK_SPACE ){
assert( tokenType==TK_SPACE || tokenType==TK_ILLEGAL
|| tokenType==TK_QNUMBER
);
#endif /* SQLITE_OMIT_WINDOWFUNC */
if( AtomicLoad(&db->u1.isInterrupted) ){
pParse->rc = SQLITE_INTERRUPT;
pParse->nErr++;
break;
}
if( tokenType==TK_SPACE ){
zSql += n;
continue;
}
if( zSql[0]==0 ){
/* Upon reaching the end of input, call the parser two more times
** with tokens TK_SEMI and 0, in that order. */
if( lastTokenParsed==TK_SEMI ){
tokenType = 0;
}else if( lastTokenParsed==0 ){
break;
}else{
tokenType = TK_SEMI;
}
n = 0;
#ifndef SQLITE_OMIT_WINDOWFUNC
}else if( tokenType==TK_WINDOW ){
assert( n==6 );
tokenType = analyzeWindowKeyword((const u8*)&zSql[6]);
}else if( tokenType==TK_OVER ){
assert( n==4 );
tokenType = analyzeOverKeyword((const u8*)&zSql[4], lastTokenParsed);
}else if( tokenType==TK_FILTER ){
assert( n==6 );
tokenType = analyzeFilterKeyword((const u8*)&zSql[6], lastTokenParsed);
#endif /* SQLITE_OMIT_WINDOWFUNC */
}else if( tokenType!=TK_QNUMBER ){
Token x;
x.z = zSql;
x.n = n;
sqlite3ErrorMsg(pParse, "unrecognized token: \"%T\"", &x);
break;
}
}
pParse->sLastToken.z = zSql;
pParse->sLastToken.n = n;
sqlite3Parser(pEngine, tokenType, pParse->sLastToken);
lastTokenParsed = tokenType;
zSql += n;
assert( db->mallocFailed==0 || pParse->rc!=SQLITE_OK || startedWithOom );
if( pParse->rc!=SQLITE_OK ) break;
}
assert( nErr==0 );
#ifdef YYTRACKMAXSTACKDEPTH
sqlite3_mutex_enter(sqlite3MallocMutex());
sqlite3StatusHighwater(SQLITE_STATUS_PARSER_STACK,
sqlite3ParserStackPeak(pEngine)
);
sqlite3_mutex_leave(sqlite3MallocMutex());
#endif /* YYDEBUG */
#ifdef sqlite3Parser_ENGINEALWAYSONSTACK
sqlite3ParserFinalize(pEngine);
#else
sqlite3ParserFree(pEngine, sqlite3_free);
#endif
if( db->mallocFailed ){
pParse->rc = SQLITE_NOMEM_BKPT;
}
if( pParse->zErrMsg || (pParse->rc!=SQLITE_OK && pParse->rc!=SQLITE_DONE) ){
if( pParse->zErrMsg==0 ){
pParse->zErrMsg = sqlite3MPrintf(db, "%s", sqlite3ErrStr(pParse->rc));
}
sqlite3_log(pParse->rc, "%s in \"%s\"", pParse->zErrMsg, pParse->zTail);
nErr++;
}
pParse->zTail = zSql;
#ifndef SQLITE_OMIT_VIRTUALTABLE
sqlite3_free(pParse->apVtabLock);
#endif
if( pParse->pNewTable && !IN_SPECIAL_PARSE ){
/* If the pParse->declareVtab flag is set, do not delete any table
** structure built up in pParse->pNewTable. The calling code (see vtab.c)
** will take responsibility for freeing the Table structure.
*/
sqlite3DeleteTable(db, pParse->pNewTable);
}
if( pParse->pNewTrigger && !IN_RENAME_OBJECT ){
sqlite3DeleteTrigger(db, pParse->pNewTrigger);
}
if( pParse->pVList ) sqlite3DbNNFreeNN(db, pParse->pVList);
db->pParse = pParentParse;
assert( nErr==0 || pParse->rc!=SQLITE_OK );
return nErr;
}
Hi, in order to allow the simplest and easiest to understand grammars that simple reflect the language's reference grammar, so that transformation patterns are simple and intuitive to read and write, TXL purposely uses a context-free full-backtracking top-down recursive descent parser. (A bison-style "compiler" grammar would be simply unusable for example-based pattern transformation tasks.)
Since full backtracking is unbounded and exponential in the worst case, depending on how ambiguous the language's grammar is, syntax errors can and do get detected by resource exhaustion. It's better than simply freezing in full CPU for days or weeks. (As AI folks say, on a clear day you can backtrack forever.) So what you observed is normal for TXL.
Most modern languages have good designs with unambiguous syntax, so a TXL grammar for them is well behaved and you'll never run into this problem. C on the other hand is wildly ambiguous. Simple example: is x;
a declaration, or a statement? (It's both, depending on the context.) Technically, C is not actually parseable context-free. c.grm
does a best-effort approximation that works almost all of the time, but it still misparses some things.
A problem with backtracking recursive-descent parsers is that when they fail, it's difficult to tell if it's a real syntax error in the input source, or if the grammar is incorrect or missing some syntax that the input source uses.
C is also a multi-level language, with a preprocessor that's interpreted as text lines and can't be explained by any grammar, followed by a context-dependent scanner (!) for lexical tokens that feeds back from the parser. TXL needs to deal with the entire source for transformations to be useful, more or less all at once, which introduces even more ambiguity into the grammar.
Finally, I should point out that TXL is not intended to be a syntax checker. It is designed to process large collections of production source code in order to analyze or transform it. So syntax errors should be a very rare thing, and almost always they are because the grammar needs updating.
By the way, to process C code with preprocessor statements in it, you should be using a combination of ifdef.txl
with c.txl
. c.grm
is not intended to parse C preprocessor code, it's not a compiler.
Thank you again for reply !
I can understand your point, but I also think that this issue/grammar/c_source is giving a good opportunity to review and possibly enhance txl
behavior with bad grammars.
Like now that I isolated one small portion (https://github.com/CordyJ/OpenTxl/issues/32#issuecomment-2045208612) of sqlite3.c
that exhibits the Parse time limit (52200000 cycles) exceeded
and looking at it's stack trace I can see it stook in the same line (at the end of the file) without making any progress and when I look in src/parse.ch
I can see that there is some intention to detect a scenario like the one mentioned here
https://github.com/CordyJ/OpenTxl/blob/c289dec190558bd085de3e6a7e94196d731f9d93/src/parse.ch#L80-L82
so maybe this issue/example can help review if the code that supposedly already exists can be enhanced/fixed to catch issues like this one.
/usr/bin/time txl -p -v -s 1000 -Dstack trouble.c c2.txl
OpenTxl v11.3.2 (1.4.24) (c) 2024 James R. Cordy and others
Bootstrapping TXL ...
... used 401 trees and 251 kids.
Scanning the TXL program c2.txl
Parsing the TXL program
... used 10975 trees and 16420 kids.
Making the input language grammar tree
... used 1332 trees and 1563 kids.
Making the rule table
... used 2 trees and 0 kids.
Scanning the input file test.c
Parsing the input file
[test.c, c2.txl] : TXL0127E input file 'test.c' - Parse time limit (52200000 cycles) exceeded
=== Parse Stack Dump ===
macro_argument 0
repeat_0_macro_argument 849:156 pParse
repeat_0_macro_argument 849:156 pParse
repeat_0_macro_argument 849:156 pParse
repeat_0_macro_argument 849:156 pParse
macro_declaration_or_statement 0
declaration_or_statement 0
repeat_0_block_item 855:156 ;
repeat_0_block_item 855:156 ;
repeat_0_block_item 855:156 ;
repeat_0_block_item 855:156 ;
repeat_0_block_item 855:156 ;
repeat_0_block_item 855:156 ;
repeat_0_block_item 855:156 ;
repeat_0_block_item 855:156 ;
repeat_0_block_item 855:156 ;
repeat_0_block_item 855:156 ;
repeat_0_block_item 855:156 ;
repeat_0_block_item 855:156 ;
repeat_0_block_item 855:156 ;
repeat_0_block_item 855:156 ;
repeat_0_block_item 855:156 ;
repeat_0_block_item 855:156 ;
repeat_0_block_item 855:156 ;
repeat_0_block_item 855:156 ;
repeat_0_block_item 855:156 ;
repeat_0_block_item 855:156 ;
repeat_0_block_item 855:156 ;
repeat_0_block_item 855:156 ;
repeat_0_block_item 855:156 ;
repeat_0_block_item 855:156 ;
repeat_0_block_item 855:156 ;
opt__block_item_list 0
sub_statement 0
while_statement 0
structured_statement 0
unlabeled_statement 0
statement 0
declaration_or_statement 0
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
repeat_0_block_item 860:159 empty
opt__block_item_list 0
function_definition_or_declaration 1:4 SQLITE_PRIVATE
repeat_0_function_definition_or_declaration 1:4 SQLITE_PRIVATE
*undefined 1:4 SQLITE_PRIVATE
=== ===
[test.c, c2.txl] : TXL0191E at end of test.c - Syntax error at end of:
; return nErr ; }
Command exited with non-zero status 1
1.39user 0.02system 0:01.42elapsed 99%CPU (0avgtext+0avgdata 34576maxresident)k
0inputs+0outputs (0major+8285minor)pagefaults 0swaps
By the way, changing the size never improves parsing. You should use the default size.
Thank you again for reply ! I can understand your point, but I also think that this issue/grammar/c_source is giving a good opportunity to review and possibly enhance
txl
behavior with bad grammars.
This problem has been studied extensively for over 30 years and there is no general solution any better than the one we have. If we tighten the limits and/or cutoffs, we reduce the range of grammars and languages we can handle.
While trying to parse
sqlite3.c
I'm noticing that when the grammar get lost while parsingtxl
seems to enter in a kind of heavy loop trying to maybe recover till it gives up.With full
sqlite3.c
it takes 149MB and 159s to show an error message. With cutsqlite3s.c
it takes 54MB and 0.75s to complete the parsing.Example (after modify
sqlite3.c
to go keep going parsing):Now with a cut down of
sqlite3.c
till containing the first 45073 lines (a bit less than the error line reported above):See attached the files used to test: sqlite-c18.zip