Closed: cool-RR closed this issue 5 years ago
This could be handled via lexer.token() written as a generator function (#188):

lexer = lex.lex()
lexgen = lexer.generator(data)  # e.g. a 1 GB file; same as `lexer.input`, but returns an iterator

We should also be able to call parser.parse(iterator=lexgen) without the data and lexer arguments, passing only an iterator that returns the next token on each lexgen.next() call.
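A minimal sketch of what such a lexer-as-generator could look like. The token names, the chunk-refill protocol, and token_generator itself are all illustrative, not part of PLY's API (though note that PLY's yacc.parse() does already accept a tokenfunc callable, so a generator could be adapted to it today):

```python
import io
import re

# Hypothetical sketch: a lexer written as a generator that reads a stream
# in chunks and yields (type, value) pairs, never holding the whole input.
TOKEN_RE = re.compile(r"\s*(?:(?P<NUMBER>\d+)|(?P<NAME>[A-Za-z_]\w*)|(?P<OP>[+\-*/()]))")

def token_generator(stream, chunk_size=64):
    buf = ""
    pos = 0
    eof = False
    while True:
        # Refill when we are near the end of the buffer; this assumes every
        # token is shorter than chunk_size, which is enough for a sketch.
        while not eof and len(buf) - pos < chunk_size:
            chunk = stream.read(chunk_size)
            if chunk == "":
                eof = True
            else:
                buf = buf[pos:] + chunk
                pos = 0
        m = TOKEN_RE.match(buf, pos)
        if not m:
            return  # only trailing whitespace (or nothing) left
        pos = m.end()
        for name, value in m.groupdict().items():
            if value is not None:
                yield (name, value)
```

Usage: `list(token_generator(io.StringIO("12 + count")))` produces `[('NUMBER', '12'), ('OP', '+'), ('NAME', 'count')]`, while only ever buffering about one chunk of text.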
All the magic of partially reading a really large file or an infinite stream must live in lexer.generator(): every token must be produced with the yield keyword, not return. That way you never need to read all the data up front (for a stream that is impossible in principle). But the regexp library that the ply.lex module uses must then support a data-acquisition callback: whenever the regexp finite-state automaton reaches the end of the available data before reaching a final state, the callback is invoked to pull in the next portion of unparsed input.
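The callback idea can be sketched on top of the stdlib re module: keep only a bounded window of the input, and whenever a match runs up against the end of the buffer, treat it as possibly incomplete and call an acquire() callback for more data before committing. stream_match and acquire are illustrative names, not anything in ply.lex:

```python
import re

def stream_match(pattern, acquire):
    """Yield successive matches of `pattern`; `acquire()` returns the
    next chunk of data, or '' at end of stream."""
    regex = re.compile(pattern)
    buf = ""
    pos = 0
    eof = False
    while True:
        m = regex.match(buf, pos)
        if m is None or (m.end() == len(buf) and not eof):
            # The automaton hit end-of-data before we can trust the match:
            # suck in the next portion of unparsed data and retry.
            chunk = acquire()
            if chunk:
                buf = buf[pos:] + chunk
                pos = 0
                continue
            eof = True
            m = regex.match(buf, pos)
            if m is None:
                return
        yield m.group()
        pos = m.end()
```

Note how a match that ends exactly at the buffer boundary is re-tried after refilling: `"12 3"` followed by `"45 6"` yields `"12"`, then `" 345"` (the `3` and `45` were one number split across chunks), then `" 6"` for the pattern `r"\s*\d+"`.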
If we are parsing large files (say, the full Linux source tree) we can take short blocks of parsed data and process them sequentially, one by one. But with true stream parsing we can end up with an exception when we have seen an opening bracket and never receive the closing one.
The PLY lexer uses the re module for tokenizing. re can only match against in-memory strings, so it is very difficult to rewrite the lexer in streaming mode. Parsing streams would require either completely rewriting the regex engine or finding an alternative regex engine that can match over a stream.
I have a 1GB text file that I'd like to parse with PLY, but I don't want to load it all into memory. I wish PLY could parse a stream, and keep in RAM only the currently processed tokens.
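One stdlib workaround for the 1 GB case, without changing PLY or re: bytes patterns in re can match over any buffer-like object, including an mmap, so the OS pages the file in and out and only the pages currently being scanned sit in RAM. A small sketch (the token pattern and function name are illustrative):

```python
import mmap
import re

def iter_tokens(path, pattern=rb"\w+"):
    """Yield raw token matches from a file without reading it into memory."""
    with open(path, "rb") as f:
        # Map the whole file read-only; re walks the mapping and the OS
        # pages data in on demand, so memory use stays bounded.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for m in re.finditer(pattern, mm):
                yield m.group()
```

This only covers the lexing side (and yields bytes, not str), but it is one way to tokenize a 1 GB file today while keeping resident memory small; the parser still sees one token at a time.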