dabeaz / ply

Python Lex-Yacc
http://www.dabeaz.com/ply/index.html

Support parsing a stream #185

Closed: cool-RR closed this issue 5 years ago

cool-RR commented 6 years ago

I have a 1GB text file that I'd like to parse with PLY, but I don't want to load it all into memory. I wish PLY could parse a stream, and keep in RAM only the currently processed tokens.

ponyatov commented 5 years ago

This could be done by writing lexer.token() as a generator function (#188):

lexer = lex.lex()
lexgen = lexer.generator(stream)  # proposed: same as lexer.input(), but takes a stream and returns a token iterator

We would also need the ability to call parser.parse(iterator=lexgen) without the data and lexer arguments, passing instead an iterator that returns the next token on each next(lexgen) call.

All the magic of incrementally reading a very large file or an infinite stream would live in lexer.generator(): every token is produced with the yield keyword, not return, so there is no need to read all of the data first (for a stream that is impossible in principle). But the regex library that ply.lex uses would have to support a data-acquisition callback: whenever the regex engine's finite-state automaton reaches the end of the buffered data before reaching a final state, the callback is invoked to pull in the next portion of unparsed data.
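As a rough sketch of how this could be approximated today on top of the unmodified lexer, without the proposed lexer.generator() or a callback-aware regex engine: the helper below lexes a growing buffer chunk by chunk and holds back any token that ends exactly at a chunk boundary, since it may have been cut in half. stream_tokens and CHUNK are hypothetical names, not PLY API, and the stream is assumed to be text-mode.

import ply.lex as lex

CHUNK = 64 * 1024   # how much input to hold in memory at a time (illustrative)

def stream_tokens(lexer, stream):
    # Hypothetical helper, not part of PLY: yield tokens from a file-like
    # object chunk by chunk instead of loading the whole file.
    buf = ''
    while True:
        chunk = stream.read(CHUNK)
        eof = not chunk
        buf += chunk
        lexer.input(buf)
        last_end = 0
        for tok in iter(lexer.token, None):
            # A token ending exactly at the buffer boundary may have been
            # cut in half by the chunk split; hold it back and re-lex it
            # together with the next chunk, unless this is real end of input.
            if lexer.lexpos == len(buf) and not eof:
                break
            yield tok
            last_end = lexer.lexpos
        buf = buf[last_end:]    # carry the unconsumed tail into the next round
        if eof:
            return

# usage: for tok in stream_tokens(lex.lex(), open('huge.txt')): ...

Memory use stays around one chunk plus any token that straddles a boundary. Two caveats: tok.lexpos is relative to the current buffer rather than the whole stream, and lineno bookkeeping done in t_newline rules can double-count newlines in the re-lexed tail.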

If we are going to parse large files (say the full Linux source tree), we can take short blocks of parsed data and process them sequentially, one by one. But for real stream parsing we must also be prepared to catch an exception in the case where we saw an opening bracket but the closing one never arrives.

Mazdaywik commented 5 years ago

The PLY lexer uses the re module to match tokens, and re can only match against in-memory strings, so it would be very difficult to rewrite the lexer for stream mode. Parsing streams would require either rewriting the regex engine completely or finding an alternative regex engine that can match incrementally.
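For what it's worth, two partial workarounds exist within that constraint. The standard re module can match bytes patterns directly against an mmap'ed file, so the operating system pages the data in and out and Python never holds the whole file in RAM; and the third-party regex package supports partial matching (partial=True), which is roughly the data-acquisition callback described above. A minimal sketch of the mmap route follows; huge.txt and the token pattern are illustrative, not part of PLY.

import mmap
import re

# Sketch only: re can match bytes patterns against any buffer object,
# including an mmap'ed file, so the OS pages data instead of Python
# holding the whole file in RAM.  A real PLY integration would need
# bytes-based token rules.
token_re = re.compile(rb'\d+|[A-Za-z_]\w*|\S')   # illustrative token pattern

with open('huge.txt', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as data:
        for m in token_re.finditer(data):
            print(m.group())   # stand-in for real token handling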