dlang-community / Pegged

A Parsing Expression Grammar (PEG) module, using the D programming language.

Reading input from the input range (or file) #261

Open p-mitana opened 5 years ago

p-mitana commented 5 years ago

I am trying to work with big files (SQL files ~9MB in size). I have a grammar that defines a single SQL instruction (sort of). I would like to parse the instructions from the input file one by one and avoid reading the entire file into memory with readText, but it seems that this is currently impossible.

veelo commented 5 years ago

If you can split your input into instructions outside of the grammar, you can read your file however you like and let your parser parse each instruction individually.

p-mitana commented 5 years ago

If. However, with SQL I can't reasonably do that, at least not unless I write a separate lexer that splits instructions on semicolons that are not part of strings.
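
For illustration, the separate splitting pass described above can be fairly small. The following is a minimal sketch in Python rather than D, so it shows only the idea and uses nothing from Pegged. The function name `split_statements` is hypothetical, and the sketch assumes that strings are delimited by single or double quotes with doubled quotes as the only escape. It does not handle SQL comments, backslash escapes, or dialect-specific quoting.

```python
def split_statements(chars):
    """Yield SQL statements from an iterable of characters.

    Splits on semicolons that are outside quoted strings. Assumption:
    quotes are ' or ", and a doubled quote is the only escape form.
    """
    buf = []
    quote = None  # the quote character we are currently inside, if any
    for ch in chars:
        buf.append(ch)
        if quote is not None:
            # Inside a string: only the matching quote ends it. A doubled
            # quote closes and immediately reopens, which nets out correctly.
            if ch == quote:
                quote = None
        elif ch in ("'", '"'):
            quote = ch
        elif ch == ";":
            stmt = "".join(buf).strip()
            if stmt:
                yield stmt
            buf = []
    # Whatever remains has no terminating semicolon.
    tail = "".join(buf).strip()
    if tail:
        yield tail
```

Because it consumes any iterable of characters, it can be fed from a file one character at a time, for example with `iter(lambda: f.read(1), "")`, so only one statement is held in memory at once.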

As parsing does not require the entire input at once (it looks char by char anyway), I believe that reading from an input range is an important feature for a parsing library.

veelo commented 5 years ago

> As parsing does not require the entire input at once

But it does. A rule can only succeed once all its sub-rules succeed. The top rule cannot succeed before the entire input has been read.

veelo commented 5 years ago

I don't remember what SQL looks like, but if it is basically a list of instructions and the parser does not need to do much backtracking, you may be able to define your grammar in a way that input after the first instruction is discarded (Instruction .* eoi). Then you may be able to read a portion of your file that is guaranteed to be large enough for any instruction, parse that, then advance your moving-window buffer by the length of the parsed input. This way you will process your file instruction by instruction.
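
A rough sketch of the moving-window idea, written in Python only to show the control flow (it is not Pegged code). The names `parse_stream`, `parse_one` and `toy_parse_one` are hypothetical: `parse_one` stands in for whatever parser is used and is assumed to return the number of characters consumed by one instruction at the start of the text, or `None` if no complete instruction is present yet. Instead of guessing a window that is "large enough", this variant grows the window and retries whenever parsing fails.

```python
import io

def parse_stream(stream, parse_one, chunk_size=4096):
    """Yield the text of each instruction, reading `stream` only as needed."""
    buf = ""
    while True:
        consumed = parse_one(buf) if buf else None
        if consumed:  # None or 0 both count as "no instruction yet"
            yield buf[:consumed]
            buf = buf[consumed:]  # slide the window past the parsed input
            continue
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of input and nothing more can be parsed
        buf += chunk  # grow the window and retry
    if buf.strip():
        raise ValueError("unparsed trailing input: %r" % buf[:40])

def toy_parse_one(text):
    """Stand-in parser: an instruction is everything up to the first ';'."""
    end = text.find(";")
    return end + 1 if end >= 0 else None
```

For example, `list(parse_stream(io.StringIO("SELECT 1;\nSELECT 2;\n"), toy_parse_one, chunk_size=4))` walks the input in 4-character reads. One caveat: this only works if the parser cannot succeed on a truncated instruction, which holds when every instruction must end with an explicit terminator such as a semicolon.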

p-mitana commented 5 years ago

It depends.

If I had a rule that parses the entire SQL file at once, then yes: it will succeed only if it reads all the instructions and EOI.

However, I can have a rule that does not end with EOI, such as an SQL instruction. It can succeed multiple times along one input, and it actually does. When I parse a long string, for example:

SELECT * FROM table1;
SELECT * FROM table2;

it will succeed and parse only the first instruction. After reading the first semicolon, the SQLInstruction rule will succeed, and so will all its sub-rules. Then I can cut the input off at the ParseTree's end property and parse again.

As the parser iterates over the string's characters until the root rule either succeeds or fails, without looking further than it needs to, it can read characters from a range for as long as it needs them. The only concern is the lookahead feature, but in this case requiring a ForwardRange and saving the range on lookahead could do the trick.
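
To make the save-on-lookahead idea concrete, here is a small Python sketch of an input wrapper that reads lazily and can rewind to a saved position. It is only an analogy for a D ForwardRange with `save`, not Pegged's implementation, and the class name `BufferedInput` and its methods are hypothetical. It also shows the cost: every character read since the oldest live mark has to stay buffered until the parser commits.

```python
class BufferedInput:
    """Lazily reads characters from an iterator and supports rewinding."""

    def __init__(self, chars):
        self._it = iter(chars)
        self._buf = []  # characters read but not yet committed
        self.pos = 0    # current position within self._buf

    def peek(self):
        """Return the character at the current position, or None at end of input."""
        while self.pos >= len(self._buf):
            try:
                self._buf.append(next(self._it))
            except StopIteration:
                return None
        return self._buf[self.pos]

    def advance(self):
        """Consume and return one character, or None at end of input."""
        ch = self.peek()
        if ch is not None:
            self.pos += 1
        return ch

    def save(self):
        """Return a mark that restore() can rewind to."""
        return self.pos

    def restore(self, mark):
        """Rewind to a previously saved mark (backtracking or lookahead)."""
        self.pos = mark

    def commit(self):
        """Drop everything before the current position; earlier marks become invalid."""
        del self._buf[:self.pos]
        self.pos = 0
```

A parser would call `save()` before trying an alternative or a lookahead, `restore()` on failure, and `commit()` once an instruction has been fully accepted, which is what keeps memory use bounded by the longest instruction rather than the whole file.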

p-mitana commented 5 years ago

> Then you may be able to read a portion of your file that is guaranteed to be large enough for any instruction

Yes, I can do this of course. But I believe it is an overcomplication, as I need to either make an assumption about how long an instruction will be, make several parsing attempts if an instruction is longer than expected, or pre-parse the file and split the instructions from each other. Having the parser library read my data from a range instead of a string would remove this need entirely.

veelo commented 5 years ago

I see. I don't see an easy way to do this, though.

veelo commented 5 years ago

Do you know iopipe? https://www.youtube.com/watch?v=9fzttyj4JCs (I have no personal experience with it, though). If you get a parse error because the instruction is longer than your buffer, you could increase the buffer size and retry.

p-mitana commented 5 years ago

I haven't heard of it yet. May be worth trying someday.

In the case of these SQL files, I will probably have to tackle the problem in a very different way, as it turned out that parsing them (in the future possibly many times bigger than they are now) may consume too much memory.

Anyway, thank you for the help, and I hope that this issue will make its way into pegged sometime :)

denizzzka commented 3 months ago

> If you can split your input into instructions outside of the grammar, you can read your file however you like and let your parser parse each instruction individually.

In this case, line numbering in error messages will be broken, because each instruction is parsed as if it started at line 1.

Hi from 2024 :-)
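
One possible workaround for the line-numbering problem, sketched in Python as a language-neutral illustration (not part of Pegged): remember where each separately parsed chunk starts in the file and translate chunk-local offsets back to file positions. The function name `global_position` and its parameters are hypothetical.

```python
def global_position(base_line, base_col, chunk, offset):
    """Map `offset` within `chunk` to a 1-based (line, column) in the file.

    base_line and base_col give the file position of chunk[0].
    """
    before = chunk[:offset]
    newlines = before.count("\n")
    if newlines == 0:
        # Still on the chunk's first line: columns continue from base_col.
        return base_line, base_col + offset
    # On a later line: the column restarts after the last newline.
    return base_line + newlines, offset - before.rfind("\n")
```

After a chunk has been parsed, calling the same function with `offset=len(chunk)` gives the starting line and column for the next chunk, so error positions reported relative to a chunk can be shifted before they are shown to the user.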