francoisthire opened 5 years ago
The current behavior of Lambdapi is: 1. parse the file to generate a list of entries, 2. type check the file.
This is becoming an issue with the importation of Isabelle/HOL into Lambdapi. For example, @01mf02 is generating files of 250 MB. In this case, the AST gets big and memory consumption explodes at the parsing stage.
For type checking, not all the entries need to be in memory: type checking could be done entry by entry. Sure, you need to store something to type check the other entries, but in HOL, for example, the proofs can be thrown away.
The problem is in https://github.com/Deducteam/lambdapi/blob/42a0313bc81a7092ad7988cf7ac7f5006a9b41d4/src/compile.ml#L64.
Instead, Dedukti reads files line by line: https://github.com/Deducteam/Dedukti/blob/49e445420885c11b75d161735258ccef19ca2153/parsing/parser.ml#L17-L20.
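For illustration, a driver in that style looks roughly like the following sketch. Menhir_parser.line, Lexer.token and handle_entry are hypothetical names standing in for Dedukti's actual ones, and we assume the lexer signals the end of input with End_of_file:

(* Sketch of a Dedukti-style driver loop: parse one command, hand it
   off for processing, repeat. Menhir_parser.line, Lexer.token and
   handle_entry are hypothetical names; we assume the lexer raises
   End_of_file when the input is exhausted. *)
let process_file handle_entry ic =
  let lexbuf = Lexing.from_channel ic in
  try
    while true do
      handle_entry (Menhir_parser.line Lexer.token lexbuf)
    done
  with End_of_file -> ()

This way, at most one command is in memory at a time (plus whatever the handler decides to keep).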
How about making parse_file return a Seq instead of a list? I've used this technique in the past, and it yielded very good results performance-wise.
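Concretely, the same idea exposed as a lazy sequence could look like this sketch, where parse_entry is a hypothetical function that parses a single command:

(* Sketch: expose parsing as a lazy sequence of entries. parse_entry
   is a hypothetical function that parses one command from the lexer
   buffer, or returns None at end of input. Since Seq.t is lazy, the
   consumer pulls entries one at a time instead of forcing the whole
   list into memory. *)
let rec parse_file_seq (parse_entry : Lexing.lexbuf -> 'entry option)
    (lexbuf : Lexing.lexbuf) : 'entry Seq.t =
  fun () ->
    match parse_entry lexbuf with
    | Some entry -> Seq.Cons (entry, parse_file_seq parse_entry lexbuf)
    | None -> Seq.Nil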
Well, the memory remains linear in the size of the file, so I would not say that it "explodes". And even if we parse line by line that will remain the case since we typically store a new symbol or rewriting rule at each step.
The bigger problem is that Earley cannot handle files of that (unreasonable) size. And the Lambdapi syntax is definitely not meant for that kind of application.
Anyway, what do your big files look like? Are they formed of many short commands? A few big commands? A mix of both?
Why is parsing a file line by line a problem with Earley?
That is doable, I guess; it's just more complicated code.
In the long term, this will be an issue anyway. When terms are generated from a program, in general we don't need all the machinery provided by Earley. Instead, we only need a simple syntax with maybe holes for implicit parameters.
@01mf02, could you please provide a URL to your file so that we can have a look at it?
Certainly, @fblanqui. Here is the output for HOL.Groups, which Dedukti is able to parse on my machine (typechecking still fails ATM), and here is the output for HOL.Hilbert_Choice, on which Lambdapi currently chokes.
A rough estimate (number of lines / 3) suggests that the larger file (for HOL.Hilbert_Choice) contains about 35,000 commands.
Would it be possible to create an alternative version of the list1 function in earley.ml that lazily creates a Seq (which Lambdapi could then consume element by element as soon as a command has been parsed)? Or would this not work due to Earley's backtracking?
For reference, the current list1:
let list1 g sep =
  fsequence g
    (apply (fun xs x -> x :: xs [])
       (fixpoint' (fun l -> l)
          (fsequence_ignore sep g)
          (fun x f l -> f (x :: l))))
(I tried to adapt list1 myself, but I admit that I failed to produce something meaningful that typechecks. In particular, I do not understand the meaning of xs [].)
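For the record, a plausible reading of that code (an interpretation, not a confirmed account of Earley's internals): fixpoint' seems to accumulate the parsed tail as a difference list, i.e. a function of type 'a list -> 'a list, and xs [] applies it to the empty list to recover an ordinary list. A minimal standalone illustration of the idiom:

(* A difference list represents a list as a function that prepends
   its elements to whatever list it is given; appending two of them
   is mere function composition, and applying one to [] converts it
   back to a plain list. *)
let xs : int list -> int list = fun l -> 1 :: 2 :: 3 :: l
let ys : int list -> int list = fun l -> 4 :: 5 :: l
let zs : int list -> int list = fun l -> xs (ys l) (* O(1) append *)
let as_list = zs [] (* [1; 2; 3; 4; 5] *)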
Hi @01mf02. I see that, in your big file, you use definition proofXXX : A := p for what looks like proofs. Couldn't you instead do the following: theorem thmXXX : A proof refine p qed? (I checked on some example and it worked.)
To make the output slightly more readable, you can also replace λ (x:A) , t by λx:A, t, eta by η, and eps by ε.
You could also try to remove the type annotations in λ's inside proofs.
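Laid out over several lines, the suggested change to the generated output would look roughly like this (thmXXX, A and p stand for the generated name, statement and proof term; the comment syntax is illustrative):

// current form: the proof term is kept as a definition
definition proofXXX : A := p

// suggested form: the proof is checked and can then be discarded
theorem thmXXX : A
proof
  refine p
qed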
@01mf02 the problem does not come from a particular combinator, nor from the fact that we do not parse the input file command by command. The huge memory consumption is due to a known issue with the parsing engine, not to the fact that we must load the whole AST into memory. Transitioning to a Seq.t won't change a thing on that front.
As I told @fblanqui a while back, I am thinking of replacing Earley with a new tool that we are developing with Christophe Raffalli. However, I don't have time, and most importantly this new tool is not yet ready. As far as I know, there is no reasonable alternative to Earley and our new tool that would support the kind of syntax we want (infix symbols, notations, ...). In particular, Menhir does not allow that kind of thing (François Pottier confirmed that to me a few days ago).
@rlepigre Before we get your new parser, could you please upgrade the current version of Earley so that we can compile LP with the most recent version of OCaml? It should not take long, should it?
@01mf02 As this problem may persist for quite some time, could you please try to develop an incomplete but more efficient parser for LP that we could optionally use for big generated files, following what is done in menhir_parser.mly and Dedukti? Incomplete because it would not support infix symbols and tactics. In parallel, it would be good anyway to try to split such a big file into smaller ones following the Isabelle graph. According to Makarius, it shouldn't be too difficult.
@fblanqui it is not as easy as you may think. I started, but I can't spend time on that currently. Hopefully that will be taken care of by Christmas.
I don't think it is a good idea to develop yet another parser. It would be far better to split the files, which would definitely be possible. In any case, it is not reasonable to work with such big files. Even inspecting them with editors poses problems.
I think Earley should be able to reasonably deal with files up to 10 MB.
> @01mf02 the problem does not come from a particular combinator, nor from the fact that we do not parse the input file command by command. The huge memory consumption is due to a known issue with the parsing engine, not to the fact that we must load the whole AST into memory. Transitioning to a Seq.t won't change a thing on that front.
There might be a misunderstanding here. My problem was not the memory consumption. My problem was that typechecking starts only once the whole file has been parsed. I would like typechecking to be performed after every command that has been parsed.
Graphically, the current situation of running Lambdapi on some file is as follows, where P corresponds to parsing a command and C corresponds to executing some command:
PPPPP ... CCCCC ...
And here's how I would like it to happen:
PCPCPCPCPC ...
Indeed, this does not make a big difference if the file correctly typechecks. However, it can make a difference when typechecking fails, in which case my proposed scheme will be faster in almost all situations.
This being said, I now generate smaller files, so this does not have a high priority for me. Still, if what I propose required only a reasonably small change in Lambdapi (e.g., replacing List by Seq in two places, as sketched below), then I think it would be worthwhile.
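To make the "two places" concrete, here is a rough sketch of the shape of the change on the consumer side (handle_entry, parse_file and parse_file_seq are placeholder names, not Lambdapi's actual functions):

(* Current shape: the whole file is parsed before any command runs. *)
let compile_eager file =
  let entries = parse_file file in (* returns a list *)
  List.iter handle_entry entries   (* PPPPP ... CCCCC ... *)

(* Proposed shape: entries are pulled lazily, so each command is
   executed as soon as it has been parsed, and a failure stops the
   run without parsing the rest of the file. *)
let compile_lazy file =
  let entries = parse_file_seq file in (* returns a Seq.t *)
  Seq.iter handle_entry entries        (* PCPCPCPCPC ... *)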
Fixed in #516 except in the LSP server.