ahirreddy closed this 11 months ago
Can we use a fastparse.ReaderParserInput or IteratorParserInput here? This kind of use case is exactly what those guys are for (see the Streaming Parsing docs), and it would avoid us needing to maintain our own bespoke implementation.
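For reference, a minimal sketch of the streaming-parsing entry point being suggested here (assuming fastparse 2.x; the exact implicit conversions vary a little between versions):

```scala
import fastparse._, NoWhitespace._

// A toy parser, just to show the streaming entry point.
def digits[_: P]: P[Unit] = P(CharsWhileIn("0-9") ~ End)

// Streaming parse: the input arrives as chunks from an Iterator[String],
// so the whole document never has to be materialized as a single String.
val chunks: Iterator[String] = Iterator("123", "456", "789")
println(parse(chunks, digits(_))) // Parsed.Success((), 9)
```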
I tried using that one, but it doesn't support backtracking - which is used at some point when dealing with syntax errors
ReaderParserInput is meant to support backtracking, in that it is supposed to buffer up the input text until it hits a cut that allows it to call dropBuffer. Is that not working for some reason, or is it failing with some error? Or is it simply keeping a larger buffer than is possible with the random-access interface that you are using?
On further thought, I wonder if using a custom ParserInput is giving us any benefit here: do we actually need to avoid reading individual files into memory for parsing, or is the change of the ParseCache key to (path, CRC) sufficient? I would expect that the number of files being parsed at any point in time (O(10s)) would be dwarfed by the number of files that were previously being kept in the parse cache (O(10000s)), so maybe reading the individual files into memory during parsing is fine as long as we don't keep the file contents around afterwards. This is worth testing out, as if true it would make the change trivial and avoid needing to maintain a bunch of fiddly ParserInput/BufferedRandomAccessFile code.
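As a rough illustration of that alternative, here is a minimal sketch of a parse cache keyed on (path, content hash) rather than on the file contents; ParseCacheSketch, Ast and parseFile are hypothetical names, not sjsonnet's actual types:

```scala
import scala.collection.concurrent.TrieMap

object ParseCacheSketch {
  type Ast = AnyRef // stand-in for the real parsed-AST type

  private val cache = TrieMap.empty[(String, Long), Ast]

  // The file contents may well be read into memory inside parseFile, but nothing
  // beyond the (path, hash) key and the resulting AST survives the call.
  def cachedParse(path: String, contentHash: Long, parseFile: String => Ast): Ast =
    cache.getOrElseUpdate((path, contentHash), parseFile(path))
}
```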
ReaderParserInput definitely doesn't support backtracking. It extends BufferedParserInput, for which checkTraceable throws:
```scala
def checkTraceable() = throw new RuntimeException(
  s"Cannot perform `.traced` on an `${getClass.getName}`, as it needs to parse " +
    "the input a second time to collect traces, which is impossible after an " +
    "`IteratorParserInput` is used once and the underlying Iterator exhausted."
)
```
I don't see any code that would make it work for a Reader even if we changed checkTraceable. I think we'd need to call mark(position) on the underlying Reader and then reset() whenever we need to go backwards. That's basically what I did for our use case, just with a RandomAccessFile directly.
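A rough sketch of that "RandomAccessFile with buffered reads" idea (illustrative only, not the PR's actual BufferedRandomAccessFile; single-byte characters are assumed for simplicity):

```scala
import java.io.RandomAccessFile

final class BufferedRandomAccessReader(path: String, bufferSize: Int = 64 * 1024) {
  private val file = new RandomAccessFile(path, "r")
  private val buffer = new Array[Byte](bufferSize)
  private var bufferStart = 0L // file offset of buffer(0)
  private var bufferLen = 0    // number of valid bytes currently in the buffer

  val length: Long = file.length()

  // Refill the in-memory window so that it starts at pos.
  private def fill(pos: Long): Unit = {
    file.seek(pos)
    bufferStart = pos
    bufferLen = math.max(0, file.read(buffer, 0, bufferSize))
  }

  // Random access to any offset in [0, length); going backwards (backtracking)
  // simply refills the window at an earlier position.
  def byteAt(pos: Long): Byte = {
    if (pos < bufferStart || pos >= bufferStart + bufferLen) fill(pos)
    buffer((pos - bufferStart).toInt)
  }

  def close(): Unit = file.close()
}
```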
> On further thought, I wonder if using a custom ParserInput is giving us any benefit here: do we actually need to avoid reading individual files into memory for parsing,
On this point, we actually have 100MB-1GB input JSONs that take up a huge amount of memory. Depending on the order in which Bazel requests a particular compile, you could easily OOM if you concurrently buffer and parse several of these. I figured the easiest thing here was just to avoid the in-memory buffering. There are further issues with 100+MB arrays: they cause heap fragmentation and large-object pinning, so even with plenty of free space the GC may still declare OOM, since it can't move the huge objects and can lose the ability to allocate new ones.
Got it. The first parse should be fine, but the call to .traced that we do on failure does a second parse that doesn't work with java.io.Reader. I think the current approach is fine then.
I think this looks fine. @szeiger leaving this open for a bit for you to take a look before we merge it
Why are we keeping the parsed source files at all? AFAIR it's just for cache invalidation. Maybe we could replace them with hashes. Parsing could be done with a standard streaming parse or even load the entire file into memory temporarily.
@szeiger This PR does replace the cache entry with a CRC hash (CRC was much faster than MD5, which made a huge difference for ~1GB files). Right now the PR will keep files of up to 1MB in memory, but I can change that so that we always stream-parse.
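For illustration, a streaming CRC32 of the kind described, which never buffers the whole file (a sketch, not the PR's code):

```scala
import java.nio.file.{Files, Paths}
import java.util.zip.CRC32

def crc32Of(path: String): Long = {
  val crc = new CRC32
  val in = Files.newInputStream(Paths.get(path))
  try {
    val buf = new Array[Byte](64 * 1024)
    var n = in.read(buf)
    while (n >= 0) { crc.update(buf, 0, n); n = in.read(buf) }
  } finally in.close()
  crc.getValue // combined with the path to form the cache key
}
```

CRC32 isn't collision-resistant in a cryptographic sense, but for cache invalidation keyed alongside the path it is a reasonable trade-off for speed.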
> On this point, we actually have 100MB-1GB input JSONs that take up a huge amount of memory.
We may want to skip Jsonnet parsing for these files altogether. If the file name ends with .json we could simply use uJSON and have it generate an Sjsonnet AST consisting of static objects.
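A hedged sketch of that .json shortcut, parsing with uJSON and folding the result into static nodes; StaticNode and its cases are illustrative stand-ins for sjsonnet's real AST types:

```scala
import ujson.Value

sealed trait StaticNode
object StaticNode {
  case class Obj(fields: Seq[(String, StaticNode)]) extends StaticNode
  case class Arr(items: Seq[StaticNode]) extends StaticNode
  case class Str(value: String) extends StaticNode
  case class Num(value: Double) extends StaticNode
  case class Bool(value: Boolean) extends StaticNode
  case object Null extends StaticNode
}

// Recursively convert a uJSON value into the static representation.
def fromJson(v: Value): StaticNode = v match {
  case obj: ujson.Obj => StaticNode.Obj(obj.value.toSeq.map { case (k, x) => k -> fromJson(x) })
  case arr: ujson.Arr => StaticNode.Arr(arr.value.toSeq.map(fromJson))
  case ujson.Str(s)   => StaticNode.Str(s)
  case ujson.Num(n)   => StaticNode.Num(n)
  case ujson.Bool(b)  => StaticNode.Bool(b)
  case ujson.Null     => StaticNode.Null
}

// e.g. fromJson(ujson.read("""{"a": [1, true, null]}"""))
```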
- ResolvedFile interface that encapsulates all file reads (import, importstr, etc.). Behind this, a file can be backed by a string in memory (old behavior) or by a file on disk
- ParserInputs for each - the in-memory one is trivial and the on-disk one is backed by a RandomAccessFile (with buffered reads around it for performance)
- Misc
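For context, an illustrative sketch of the shape described above; the trait and method names here are hypothetical, not the PR's exact ResolvedFile API:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

trait ResolvedFileSketch {
  def contentHash: Long    // e.g. the CRC used for cache invalidation
  def readString(): String // full contents, for importstr and small files
}

// Old behavior: the file is simply a string held in memory.
final class InMemoryFile(text: String, val contentHash: Long) extends ResolvedFileSketch {
  def readString(): String = text
}

// New behavior for large files: nothing is retained in memory between reads;
// a streaming/random-access ParserInput would sit on top of this for parsing.
final class OnDiskFile(path: String, val contentHash: Long) extends ResolvedFileSketch {
  def readString(): String =
    new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8)
}
```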