ahirreddy commented 11 months ago

This PR adds a new ResolvedFile interface that encapsulates all file reads (import, importstr, etc). Behind this a file can be backed by a string in memory (old behavior) or by a file on disk
We implement ParserInputs for each - the in-memory one is trivial and the on-disk one is backed by a RandomAccessFile (with buffered reads around it for performance)
Testing against our internal workloads - this greatly reduced memory presure as there are cases where we would previously store single files ranging between 100MB - 1GB in memory, in addition to its parsed AST.

Misc

Add support for customizing compression levels in std.xz. The default level (6) is very expensive (both in time and space) - so individual call sites can opt for lower levels
Uses (path, CRC32) for the ParseCache key. CRC32 should be sufficient to detect changes in a given file - and is significantly more performant that easily available alternatives (~10x faster than MD5)

lihaoyi-databricks commented 11 months ago

Can we use a fastparse.ReaderParserInput or IteratorParserInput here? This kind of use case is exactly what those guys are for (see Streaming Parsing docs), and it would avoid us needing to maintain our own bespoke implementation

ahirreddy commented 11 months ago

I tried using that one, but it doesn't support backtracking - which is used at some point when dealing with syntax errors

lihaoyi-databricks commented 11 months ago

ReaderParserInput is meant to support backtracking, in that it is supposed to buffer up the input text until it hits a cut that allows it to call dropBuffer. Is that not working for some reason, is it or it failing with some error? Or is it simply keeping a larger buffer than is possible than with the random access interface that you are using?

On further thought, I wonder if using a custom ParserInput is giving us any benefit here: do we actually need to avoid reading individual files into memory for parsing, or is the change of ParseCache key to (path, CRC) sufficient? I would expect that the number of files being parsed at any point in time (O(10s)) would be dwarfed by the number of files that were previously being kept in the parse cache (O(10000s)), and so maybe reading the individual files into memory during parsing is sufficient as long as we don't keep the file contents around after. This is something that's worth testing out, as if true this would the change trivial and avoid needing to maintain a bunch of fiddly ParserInput/BufferedRandomAccessFile code

ahirreddy commented 11 months ago

ReaderParserInput definitely doesn't support backtracking. It extends BufferedParserInput - for which checkTraceable‎ throws

  def checkTraceable() = throw new RuntimeException(
    s"Cannot perform `.traced` on an `${getClass.getName}`, as it needs to parse " +
      "the input a second time to collect traces, which is impossible after an " +
      "`IteratorParserInput` is used once and the underlying Iterator exhausted."
  )

I don't see any code that would make it work for Reader even if we changed checkTraceable. It think we'd need to call mark(position) on the underlying Reader and then reset() whenever we need to go backwards. That's basically what I did for our use case - just with a RandomAccessFile directly.

On further thought, I wonder if using a custom ParserInput is giving us any benefit here: do we actually need to avoid reading individual files into memory for parsing,

On this point, we actually have 100MB-1GB input JSONs that take up a huge amount of memory. Depending on the order in which you process - you could easily OOM if you concurrently buffer and parse multiple of these (depending on the order in which Bazel requests a particular compile). I figured the easiest thing here was just to avoid the in memory buffering. There are further issues with 100+MB arrays - as they'll result in heap fragmentation and large object pinning - so even with plenty of space the GC may still declare OOM as it can't move the huge objects and can lose the ability to allocate new ones.

lihaoyi-databricks commented 11 months ago

Got it. The first parse should be fine, but the call to .traced that we do on failure does a second parse that doesn't work with java.io.Reader. I think the current approach is fine then

lihaoyi-databricks commented 11 months ago

I think this looks fine. @szeiger leaving this open for a bit for you to take a look before we merge it

szeiger commented 11 months ago

Why are we keeping the parsed source files at all? AFAIR it's just for cache invalidation. Maybe we could replace them with hashes. Parsing could be done with a standard streaming parse or even load the entire file into memory temporarily.

ahirreddy commented 11 months ago

@szeiger This PR does replace the cache entry with a CRC hash (was much faster than MD5- which made a huge difference for ~1GB files). Right now the PR will keep up to 1MB files in memory, but I can change that so that we always streaming parse.

szeiger commented 11 months ago

On this point, we actually have 100MB-1GB input JSONs that take up a huge amount of memory.

We may want to skip Jsonnet parsing for these files altogether. If the file name ends with .json we could simply use uJSON and have it generate an Sjsonnet AST consisting of static objects.

databricks / sjsonnet

Avoid Storing Large Strings in Memory #194

Misc