LionWeb-io / lionweb-repository

Reference implementation of LionWeb repository

Storing and retrieving very large trees #51

Open ftomassetti opened 2 months ago

ftomassetti commented 2 months ago

When retrieving a tree slightly over 500 MB, I get a failure.

Note that this tree is simply the AST of a single (admittedly large) file. It is not like I am asking for a tree representing a huge codebase, so conceptually I would expect to be able to retrieve this tree.

If I specify a BODY_LIMIT of 500mb (BODY_LIMIT=500mb PGHOST=localhost npm run dev), I get this:

PayloadTooLargeError: request entity too large (received 524304384, when limit is)
    at Gunzip.onData (/Users/ftomassetti/repos/lionweb-repository/node_modules/raw-body/index.js:260:12)
    at Gunzip.emit (node:events:513:28)
    at addChunk (node:internal/streams/readable:324:12)
    at readableAddChunk (node:internal/streams/readable:297:9)
    at Readable.push (node:internal/streams/readable:234:10)
    at Zlib.processCallback (node:zlib:566:10)

This is expected.

So I just increase the BODY_LIMIT to 750mb. At this point, I get the unexpected failure:

Api.getNodes: {"ids":["tenant-strumenta___EGL-myID"]} depth 1 clientId: [object Object]
/Users/ftomassetti/repos/lionweb-repository/node_modules/raw-body/index.js:266
      buffer += decoder.write(chunk)
                        ^

RangeError: Invalid string length
    at Gunzip.onData (/Users/ftomassetti/repos/lionweb-repository/node_modules/raw-body/index.js:266:25)
    at Gunzip.emit (node:events:513:28)
    at addChunk (node:internal/streams/readable:324:12)
    at readableAddChunk (node:internal/streams/readable:297:9)
    at Readable.push (node:internal/streams/readable:234:10)
    at Zlib.processCallback (node:zlib:566:10)

The code that is compressing the response seems to fail. This may be related to this issue https://github.com/nodejs/node/issues/35973 which says that V8 has an issue dealing with data > 512MB.

Now, even if we could get this solved somehow, I wonder if we should aim to send smaller responses and have some sort of paginated answer. Perhaps when getting a very large SerializationChunk we could send something like:

// answer sent for a retrieve request
{
   data: SerializationChunk,
   remainingNodeIDs: [string] // this should be empty for "normal" requests
}

At this point, the Client could ask for the other nodes (using the existing retrieve call) and then recompose the tree on its own.

This would make the client slightly more complex, but I think that many clients could have issues dealing with responses this large.
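
A minimal sketch of what the client-side loop could look like, assuming a paginated answer shaped as above (the names retrieve, retrieveAll, and PaginatedRetrieveResponse are illustrative, not part of the current API):

    // Sketch only: `retrieve` stands in for the existing retrieve call and
    // `SerializationChunk` for the usual chunk type.
    type SerializationChunk = unknown

    interface PaginatedRetrieveResponse {
        data: SerializationChunk
        remainingNodeIDs: string[]   // empty for "normal" (non-paginated) answers
    }

    declare function retrieve(ids: string[], depth: number): Promise<PaginatedRetrieveResponse>

    async function retrieveAll(ids: string[], depth: number): Promise<SerializationChunk[]> {
        const chunks: SerializationChunk[] = []
        let pending = ids
        while (pending.length > 0) {
            const response = await retrieve(pending, depth)
            chunks.push(response.data)
            pending = response.remainingNodeIDs   // keep asking until nothing is left
        }
        return chunks   // the client recomposes the tree from these chunks
    }

The server stays stateless: each call is an ordinary retrieve, and all the bookkeeping lives in the client.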

What do you think @joswarmer and @dslmeinte ?

joswarmer commented 2 months ago

For answers this large, some kind of pagination is probably a good idea anyway. Your approach might work; we just have to make sure the number of IDs sent won't get too big. An alternative is to do pagination similar to how databases do it. Not sure about the pros and cons.

dslmeinte commented 2 months ago

But how would you know beforehand whether you're requesting a large number of (large) nodes, and might want to paginate them? Or would you just always paginate with a large window–thousands of nodes, probably. Would that window always suffice? Given that there is no trivial linear ordering on nodes, what order would you use? And can order change in between paginated calls?

Just some “random” questions...

joswarmer commented 2 months ago

See e.g. https://www.cybertec-postgresql.com/en/pagination-problem-total-result-count/

dslmeinte commented 2 months ago

Could we draw some inspiration from the TCP/IP protocol which essentially solves the same problem?

joswarmer commented 2 months ago

@ftomassetti The nice thing about your proposed solution is that we do not need pagination on the server side at all, as the client can just ask for the nodes in the remainingNodeIds array.

This is interesting because pagination on the server side seems to imply (after a little bit of reading) either rerunning the entire query for each page, or storing the results of the query on the server side (with the risk of forgetting to explicitly delete them).

ftomassetti commented 2 months ago

@ftomassetti The nice thing about your proposed solution is that we do not need pagination on the server side at all, as the client can just ask for the nodes in the remainingNodeIds array.

Yes, my goal is exactly that: the server should not have to "remember" anything about the query.

But how would you know beforehand whether you're requesting a large number of (large) nodes, and might want to paginate them? Or would you just always paginate with a large window–thousands of nodes, probably.

Perhaps you could specify an optional limit, with the default being sufficiently high.

Would that window always suffice? Given that there is no trivial linear ordering on nodes, what order would you use? And can order change in between paginated calls?

Well, in the original chunk we have some order of the nodes. We can use that. The problem is then remembering that ordering, but we can delegate that to the client by passing remainingNodeIds.

Could we draw some inspiration from the TCP/IP protocol which essentially solves the same problem?

I am afraid I will need to refresh my memory about TCP/IP :D

just have to make sure the number of ids sent won't get too big.

That is indeed an issue. If someone asked for a crazy number of IDs, even the list of IDs alone could be too big. But in general an ID is probably 1/100 or less of the size of a single node, so we should be able to serve requests at least 100 times bigger than the current limit (something like 50 GB).

ftomassetti commented 2 months ago

Another couple of options:

enikao commented 2 months ago

We could consider using a binary format. While it is true that we compress the JSON we send out (so that the payload transmitted is not that huge) here the problem is that the original JSON is so big that we are not able to compress it. Perhaps moving to another format could reduce the payload by one or two orders of magnitude

This sounds like an interesting idea. Wrapping the same data structures in e.g. protobuf should not change the semantics.

joswarmer commented 2 months ago

If the limit is at about 50 GB, I don't think this is going to be a problem for a long time. I am happy to take your approach until we run into problems (or not).

dslmeinte commented 2 months ago

“Scalability is a luxury problem.” — source unknown (maybe even me)

enikao commented 2 months ago

I've dabbled with protobuf-based serialization in Java.

I tried different optimizations with a very simple library M1 and a test language M2. File sizes in bytes (raw / compressed with tar -cz9f)

optimization     library size    testlang size
existing JSON    3.846 / 592     163.840 / 5.497
simple protobuf  619 / 450       16.280 / 4.309
+ compacted ids  622 / 493       14.385 / 4.469
+ string lookup  636 / 515       14.555 / 4.800
+ native prop    620 / 504       14.534 / 4.789

"simple protobuf" uses lookup tables for ids/keys and meta-pointers. "compacted ids" stores ids not as string, but as binary (we have a limited character set, so we can compact 4 characters into 3 bytes). "string lookup" uses a lookup table for all strings (property values and resolveInfo). "native prop" stores boolean/int32 properties in their native protobuf format. Each optimization includes all the ones above.

My takeaway from this: Compression is still more effective (duh!). I think we don't need a binary format, just make sure we're always working on byte streams, never convert stuff into strings. For big models, these byte streams should directly be compressed, so we should not have memory issues.
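
(A sketch of the "byte streams, never strings" point on the Node side, using only built-in stream/zlib APIs; the nodesAsJsonFragments generator, the file name, and the JSON skeleton are made up for illustration:)

    // Serialize node-by-node straight into a gzip stream, so the full JSON
    // never exists as a single JavaScript string.
    import { createWriteStream } from "node:fs"
    import { createGzip } from "node:zlib"
    import { Readable } from "node:stream"
    import { pipeline } from "node:stream/promises"

    async function writeCompressedChunk(nodesAsJsonFragments: () => Generator<string>): Promise<void> {
        async function* jsonParts() {
            yield '{"nodes":['                            // skeleton simplified
            let first = true
            for (const fragment of nodesAsJsonFragments()) {
                yield (first ? "" : ",") + fragment       // one node at a time
                first = false
            }
            yield "]}"
        }
        await pipeline(Readable.from(jsonParts()), createGzip(), createWriteStream("chunk.json.gz"))
    }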

dslmeinte commented 2 months ago

I'm not surprised by these findings, as compression is really good and sophisticated these days. Combining a binary format with compression might still eke out a couple of per mille, but I'd agree beforehand that it's not worth the effort.

The only way I could see a use/need for a binary format is that deserializing from one could be optimized better than deserializing from the compressed, streaming but still chatty LionWeb JSON format.

enikao commented 2 months ago

Maybe we'd have a very easy first step: Don't pretty-print -- I'm not sure whether lw-repo does that?

dslmeinte commented 2 months ago

Even if it did, the difference is quite small after compression. But it's an easy win/gain, and clients should be responsible for presenting JSON payload in a suitable (pretty-printed) way.

ftomassetti commented 2 months ago

Data about the problem

Some data about the file being parsed: it has 31,301 lines. So it is definitely a big file, but unfortunately, some codebases seem to have files of that size.

The number of LionWeb nodes present in the tree is 542,674.

The JSON file (before compression) is 829,441,584 bytes long. It compressed down to 11,375,100 bytes (not bad!)

Suggestion to reduce the size of trees

I can also change the way in which ASTs are converted into LionWeb nodes. Each AST node has a Position composed of a start and an end Point. We are currently creating a LionWeb node for the Position and for each Point, so we are creating 4 LionWeb nodes for each AST node. Treating Positions as primitive values in LionWeb would probably reduce the size of the serialized tree by a factor of almost 4. This could be an optimization advice to give to users: when possible, use primitive values and not nodes.
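
Schematically (node IDs, keys, meta-pointers, and the position encoding below are all made up for illustration):

    // today: 4 LionWeb nodes per AST node
    { "id": "stmt-1",  "containments": [ { "children": ["pos-1"] } ] }
    { "id": "pos-1",   "containments": [ { "children": ["point-1"] }, { "children": ["point-2"] } ] }
    { "id": "point-1", "properties":   [ { "value": "1" },  { "value": "0" } ] }
    { "id": "point-2", "properties":   [ { "value": "31" }, { "value": "10" } ] }

    // with Position as a primitive value: 1 LionWeb node per AST node
    { "id": "stmt-1",  "properties":   [ { "value": "1:0-31:10" } ] }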

Realization: we have the problem on receiving the big chunk of data

While I thought the error I was seeing occurred when the LionWeb Repository sends back the big tree, I am now realizing that we have the problem when storing the AST. So we probably need to look at both problems.

enikao commented 2 months ago

While I thought the error I was seeing occurred when the LionWeb Repository sends back the big tree, I am now realizing that we have the problem when storing the AST. So we probably need to look at both problems.

Yes, the Java library does too much string handling. Same there: Avoid strings, work on byte streams.

enikao commented 2 months ago

@ftomassetti Maybe this branch helps: https://github.com/LionWeb-io/lionweb-java/tree/niko/fix-string-serialization

enikao commented 2 months ago

The aforementioned branch can serialize a model with 760.000 nodes with 500 MB of heap. The resulting file is 705 MB with indentation, 362 MB without.

ftomassetti commented 2 months ago

The aforementioned branch can serialize a model with 760.000 nodes with 500 MB of heap. The resulting file is 705 MB with indentation, 362 MB without.

I am seeing the exception on the Node side: from Java, with the version in master, I can create the model, but then the gzip interceptor on the Node side blows up when trying to decompress the payload. Your changes could probably make it faster, but to solve the exception I think we need to send less data to Node.

enikao commented 2 months ago

I guess you would need to apply similar optimizations on the node side:

  • (De-)serialize with a streaming API

ftomassetti commented 2 months ago

I guess you would need to apply similar optimizations on the node side:

  • (De-)serialize with a streaming API

Yes, but the decompression is performed through an interceptor that seems to be built into Express. The server explodes before we get a chance to serve the request with user code. I would not know how to make Express run the unzip process to produce a stream and not a string.

The stacktrace is this:

    at Gunzip.onData (/Users/ftomassetti/repos/lionweb-repository/node_modules/raw-body/index.js:260:12)
    at Gunzip.emit (node:events:513:28)
    at addChunk (node:internal/streams/readable:324:12)
    at readableAddChunk (node:internal/streams/readable:297:9)
    at Readable.push (node:internal/streams/readable:234:10)
    at Zlib.processCallback (node:zlib:566:10)
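
(For reference, a rough sketch of what bypassing the built-in body parsing for one route could look like, streaming the gunzip output into an incremental JSON parser instead of a string. It assumes the stream-json package and an illustrative route, and is untested against the repository code:)

    // No express.json()/body-parser for this route: consume the request as a
    // byte stream, gunzip it, and parse the "nodes" array one element at a time.
    import express from "express"
    import { createGunzip } from "node:zlib"
    import { chain } from "stream-chain"
    import { parser } from "stream-json"
    import { pick } from "stream-json/filters/Pick"
    import { streamArray } from "stream-json/streamers/StreamArray"

    const app = express()

    app.post("/bulk/createPartitions", (req, res) => {
        const nodes = chain([req, createGunzip(), parser(), pick({ filter: "nodes" }), streamArray()])
        nodes.on("data", ({ value }) => {
            // `value` is one deserialized node object; hand it over to storage here
        })
        nodes.on("end", () => res.json({ success: true }))
        nodes.on("error", err => res.status(500).json({ error: String(err) }))
    })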

I'm not sure about the repo implementation, but maybe you could go one step further: In Java, I built the full node tree in memory (as that's my "source of truth"). The repo has everything in tables, so maybe it could skip building the node tree completely and stream JSON nodes directly. Collecting the used languages might be possible with a separate SQL query -- databases tend to be pretty good at aggregating (-:

I am not sure I understand this point. On my side (the client) I have the tree in memory, and I produce the JSON in order to send it to the server.

So far, the only ideas I can think of that provide a form of streaming are these:

Option I: server aware of partial requests

We can make the server aware that this is a partial request and have it wait to receive all the parts before storing them on disk. This means more work for the server, because it needs to keep this data in memory and possibly handle some sort of timeout to clean up the memory (in case the partial requests are never completed).

Option II: hiding to the server the fact that we are sending partial requests

We could have the client split the request into single requests which are individually valid. To do so we would need to remove all pointers to nodes not yet sent. So, if I am about to send a node N1, with references or containments referring to a node N2, which I am not going to send right away, then I have to change N1 in my request, removing the pointer to N2.

Later on, when I send N2, I will need to go back and say to the server that N1 actually had a pointer to N2. This may be a bit of work, but all of the work would be on the client. The server would receive a set of requests that would be individually valid.

The drawback of this approach is that, in case N2 already exists, I would effectively delete it in between my requests. Eventually I would get the expected state on the server. Also, for whatever reason, only some of the requests may go through (e.g., the client or the server dies when we have sent half of the requests we need to send). This could leave us in an inconsistent state.
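
A sketch of the client-side splitting step (the node shape below is a simplification of the serialization format; stripUnsentPointers is a hypothetical helper):

    // Before sending a batch, drop containment/reference targets that have not
    // been sent yet, and remember them so a later "patch" request can restore them.
    interface SerializedNodeLike {
        id: string
        containments: { children: string[] }[]
        references: { targets: { reference: string | null }[] }[]
    }

    function stripUnsentPointers(batch: SerializedNodeLike[], known: Set<string>): Map<string, string[]> {
        const deferred = new Map<string, string[]>()       // nodeId -> pointers to re-add later
        const defer = (id: string, target: string) =>
            deferred.set(id, [...(deferred.get(id) ?? []), target])
        for (const node of batch) {
            for (const containment of node.containments) {
                containment.children
                    .filter(childId => !known.has(childId))
                    .forEach(childId => defer(node.id, childId))
                containment.children = containment.children.filter(childId => known.has(childId))
            }
            for (const reference of node.references) {
                for (const target of reference.targets) {
                    if (target.reference !== null && !known.has(target.reference)) {
                        defer(node.id, target.reference)
                        target.reference = null
                    }
                }
            }
        }
        return deferred
    }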

Option III: use web-sockets

Perhaps if we use web sockets we can send blocks of nodes and have the server receive all of them before processing the store request.

enikao commented 2 months ago

I guess you would need to apply similar optimizations on the node side:

  • (De-)serialize with a streaming API

Yes, but the decompression is performed through an interceptor that seems to be built into Express. The server explodes before we get a chance to serve the request with user code. I would not know how to make Express run the unzip process to produce a stream and not a string.

Sorry, I don't know anything about the JavaScript ecosystem or this tool, so I can't help there.

I'm not sure about the repo implementation, but maybe you could go one step further: In Java, I built the full node tree in memory (as that's my "source of truth"). The repo has everything in tables, so maybe it could skip building the node tree completely and stream JSON nodes directly. Collecting the used languages might be possible with a separate SQL query -- databases tend to be pretty good at aggregating (-:

I am not sure I understand this point. On my side (the client) I have the tree in memory, and I produce the JSON in order to send it to the server.

This is specific to Jos' repository implementation. In there, we might get away with never instantiating the whole tree in memory, but only forwarding single nodes from the database to the output stream.
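
A rough sketch of that idea, assuming the pg-query-stream package and a made-up table/column layout (not the actual repository schema):

    // Stream rows from Postgres straight into the HTTP response, so the
    // repository never materializes the whole tree in memory.
    import { Pool } from "pg"
    import QueryStream from "pg-query-stream"
    import type { Response } from "express"

    const pool = new Pool()

    async function streamNodesAsJson(rootId: string, res: Response): Promise<void> {
        const client = await pool.connect()
        try {
            const query = new QueryStream("SELECT node_json FROM nodes WHERE root_id = $1", [rootId])
            const rows = client.query(query)
            res.type("application/json")
            res.write('{"nodes":[')                       // response skeleton simplified
            let first = true
            for await (const row of rows) {
                res.write((first ? "" : ",") + row.node_json)   // one node per row, already JSON text
                first = false
            }
            res.write("]}")
            res.end()
        } finally {
            client.release()
        }
    }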

enikao commented 2 months ago

There seem to be JS packages for that:

enikao commented 2 months ago

Branch https://github.com/LionWeb-io/lionweb-repository/tree/niko/trial-big-result contains a seriously ugly stack of hacks, but it does serve 235 MB of JSON from a node process that never takes more than 59 MB of memory. No compression, it serves the same node 600.000 times.

enikao commented 2 months ago

Now my repository hack can also receive 700 MB sized models:

Server is running at port 3005 =========================================================
WARNING! The server is not protected by a token. It can be accessed freely. If that is NOT your intention act accordingly.
 * createPartitions request received, with body of 739658239 bytes