Open ftomassetti opened 2 months ago
For answers this large some kind of pagination is probably a good idea anyway. Your approach might work, we just have to make sure the number of ids sent won't get too big. An alternative is to do pagination similar to how databases do pagination. Not sure about the pros and cons.
But how would you know beforehand whether you're requesting a large number of (large) nodes, and might want to paginate them? Or would you just always paginate with a large window (thousands of nodes, probably)? Would that window always suffice? Given that there is no trivial linear ordering on nodes, what order would you use? And can order change in between paginated calls?
Just some “random” questions...
Could we draw some inspiration from the TCP/IP protocol which essentially solves the same problem?
@ftomassetti The nice thing about your proposed solution is that we do not need pagination at the server side at all, as the client can just ask for the nodes in the `remainingNodeIds` array.
This is interesting because pagination at the server side seems to imply (after a little bit of reading) either rerunning the entire query for each page, or storing the results of the query at the server side (with the risk of forgetting to explicitly delete these).
> @ftomassetti The nice thing about your proposed solution is that we do not need pagination at the server side at all, as the client can just ask for the nodes in the `remainingNodeIds` array.
Yes, my goal is to avoid the server having to "remember" anything about the query.
> But how would you know beforehand whether you're requesting a large number of (large) nodes, and might want to paginate them? Or would you just always paginate with a large window (thousands of nodes, probably)?
Perhaps you can specify an optional limit, with a default that is sufficiently high.
> Would that window always suffice? Given that there is no trivial linear ordering on nodes, what order would you use? And can order change in between paginated calls?
Well, in the original chunk we have some order of the nodes. We can use that. The problem is then remembering that ordering, but we can delegate that to the client by having it pass `remainingNodeIds`.
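A minimal sketch of that idea (all names here are illustrative, not the actual repository API): the server slices the chunk in its existing node order and returns the ids of everything it left out, so only the client has to remember the ordering.

```javascript
// Hypothetical partial-response shape: return the first `limit` nodes in the
// chunk's original order, plus the ids of the nodes that were left out. The
// server forgets everything after answering; the client holds the ordering.
function paginateChunk(nodes, limit) {
  return {
    nodes: nodes.slice(0, limit),
    remainingNodeIds: nodes.slice(limit).map((n) => n.id),
  };
}
```

The client can then feed `remainingNodeIds` back into a follow-up retrieve call until the array comes back empty.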
> Could we draw some inspiration from the TCP/IP protocol which essentially solves the same problem?
I am afraid I will need to refresh my memory about TCP/IP :D
> just have to make sure the number of ids sent won't get too big.
That is indeed an issue. If someone asked for a crazy number of IDs, even the list of IDs alone could be too big. But in general I think that an ID is probably 1/100 or less of the size of a single node, so we should be able to serve requests that are at least 100 times bigger (something like 50 GB).
Another couple of options:
We could consider using a binary format. While it is true that we compress the JSON we send out (so that the payload transmitted is not that huge), here the problem is that the original JSON is so big that we are not able to compress it. Perhaps moving to another format could reduce the payload by one or two orders of magnitude.
This sounds like an interesting idea. Wrapping the same data structures in e.g. protobuf should not change the semantics.
If the limit is at about 50 GB I don't think this is going to be a problem for a long time. I am happy to take your approach until we run into problems (or not).
“Scalability is a luxury problem.” — source unknown (maybe even me)
I've dabbled with protobuf-based serialization in Java.
I tried different optimizations with a very simple library M1 and a test language M2.
File sizes in bytes (raw / compressed with `tar -cz9f`):
optimization | library size | testlang size |
---|---|---|
existing JSON | 3,846 / 592 | 163,840 / 5,497 |
simple protobuf | 619 / 450 | 16,280 / 4,309 |
+ compacted ids | 622 / 493 | 14,385 / 4,469 |
+ string lookup | 636 / 515 | 14,555 / 4,800 |
+ native prop | 620 / 504 | 14,534 / 4,789 |
"simple protobuf" uses lookup tables for ids/keys and meta-pointers. "compacted ids" stores ids not as string, but as binary (we have a limited character set, so we can compact 4 characters into 3 bytes). "string lookup" uses a lookup table for all strings (property values and resolveInfo). "native prop" stores boolean/int32 properties in their native protobuf format. Each optimization includes all the ones above.
My takeaway from this: Compression is still more effective (duh!). I think we don't need a binary format, just make sure we're always working on byte streams, never convert stuff into strings. For big models, these byte streams should directly be compressed, so we should not have memory issues.
I'm not surprised by these findings, as compression is really good and sophisticated these days. Combining a binary format and compressing that might eke out a couple of per mille still, but I'd agree beforehand it's not worth the effort.
The only way I could see a use/need for a binary format is that deserializing from one could be optimized better than deserializing from the compressed, streaming but still chatty LionWeb JSON format.
Maybe we'd have a very easy first step: Don't pretty-print -- I'm not sure whether lw-repo does that?
Even if it did, the difference is quite small after compression. But it's an easy win/gain, and clients should be responsible for presenting JSON payload in a suitable (pretty-printed) way.
Some data about the file being parsed: it has 31,301 lines. So it is definitely a big file, but unfortunately, some codebases seem to have files of that size.
The number of LionWeb nodes present in the tree is 542,674.
The JSON file (before compression) is 829,441,584 bytes long. It compressed down to 11,375,100 bytes (not bad!)
I can also change the way in which ASTs are converted into LionWeb nodes. Each AST node has a Position composed of a start and an end Point. We are currently creating a LionWeb node for the Position and for each Point, so we are creating 4 LionWeb nodes for each AST node. Treating Positions as primitive types in LionWeb would probably reduce the size of the file being parsed by almost a factor of 4. This could be optimization advice to give to users: when possible, use primitive values and not nodes.
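A sketch of that advice, with a made-up `line:col-line:col` text format (the encoding is an assumption, not part of LionWeb): a Position becomes one string property instead of three extra nodes per AST node.

```javascript
// Encode a Position (start/end Points) as a single primitive string property,
// e.g. { start: {line:1, column:0}, end: {line:3, column:10} } -> "1:0-3:10".
// The "line:col-line:col" format is invented here for illustration.
function encodePosition(pos) {
  return `${pos.start.line}:${pos.start.column}-${pos.end.line}:${pos.end.column}`;
}

// Decode the string back into start/end Points.
function decodePosition(s) {
  const [start, end] = s.split('-').map((p) => {
    const [line, column] = p.split(':').map(Number);
    return { line, column };
  });
  return { start, end };
}
```

One string property per node replaces three child nodes, each of which would otherwise carry its own id, meta-pointer, and containment entries in the serialized chunk.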
While I thought the error I was seeing was in the LionWeb Repository sending back the big tree, I am now realizing that we are having the problem in storing the AST. So we probably need to look at both problems.
> While I thought the error I was seeing was in the LionWeb Repository sending back the big tree, I am now realizing that we are having the problem in storing the AST. So we probably need to look at both problems.
Yes, the Java library does too much string handling. Same there: Avoid strings, work on byte streams.
@ftomassetti Maybe this branch helps: https://github.com/LionWeb-io/lionweb-java/tree/niko/fix-string-serialization
The aforementioned branch can serialize a model with 760,000 nodes with 500 MB of heap. The resulting file is 705 MB with indentation, 362 MB without.
> The aforementioned branch can serialize a model with 760,000 nodes with 500 MB of heap. The resulting file is 705 MB with indentation, 362 MB without.
I am seeing the exception on the Node side: from Java, with the version in master, I can create the model, but then the Gzip interceptor on the Node side causes the explosion when trying to decompress the zip. Your changes could probably make it faster, but to solve the exception I think we need to send less data to Node
I guess you would need to apply similar optimizations on the node side:
> I guess you would need to apply similar optimizations on the node side:
> - (De-)serialize with a streaming API
Yes, but the decompression is performed through an interceptor that seems to be built into express. The server explodes before we get a chance to serve the request with user code. I would not know how to make express run the unzip process to produce a stream and not a string.
The stacktrace is this:

```
at Gunzip.onData (/Users/ftomassetti/repos/lionweb-repository/node_modules/raw-body/index.js:260:12)
at Gunzip.emit (node:events:513:28)
at addChunk (node:internal/streams/readable:324:12)
at readableAddChunk (node:internal/streams/readable:297:9)
at Readable.push (node:internal/streams/readable:234:10)
at Zlib.processCallback (node:zlib:566:10)
```
I'm not sure about the repo implementation, but maybe you could go one step further: In Java, I built the full node tree in memory (as that's my "source of truth"). The repo has everything in tables, so maybe it could skip building the node tree completely and stream JSON nodes directly. Collecting the used languages might be possible with a separate SQL query -- databases tend to be pretty good at aggregating (-:
I am not sure I understand this point. On the server side I have the tree in memory, and I produce the JSON in order to send it to the client.
So far the only idea I can think of that is a form of streaming is this:
We can make the server aware that this is a partial request, and have it wait to get all the parts before storing those on disk. Now, that is more work for the server, because it needs to keep this data in memory and possibly handle some sort of timeout to clean the memory (in case the partial requests are never completed).
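One way to sketch this "server collects parts" idea (all names are illustrative, not repository API): buffer parts per upload id and evict abandoned uploads on a timeout, so memory is reclaimed even if the final part never arrives.

```javascript
// Hypothetical in-memory store for partial uploads. Each addPart() resets an
// eviction timer; complete() returns all parts and frees the entry.
class PartialUploadStore {
  constructor(timeoutMs) {
    this.timeoutMs = timeoutMs;
    this.uploads = new Map();
  }

  addPart(uploadId, part) {
    let entry = this.uploads.get(uploadId);
    if (!entry) {
      entry = { parts: [], timer: null };
      this.uploads.set(uploadId, entry);
    }
    entry.parts.push(part);
    // Reset the eviction timer on every part, so only truly abandoned
    // uploads get cleaned up.
    if (entry.timer) clearTimeout(entry.timer);
    entry.timer = setTimeout(() => this.uploads.delete(uploadId), this.timeoutMs);
  }

  complete(uploadId) {
    const entry = this.uploads.get(uploadId);
    if (!entry) return null;
    clearTimeout(entry.timer);
    this.uploads.delete(uploadId);
    return entry.parts;
  }
}
```

This keeps the cleanup concern (the timeout mentioned above) in one place, at the cost of the server holding all parts in memory until the upload completes or expires.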
We could have the client split the request into single requests which are individually valid. To do so we would need to remove all pointers to nodes not yet sent. So, if I am about to send a node N1, with references or containments referring to a node N2, which I am not going to send right away, then I have to change N1 in my request, removing the pointer to N2.
Later on, when I send N2, I will need to go back and say to the server that N1 actually had a pointer to N2. This may be a bit of work, but all of the work would be on the client. The server would receive a set of requests that would be individually valid.
The drawback of this approach is that, in case N2 already exists, I would effectively delete it in between my requests. Eventually I will get the expected status on the server. Also, for any reason only some of the requests may go through (e.g., the client or the server dies after we have sent half of the requests we need to send). This could leave us in an inconsistent state.
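A client-side sketch of this splitting scheme (node and reference shapes are simplified, and the batching API is made up): references pointing outside what has been sent so far are stripped and recorded as patches to replay once their targets exist on the server.

```javascript
// Split nodes into batches that are individually valid: any reference to a
// node not yet sent is removed from the batch and remembered as a patch.
// Note: this mutates the nodes' `references` arrays for brevity.
function splitIntoValidBatches(nodes, batchSize) {
  const batches = [];
  const patches = [];
  for (let i = 0; i < nodes.length; i += batchSize) {
    const batch = nodes.slice(i, i + batchSize);
    // Ids that will be known to the server after this batch is sent.
    const sentSoFar = new Set(nodes.slice(0, i + batch.length).map((n) => n.id));
    for (const node of batch) {
      const kept = [];
      for (const ref of node.references) {
        if (sentSoFar.has(ref.target)) kept.push(ref);
        else patches.push({ nodeId: node.id, ref }); // replay after target is sent
      }
      node.references = kept;
    }
    batches.push(batch);
  }
  return { batches, patches };
}
```

After all batches go through, the client sends the `patches` as updates, restoring the references that were stripped; as noted above, a crash halfway leaves the server in a temporarily inconsistent state.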
Perhaps if we use web sockets we can send blocks of nodes and receive all of them on the server before processing the storing request.
> I guess you would need to apply similar optimizations on the node side:
> - (De-)serialize with a streaming API
> Yes, but the decompression is performed through an interceptor that seems to be built into express. The server explodes before we get a chance to serve the request with user code. I would not know how to make express run the unzip process to produce a stream and not a string.
Sorry, I don't know anything about the JavaScript ecosystem or this tool, so I can't help there.
> I'm not sure about the repo implementation, but maybe you could go one step further: In Java, I built the full node tree in memory (as that's my "source of truth"). The repo has everything in tables, so maybe it could skip building the node tree completely and stream JSON nodes directly. Collecting the used languages might be possible with a separate SQL query -- databases tend to be pretty good at aggregating (-:
> I am not sure I understand this point. On the server side I have the tree in memory, and I produce the JSON in order to send it to the client.
This is specific to Jos' repository implementation. In there, we might get away with never instantiating the whole tree in memory, but only forwarding single nodes from the database to the output stream.
There seem to be JS packages for that:
Branch https://github.com/LionWeb-io/lionweb-repository/tree/niko/trial-big-result contains a seriously ugly stack of hacks, but it does serve 235 MB of JSON from a node process that never takes more than 59 MB of memory. No compression, it serves the same node 600,000 times.
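The core of such streaming can be sketched in a few lines (a simplification, not the actual code on that branch; the envelope keys are the usual LionWeb chunk fields, assumed here): write the envelope and each node separately to the output stream, so memory stays bounded roughly by one serialized node at a time.

```javascript
// Stream a chunk to any writable-like sink (e.g. an HTTP response) without
// ever building the full JSON document as one string in memory.
function writeChunk(out, nodes) {
  out.write('{"serializationFormatVersion":"2023.1","languages":[],"nodes":[');
  nodes.forEach((node, i) => {
    if (i > 0) out.write(',');
    out.write(JSON.stringify(node)); // only one node is stringified at a time
  });
  out.write(']}');
  out.end();
}
```

In a real server, `nodes` would itself be an async iterator over database rows rather than an in-memory array, which is what keeps the process at a few tens of MB.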
Now my repository hack can also receive 700 MB sized models:

```
Server is running at port 3005 =========================================================
WARNING! The server is not protected by a token. It can be accessed freely. If that is NOT your intention act accordingly.
* createPartitions request received, with body of 739658239 bytes
```
When retrieving a tree slightly over 500 MB, I get a failure.
Note that this tree is simply the AST of a single (admittedly large) file. It is not like I am asking for a tree representing a huge codebase, so conceptually I would expect to be able to retrieve this tree.
If I specify a `BODY_LIMIT` of 500mb (`BODY_LIMIT=500mb PGHOST=localhost npm run dev`), I get an error (screenshot omitted). This is expected.
So I just increase the `BODY_LIMIT` to 750mb. At this point, I get an unexpected failure (screenshot omitted): the code that is compressing the response seems to fail. This may be related to this issue https://github.com/nodejs/node/issues/35973 which says that V8 has an issue dealing with data > 512MB.
Now, even if we could get this solved somehow, I wonder if we should aim to send smaller responses and have some sort of paginated answer. Perhaps when getting a very large `SerializationChunk` we could send only a partial answer. Something like: (example omitted)
At this point, the Client could ask for the other nodes (using the existing retrieve call) and then recompose the tree on its own.
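Recomposing on the client could be a simple loop (a hypothetical sketch; `retrieve` stands in for the existing retrieve call, and the partial-answer shape with `remainingNodeIds` is the proposal above, not the current API):

```javascript
// Keep retrieving until the server reports no remaining ids, then the caller
// can recompose the tree from the accumulated nodes.
async function fetchAll(retrieve, rootIds) {
  const allNodes = [];
  let pending = rootIds;
  while (pending.length > 0) {
    const { nodes, remainingNodeIds } = await retrieve(pending);
    allNodes.push(...nodes);
    pending = remainingNodeIds;
  }
  return allNodes;
}
```

Each iteration is an ordinary bounded-size request, so neither side ever has to hold the whole chunk at once.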
This would be a slightly more complex client, but I think that many clients could have issues dealing with responses this large.
What do you think @joswarmer and @dslmeinte ?