Closed wouterbeek closed 6 years ago
Hm. Don't see those every day!
Turns out this was hitting a bug where the stack grows at just the right (well, wrong) time and the system needs to reallocate it at a different location. It should be fixed by https://github.com/drobilla/serd/commit/2faabcaf2760a4c909d4be2f95057711e5eb9e23
I should think of ways to harden against things like this, though sometimes I wonder whether having serd use a growing stack was the right call at all (it's definitely not for embedded and resource-constrained applications, where I'd like it to be usable, and it has security implications for anything web-facing). A lot of things in the code would be much simpler, and there would be less potential for rare bugs like this to crop up, if there were just a fixed stack size limit; if the file exceeds it, too bad. It'd be a bit faster, too. Something to think about for serd1, I suppose.
I'm not sure whether this whole file parses; the error is triggered by something in the first 100 million lines, so that is all I've checked. If you have this file locally, try running the whole thing through serdi just to make sure, and let me know how that goes.
@drobilla Thanks for the quick fix! When parsing this particular file I can indeed observe that Serd is using over 25 GB of memory at some point. I'm curious why allocating additional memory is needed at all: TriG can be parsed with a window of one triple, though there may be a reason for storing things in memory that I'm overlooking?
E.g., in the following TriG snippet, one graph is encoded starting at line 1. The graph encodes ~4M triples, e.g., at line 2M a triple starts that uses predicate/object abbreviation, such that the subject term `<s:s>` is repeated ~2M times, and the predicate term `<q:q>` is repeated ~1M times:
```
[1]  <g:g> {
     ...
[2M] <s:s> <p:p> <o:o> ;
     ...
[3M] <q:q> "aaa" ,
     ...
[4M] "bbb" .
     }
```
In order to parse the last triple in this snippet, Serd would have to keep only `<g:g>`, `<s:s>`, and `<q:q>` in memory, in order to be able to determine the correct quadruple when parsing `"bbb"`. Is there ever a reason to keep more in memory?
Good question. It definitely shouldn't be doing that except for very pathological graphs (memory usage should grow with only the amount actually needed to stream, so roughly with the amount of nesting). That's the main reason I wrote serd in the first place.
There must be (the equivalent of) a leak somewhere. The TriG parsing is a lot newer than the rest; I haven't really used it on anything huge yet. I'll look into it....
Yep, leaking graphs. As of 0179f024aafea7f49a5abb459fa12a4ab2495b99, parsing this file should take 1 MiB or so (the default allocation; the stack never grows).
(Specifically it was growing with the number of graphs in the input, and this file has a lot of them)
Thanks!
I have a large (~13 GB) gzipped TriG file. When I parse it with serdi, I consistently get a segmentation fault after a couple of seconds. To reproduce: