drobilla / serd

A lightweight C library for RDF syntax
https://gitlab.com/drobilla/serd
ISC License

Segmentation fault when parsing a specific file #16

Closed wouterbeek closed 6 years ago

wouterbeek commented 6 years ago

I have a large (~13 GB) gzipped TriG file. When I parse it with serdi, I consistently get a segmentation fault after a couple of seconds. To reproduce:

$ curl -L 'https://demo.triply.cc/wouter/bag/assets/bag.trig.gz' | zcat | serdi - > /dev/null
drobilla commented 6 years ago

Hm. Don't see those every day!

Turns out this was hitting a bug where the stack grows at just the right (well, wrong) time and the system needs to reallocate it at a different location. It should be fixed by https://github.com/drobilla/serd/commit/2faabcaf2760a4c909d4be2f95057711e5eb9e23
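To illustrate the class of bug, here is a minimal standalone sketch (not serd's actual code): realloc() may move the block, so any pointer into the stack taken before the reallocation dangles afterwards.

    /* Sketch of the realloc() pitfall described above; hypothetical code. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        char* stack = malloc(16);
        if (!stack) { return 1; }
        strcpy(stack, "node");
        char* node = stack;  /* Pointer into the current block */

        char* bigger = realloc(stack, 1 << 20);  /* May move the block */
        if (!bigger) { free(stack); return 1; }

        if (bigger != stack) {
            /* `node` now dangles; dereferencing it is undefined
             * behaviour and can segfault exactly as reported here. */
            printf("stack moved; old pointers are invalid\n");
        }

        free(bigger);
        (void)node;
        return 0;
    }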

I should think of ways to harden against things like this, though sometimes I wonder if having serd use a growing stack at all was the right call (it's definitely not right for embedded and resource-constrained applications where I'd like it to apply, at least, and it has security implications for anything web-facing). A lot of things in the code would be much simpler, and there would be less potential for rare bugs like this to crop up, if there were just a fixed stack size limit, and if the file exceeds it, too bad. It'd be a bit faster, too. Something to think about for serd1, I suppose.
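For concreteness, the fixed-limit alternative might look something like this (a rough sketch under hypothetical names, not a proposed serd API): pushes beyond a hard cap fail cleanly instead of reallocating, so pointers into the stack stay valid for the parser's lifetime.

    /* Sketch of a fixed-capacity stack; hypothetical names, not serd's API. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    #define STACK_CAPACITY (1024 * 1024)  /* e.g. a hard 1 MiB limit */

    typedef struct {
        char   buf[STACK_CAPACITY];
        size_t top;
    } FixedStack;

    /* Returns false instead of growing when the input is too deep. */
    static bool fixed_stack_push(FixedStack* s, const void* data, size_t len)
    {
        if (len > STACK_CAPACITY - s->top) {
            return false;  /* File exceeds the limit: too bad */
        }
        memcpy(s->buf + s->top, data, len);
        s->top += len;
        return true;
    }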

I'm not sure whether this whole file parses; the error is triggered by something in the first 100 million lines, so that's all I've checked. If you have this file locally, try running the whole thing through serdi just to make sure, and let me know how that goes.

wouterbeek commented 6 years ago

@drobilla Thanks for the quick fix! When parsing this particular file I can indeed observe that serd uses over 25 GB of memory at some point. I'm curious why allocating additional memory is needed at all: TriG can be parsed with a window of one triple, but perhaps there is a reason for storing things in memory that I am overlooking?

E.g., in the following TriG snippet, one graph is encoded starting at line 1. The graph encodes ~4M triples; e.g., at line 2M a triple starts that uses predicate/object abbreviation, so that the subject term <s:s> is shared by ~2M triples and the predicate term <q:q> is shared by ~1M triples:

[1]     <g:g> {
          ...
[2M]      <s:s> <p:p> <o:o> ;
                ...........
[3M]            <q:q> "aaa" ,
                      .....
[4M]                  "bbb" .
        }

To parse the last triple in this snippet, serd would only need to keep <g:g>, <s:s>, and <q:q> in memory in order to determine the correct quadruple when parsing "bbb". Is there ever a reason to keep more in memory?
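To make the one-triple window concrete, the bounded state a streaming TriG parser needs at any point is essentially just the following (hypothetical types, not serd's actual API):

    #include <stdio.h>

    /* Hypothetical streaming state; not serd's actual types. */
    typedef struct {
        const char* graph;      /* <g:g> while inside its { ... } block  */
        const char* subject;    /* <s:s> while its ';' list remains open */
        const char* predicate;  /* <q:q> while its ',' list remains open */
    } ParseContext;

    /* Each object yields one quad from this bounded context alone. */
    static void emit(const ParseContext* c, const char* object)
    {
        printf("%s %s %s %s .\n", c->subject, c->predicate, object, c->graph);
    }

    int main(void)
    {
        const ParseContext ctx = { "<g:g>", "<s:s>", "<q:q>" };
        emit(&ctx, "\"aaa\"");
        emit(&ctx, "\"bbb\"");  /* The last triple in the snippet above */
        return 0;
    }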

drobilla commented 6 years ago

Good question. It definitely shouldn't be doing that except for very pathological graphs (memory usage should grow only with the amount actually needed to stream, so roughly with the depth of nesting). That's the main reason I wrote serd in the first place.

There must be (the equivalent of) a leak somewhere. The TriG parsing is a lot newer than the rest; I haven't really used it on anything huge yet. I'll look into it...

drobilla commented 6 years ago

Yep, leaking graphs. As of 0179f024aafea7f49a5abb459fa12a4ab2495b99, parsing this file should take only 1 MiB or so of memory (the default allocation; the stack never grows).

drobilla commented 6 years ago

(Specifically, it was growing with the number of graphs in the input, and this file has a lot of them.)
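In other words, the graph name has to be popped off the parse stack when its closing } is read, otherwise the stack top only ever moves upward. A toy reconstruction of that pattern (hypothetical code, not the actual change in that commit):

    #include <stddef.h>

    /* Toy parse stack; hypothetical, not serd's implementation. */
    typedef struct {
        char   buf[4096];
        size_t top;
    } Stack;

    static size_t push(Stack* s, size_t len)
    {
        const size_t offset = s->top;
        s->top += len;  /* One graph name pushed per { ... } block */
        return offset;
    }

    static void parse_graph_block(Stack* s)
    {
        const size_t graph = push(s, 32);  /* Push the graph name */
        /* ... parse the statements inside { ... } ... */
        s->top = graph;  /* The fix: pop it when '}' is read. Without
                            this line, usage grows with every graph. */
    }

    int main(void)
    {
        Stack s = { {0}, 0 };
        for (int i = 0; i < 100000; ++i) {
            parse_graph_block(&s);  /* Memory use stays bounded */
        }
        return 0;
    }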

wouterbeek commented 6 years ago

Thanks!