prevent invalid reads when a string literal is unterminated

rdewaele commented 4 years ago

I consider this proposed change more of a proof-of-concept fix, as I am not very familiar with the serd codebase. Its main purpose is to illustrate the problem by showing one way of preventing it from happening.

When parsing an unterminated string literal in a turtle/trig/turtle file, serd will perform invalid reads beyond the string buffer end. For example with the following input file:

<http://example.com/foo> <http://example.com/comment> """unterminated

The output is something as follows:

<http://example.com/foo> <http://example.com/comment> "unterminated\n\u0000\u0000 ...repeats a lot... u0000" .
error: invalid2.ttl:2:0: unexpected end of file

In the context I am using serd, I am passing a nul-terminated string of the exact size of the input to serd_reader_read_string. In this use-case the read_head surpasses the length of the string buffer and performs invalid reads. With serdi, it does not seem to perform invalid reads, instead producing the presumably undesired output of many nul characters.

drobilla commented 4 years ago

Hm, interesting. I'm surprised/ashamed that this isn't covered by the test suite, but I guess that doesn't really catch string buffer issues since it's all file based.

I'll look into it, thanks.

drobilla commented 4 years ago

Oh, I see. There are test cases for this and a bunch of similar things, but they don't check the output, only that serdi bails with an error.

drobilla commented 4 years ago

Okay, your fix is good given what was currently there (thanks!), but I looked a bit deeper into it and realized the character and EOF handling was pretty deeply wrong (the mix of types and the "sometimes null is EOF sometimes not" smells are symptoms of this which have been bothering me for a while).

So, I rewrote a bunch of that stuff so that EOF can be handled more sanely in general. https://github.com/drobilla/serd/commit/f7ffff1e75634909da60ea63a7c52f1a001220b8 should fix the same problem, also in some other places.

That said, there's still no test for reading string input out of bounds (I'm not sure how to do that without adding code to serdi specifically for it), so let me know if you're still seeing that.

rdewaele commented 4 years ago

Thank you for looking at & solving this so swiftly, and providing a more fundamental solution. I'll rework my (internal) conan package to make a gitref-based release and let you know how this is working out for us. (Not that I expect further issues.)

drobilla commented 4 years ago

Great, thanks / you're welcome. As it happens I spent a bunch of time recently almost completely automating the release process, so I can kick out another one if that makes life easy (though testing beforehand is of course appreciated).

Not that I expect further issues

Hey, you never know. Serd serves as a constant reminder to me that no matter how thoroughly you test something, there's always bugs somewhere :) It's written mainly for the high-throughput file parsing scenario so reading an exact string like this with invalid input could still be flaky (relatively speaking). I have some ideas how to tackle that more comprehensively though.

rdewaele commented 4 years ago

I have updated the serd version we use to a76b091. Valgrind likes all the tests a lot better now. I did notice a change in behavior when passing the empty string ("") to serd_reader_read_string:

this used to be accepted in the latest release of serd (0.30.0), but now it is rejected with an invalid document error;
it seems depending on the format (triples, quads, turtle or trig) that the code may end up in an infinite loop.

I still have to verify the suspected cause of the latter, as we could simply work around these changes by not passing the empty string to serd in the first place, saving some compute time in the process. :-)

edit: these behaviors do not occur with serdi

drobilla / serd

prevent invalid reads when a string literal is unterminated #27