confluentinc / bottledwater-pg

Change data capture from PostgreSQL into Kafka
http://blog.confluent.io/2015/04/23/bottled-water-real-time-integration-of-postgresql-and-kafka/
Apache License 2.0

Fix extension crashes #98

Closed samstokes closed 8 years ago

samstokes commented 8 years ago

This fixes two known cases that segfault the Bottled Water extension (and thus cause Postgres to restart all backends):

Both were segfaulting due to not checking the return values of Avro library functions.

I also added tests for long table/column names - it turns out Postgres already has a compile-time limit on identifier length, so unless people are routinely patching Postgres to increase the limit, we probably don't need to handle this case.

For invalid Avro identifiers, I chose to sanitise identifiers using an encoding similar to the "percent encoding" used in URLs. Unsupported characters are replaced by a hexadecimal representation: e.g. "person.name" -> "person_2e_name"

A couple of gotchas:

For tables with no columns, ideally we'd just publish Avro records with no fields. That was in fact the existing behaviour of the code, but Avro was bailing out when trying to process the record schema. I'm not sure whether a record with zero fields is valid Avro (the spec doesn't make it clear), but it's at least not supported by avro-c at present. So instead I just publish a "dummy" field with the value false.
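For illustration, the workaround presumably produces a record schema along these lines (the record name and the literal field name "dummy" are assumptions; the thread doesn't show the actual schema):

```json
{
  "type": "record",
  "name": "empty_table",
  "fields": [
    {"name": "dummy", "type": "boolean", "default": false}
  ]
}
```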

msakrejda commented 8 years ago

@samstokes the #61 fix is reasonable (like I said, this is a total edge case and basically anything other than crashing is okay). For the other one, if we have a Unicode table name like "crêpes", does that mean we can't predict the resulting topic name? I think this is probably fine for now, but it may be something we need to revisit, especially if we're working with a cluster where topic auto-creation is not turned on, so we'd need to know bottledwater's mangled topic name a priori.

samstokes commented 8 years ago

@uhoh-itsmaciek that's weird, "crêpes" was my Unicode test case too! My calling it "unspecified" was mainly punting for now just because I'm pretty sure it's not the only Unicode incompatibility, but your comment persuaded me to dig a bit deeper and I discovered a serious bug in my sanitisation function - thanks!

Now the story is a bit better. The encoding of non-ASCII characters isn't particularly intuitive, but it is deterministic: it's the underscore encoding of the bytes representing those characters in the server encoding (which I'm guessing we never change from UTF-8). e.g.: "crêpes" -> "cr_c3__aa_pes"

Good point re clusters without auto-create. This should help, since obviously everyone can do UTF-8 encoding in their heads.

msakrejda commented 8 years ago

It's the most delicious Unicode test case!

And nice, I think that's certainly good enough for now.