Support for nulls in fields

ivanprado commented 11 years ago

Adding support for nulls. Some schema fields could be marked as nullable, and those fields would accept null values. The idea is to implement efficient serialization and comparison that:

Efficiently serializes null information using a small number of bits
Does not affect the current Pangool efficiency when none of the fields is nullable.

ivanprado commented 11 years ago

Done! Now fields can contain nulls. For example:

new Schema("schema", Fields.parse("field1:int?, field2:string?"));

represents a schema with two fields, either of which can be null. A trailing "?" indicates that a field can be null.

Additionally, you can select how nulls behave when sorting:

builder.setOrderBy(OrderBy.parse("field1:desc|null_smallest, field2:asc|null_biggest"));

Nulls can be the smallest possible value, or the biggest one.
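
To illustrate how the two null options could interact with asc/desc, here is a minimal sketch built on the standard java.util.Comparator helpers. It is not Pangool's internal comparator: the class, enum and method names are hypothetical, and it assumes that null is first placed as the smallest or biggest value and the sort direction is applied afterwards.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical names throughout; this is not Pangool's internal comparator.
final class NullOrderSketch {

    enum NullOrder { NULL_SMALLEST, NULL_BIGGEST }

    // Builds a comparator for one field. Assumption: null is first placed as
    // the smallest or biggest value, and only then is asc/desc applied.
    static <T extends Comparable<T>> Comparator<T> fieldComparator(
            boolean ascending, NullOrder nullOrder) {
        Comparator<T> natural = Comparator.naturalOrder();
        Comparator<T> withNulls = (nullOrder == NullOrder.NULL_SMALLEST)
                ? Comparator.nullsFirst(natural)
                : Comparator.nullsLast(natural);
        return ascending ? withNulls : withNulls.reversed();
    }

    public static void main(String[] args) {
        Integer[] values = {3, null, 1};
        // Equivalent in spirit to "field1:desc|null_smallest".
        Arrays.sort(values,
                NullOrderSketch.<Integer>fieldComparator(false, NullOrder.NULL_SMALLEST));
        System.out.println(Arrays.toString(values)); // prints [3, 1, null]
    }
}
```

Under that assumption, null_smallest combined with desc places nulls at the end, because the smallest value comes last in a descending sort.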

The changes to the code can be seen on these commits:

15421903173b675177254e88bf20cf960d41c875
5a1afb713aca13d8717c7fc301b12a2b3a4ffd8b
07a6da46969caf55bcc708e3decf5f49a9345a9e
a5fe35a0858b77efdd402b4842927a8e0599a205

When none of the fields in the schema is nullable, Pangool behaves the same as before. The only overhead introduced is one boolean comparison per tuple when serializing and deserializing, and one boolean comparison per field plus one per tuple when comparing.

If at least one field in the schema is nullable, a bit field is added to the serialization as the first serialized element. Each bit in the bit field indicates whether a particular nullable field is null. The bit field holds 7 nullable fields per byte. That is, if you have a schema with 200 fields but only 7 of them are nullable, only one byte is used to represent the null information. With 8 to 14 nullable fields, 2 bytes are used, and so on.
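
The packing described above can be sketched roughly as follows. This is only an illustration, not the code from the commits: the class and method names are hypothetical, and it assumes the flags sit in the low 7 bits of each byte with the high bit left unused, which the description does not specify.

```java
// Hypothetical sketch; Pangool's actual bit layout may differ.
final class NullBitFieldSketch {

    // 7 nullable fields per byte, as stated in the description.
    static final int FLAGS_PER_BYTE = 7;

    // Bytes needed for a number of nullable fields: ceil(n / 7).
    static int bytesNeeded(int nullableFields) {
        return (nullableFields + FLAGS_PER_BYTE - 1) / FLAGS_PER_BYTE;
    }

    // Packs one flag per nullable field; true means "this field is null".
    // Assumption: flags use the low 7 bits of each byte, high bit unused.
    static byte[] pack(boolean[] isNull) {
        byte[] out = new byte[bytesNeeded(isNull.length)];
        for (int i = 0; i < isNull.length; i++) {
            if (isNull[i]) {
                out[i / FLAGS_PER_BYTE] |= (byte) (1 << (i % FLAGS_PER_BYTE));
            }
        }
        return out;
    }

    // Reads back the flag for the nullable field at the given position.
    static boolean isNull(byte[] bitField, int nullableIndex) {
        int b = bitField[nullableIndex / FLAGS_PER_BYTE];
        return (b & (1 << (nullableIndex % FLAGS_PER_BYTE))) != 0;
    }

    public static void main(String[] args) {
        System.out.println(bytesNeeded(7)); // prints 1
        System.out.println(bytesNeeded(8)); // prints 2
    }
}
```

Only nullable fields take part in the indexing here, so non-nullable fields add nothing to the bit field regardless of how many the schema has.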

The intermediate serialization used between the mapper and the reducer is able to handle nulls as well. Fields used for grouping or sorting don't need to have the same nullability across the different schemas.

Up to 2 bit fields can be used when serializing the intermediate representation:

one for the common schema
one for the specific schema

This could have been implemented more efficiently, since a single bit field could have served both schemas, but I decided to go this way for the sake of code simplicity. Future improvements could make this more efficient.

In other words, using fewer than 8 nullable fields in a schema has a serialization overhead of 1 byte in the case of TupleFiles, and of 1 or 2 bytes in the intermediate representation, depending on whether the specific schema includes nullable fields.
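
As a rough check of that arithmetic, the overhead can be computed like this. The names are hypothetical, and the sketch assumes that a schema with no nullable fields contributes no bit field at all, which is how I read the description above.

```java
// Hypothetical helper for estimating null-information overhead in bytes.
final class NullOverheadSketch {

    // One bit field of ceil(n / 7) bytes; 0 when there are no nullable fields.
    static int bitFieldBytes(int nullableFields) {
        return (nullableFields + 6) / 7;
    }

    // TupleFiles carry a single bit field for the whole schema.
    static int tupleFileOverhead(int nullableInSchema) {
        return bitFieldBytes(nullableInSchema);
    }

    // The intermediate representation carries up to two bit fields:
    // one for the common schema and one for the specific schema.
    static int intermediateOverhead(int nullableInCommon, int nullableInSpecific) {
        return bitFieldBytes(nullableInCommon) + bitFieldBytes(nullableInSpecific);
    }

    public static void main(String[] args) {
        System.out.println(tupleFileOverhead(5));       // prints 1
        System.out.println(intermediateOverhead(3, 0)); // prints 1
        System.out.println(intermediateOverhead(3, 2)); // prints 2
    }
}
```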

pereferrera commented 11 years ago

Great!

datasalt / pangool

Support for nulls in fields #18