embulk / embulk-base-restclient

Base class library for Embulk plugins to access RESTful services
https://www.embulk.org/
Apache License 2.0
6 stars 7 forks source link

Fix: Null values are not being handled properly for numeric types and booleans #102

Closed kfitzgerald closed 7 years ago

kfitzgerald commented 7 years ago

This PR changes two things and adds some supporting tests.

Firstly:

I believe this to be the reason that exporting data through classes based off this module, null numeric values are being encoded either as 0 or seemingly random numbers (leaked memory from other parts of the encoded buffer).

Secondly:

In my opinion, when a column value is null, then it should be encoded in JSON as null. When you think about it, each column has a given name and a known data type. This means, if the column lacks a value, then the key is defined, but the value is null. As it currently stands, string, timestamp and JSON types are omitted if null, meaning their values are being encoded as undefined. Simply put, null is not the same as undefined.

While this latter change I believe is technically correct, it may affect other modules based upon this one, for better or for worse.

Please let me know if there is anything else I can help with regarding this issue.

Thanks, -Kevin

kfitzgerald commented 7 years ago

For a bit more background here, I ran into this issue when attempting to output records to elasticsearch (embulk-output-elasticsearch relies on this module)

When a row contains null numeric values, two things could occur:

Naturally, elasticsearch will reject records with absurdly large numbers with exceptions such as: org.elasticsearch.index.mapper.MapperParsingException: failed to parse [column_name]. So in my case, I lost about 25% of the records I was working with, and the other 75% contained bad or missing values.

I would expect null-valued columns to be stored as null values in elasticsearch, not defaulted to 0 or omitted entirely (which is the equivalent of being undefined)

dmikurube commented 7 years ago

@kfitzgerald Thank you for your contribution! Sorry to be late to respond.

Agreed for your first one. Other types were fixed in #87, but Boolean, Long, and Double still remained.

But for the second one, it's not trivial. It depends on the plugin's context. It may be nice for Elasticsearch, but can be bad for other outputs. I'm afraid it can be incompatible from previous versions...

What do you think making it an option of JacksonAllInObjectScope? It means that adding new constructor(s) like JacksonAllInObjectScope(boolean fillsJsonNullForEmbulkNull), and setting this fillsJsonNullForEmbulkNull as false in default (existing) constructors.

kfitzgerald commented 7 years ago

@dmikurube Yeah I thought about making it configurable, but hadn't explored the best way of doing it. I'm always a fan of backwards compatible updates.

I'll take a stab at this and keep you posted.

kfitzgerald commented 7 years ago

@dmikurube Ok – This should be good to go if you'd like to review.

Also, I have changes staged in the embulk-output-elasticsearch plugin that will get a PR once this module has been bumped and deployed.

Thanks again, -Kevin

muga commented 7 years ago

@kfitzgerald The change looks good to me. I did minor change your PR on https://github.com/kfitzgerald/embulk-base-restclient/pull/1. Please take a look?

kfitzgerald commented 7 years ago

LGTM – merged!