HazyResearch / deepdive

DeepDive
deepdive.stanford.edu
1.96k stars 539 forks source link

Adds support for null values in int/float columns in pgtsv_to_json; a… #550

Closed raphaelhoffmann closed 8 years ago

raphaelhoffmann commented 8 years ago

…dds GP data types for floats

raphaelhoffmann commented 8 years ago

We are using deepdive sql eval ... format=json to export data from greenplum to python. For integer and floating point columns, however, this export only worked for fields that did not contain NULL values. With this fix, NULL values get translated to python's None value.

alldefector commented 8 years ago

see also https://github.com/HazyResearch/deepdive/pull/549

shahin commented 8 years ago

@raphaelhoffmann there's a check for the null character \\N down below in main, before convert is called. \\N is the default for the null character. Are we seeing it dump NULL instead?

raphaelhoff commented 8 years ago

@shahin The issue is that in my case the null values appear inside an array, so they are not captured by the check in main. Also, when in an array, gp seems to write nulls differently. See here:

$ psql -c "COPY (select null::int) to stdout"
\N

$ psql -c "COPY (select '{1,null}'::int[]) to stdout"
{1,NULL}
shahin commented 8 years ago

@raphaelhoffmann I see, makes sense. I can see this implementation causing unexpected results, since one might assume that \\N represents null for integers when reading main but in fact \\N or NULL represent null. Sounds like this only applies inside arrays. Could we isolate this change inside the array-handling code?

One approach might be to compose the existing conversion around line 51 with your NULL check in a lambda:

convert_or_null = lambda x: None if x == 'NULL' else convert(x)
return map(convert_or_null, arr)

What do you think?

raphaelhoff commented 8 years ago

@shahin That looks good, thanks! I made the change. The null values are also correctly handled for timestamps and booleans now. I see that text is handled separately, but I find the following example shocking:

$ psql -c "COPY (select '{"NULL",null}'::text[]) TO STDOUT"
{NULL,NULL}

Do you know if we're able to distinguish between "NULL" and null?

shahin commented 8 years ago

@raphaelhoff wow, that is very surprising. I would not have expected it, but I think the issue there is the ::text[] type cast. Without it we get reasonable results:

$ psql postgres -c "copy (select ARRAY['NULL',null]) to stdout"
{"NULL",NULL}

Apparently ::<type>[] goes through every element and converts it to <type>. They call null a "generic constant", so I guess when you convert this generic constant to text it's no longer a null value. Very weird.

netj commented 8 years ago

Not sure if it helps but ddlib.util has gone through a similar pain https://github.com/HazyResearch/deepdive/pull/498

Given the complexity of the codec I think we should just ditch PG TSV and switch our wire format completely to JSON at least for non primitive values

raphaelhoff commented 8 years ago

Thanks @netj. Totally agree that we should switch to JSON eventually; for the short term, let's keep the PG TSV solution alive.

netj commented 8 years ago

LGTM. PG TSV won't be dropped anytime soon.