Closed raphaelhoffmann closed 8 years ago
We are using deepdive sql eval ... format=json
to export data from greenplum to python. For integer and floating point columns, however, this export only worked for fields that did not contain NULL values. With this fix, NULL values get translated to python's None
value.
@raphaelhoffmann there's a check for the null character \\N
down below in main
, before convert
is called. \\N
is the default for the null character. Are we seeing it dump NULL
instead?
@shahin The issue is that in my case the null values appear inside an array, so they are not captured by the check in main
. Also, when in an array, gp seems to write nulls differently. See here:
$ psql -c "COPY (select null::int) to stdout"
\N
$ psql -c "COPY (select '{1,null}'::int[]) to stdout"
{1,NULL}
@raphaelhoffmann I see, makes sense. I can see this implementation causing unexpected results, since one might assume that \\N
represents null for integers when reading main
but in fact \\N
or NULL
represent null. Sounds like this only applies inside arrays. Could we isolate this change inside the array-handling code?
One approach might be to compose the existing conversion around line 51 with your NULL
check in a lambda:
convert_or_null = lambda x: None if x == 'NULL' else convert(x)
return map(convert_or_null, arr)
What do you think?
@shahin That looks good, thanks! I made the change. The null values are also correctly handled for timestamps and booleans now. I see that text is handled separately, but I find the following example shocking:
$ psql -c "COPY (select '{"NULL",null}'::text[]) TO STDOUT"
{NULL,NULL}
Do you know if we're able to distinguish between "NULL" and null?
@raphaelhoff wow, that is very surprising. I would not have expected it, but I think the issue there is the ::text[]
type cast. Without it we get reasonable results:
$ psql postgres -c "copy (select ARRAY['NULL',null]) to stdout"
{"NULL",NULL}
Apparently ::<type>[]
goes through every element and converts it to <type>
. They call null
a "generic constant", so I guess when you convert this generic constant to text
it's no longer a null value. Very weird.
Not sure if it helps but ddlib.util has gone through a similar pain https://github.com/HazyResearch/deepdive/pull/498
Given the complexity of the codec I think we should just ditch PG TSV and switch our wire format completely to JSON at least for non primitive values
Thanks @netj. Totally agree that we should switch to JSON eventually; for the short term, let's keep the PG TSV solution alive.
LGTM. PG TSV won't be dropped anytime soon.
…dds GP data types for floats