apache / drill

Apache Drill is a distributed MPP query layer for self describing data
https://drill.apache.org/
Apache License 2.0
1.93k stars 979 forks source link

DRILL-8501: Json Conversion UDF Not Respecting System JSON Options #2921

Closed cgivre closed 2 months ago

cgivre commented 3 months ago

DRILL-8501: Json Conversion UDF Not Respecting System JSON Options

Description

The convert_fromJSON() function was ignoring Drill system configuration variables for reading JSON. This PR adds support for allTextMode and readNumbersAsDouble to this function. Once merged, the convert_fromJSON() function will follow the system settings.

I also split one of the unit test files because it had all the UDF tests mixed with NaN tests.

Documentation

This PR adds the possibility of setting allTextMode and readNumbersAsDouble in the function call itself. The new usage is:

SELECT convert_fromJSON(<field>, <allTextMode>, <readNumbersAsDouble>)

Thus you could write a query:

SELECT convert_fromJSON(jsonField, false, true) FROM dfs.somedata

Testing

Added unit tests.

jnturton commented 3 months ago

Before I approve, did you consider making these JSON parsing settings parameters of the function itself? It feels odd to me that store.json.* settings could influence UDFs too. I'm not sure why they aren't storage plugin config, rather than global config, in the first place...

cgivre commented 3 months ago

Before I approve, did you consider making these JSON parsing settings parameters of the function itself? It feels odd to me that store.json.* settings could influence UDFs too. I'm not sure why they aren't storage plugin config, rather than global config, in the first place...

@jnturton I thought about doing exactly what you're describing. Here's the thing. We started some work a while ago to get rid of all the non-EVF2 readers in Drill. It turns out that there are a few places which still use the old non-EVF JSON reader. Specifically, this UDF, the Druid Storage Plugin and the MongoDB storage plugin. I started work on Drill-8316 which addresses the Druid plugin and Drill-8329 addresses converting the UDF. Neither one of these were a high priority so they're kind of sitting at the moment.

I agree with your premise that the whole idea of having global settings for file formats (including parquet) is not the best idea.

Give me a day or two and I can add additional params to the UDF so that you can just change those settings in the query.

cgivre commented 2 months ago

@jnturton I added new versions of the UDF so that the user can specify in the function call whether they want allTextMode and the other option.

jnturton commented 2 months ago

Oh that's great, thanks for the enhancements +1.