datafaker-net / datafaker

Generating fake data for the JVM (Java, Kotlin, Groovy) has never been easier!
https://www.datafaker.net
Apache License 2.0

Add Spark SQL dialect #1261

Closed · gatear closed this 3 weeks ago

gatear commented 3 weeks ago

Add Spark SQL support. See the "INSERT INTO" spec: https://spark.apache.org/docs/3.2.1/sql-ref-syntax-dml-insert-into.html

There are some issues with the existing design when it comes to supporting all Spark types. Spark SQL has three complex data types: ARRAY, MAP, and STRUCT.

Insertions look like this:

```sql
-- Inserting an array
INSERT INTO MyTable (array_column) VALUES (ARRAY(1, 2, 3));

-- Inserting a map
INSERT INTO MyTable (map_column) VALUES (MAP('key1', 'value1', 'key2', 'value2'));

-- Inserting a struct
INSERT INTO MyTable (struct_column) VALUES (NAMED_STRUCT('field1', 'value1', 'field2', 'value2'));
```
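
As a rough illustration of the schema side, complex values could presumably be supplied through plain suppliers and `compositeField` from the existing transformations API. This is only a sketch: the field names are made up, and the mapping of `List`/`Map` suppliers and `compositeField` onto `ARRAY`, `MAP` and `NAMED_STRUCT` is an assumption based on the generated SQL further down, not something spelled out here.

```java
import net.datafaker.transformations.Field;
import net.datafaker.transformations.Schema;

import java.util.List;
import java.util.Map;

import static net.datafaker.transformations.Field.compositeField;
import static net.datafaker.transformations.Field.field;

public class SparkComplexTypeSchemas {
    public static void main(String[] args) {
        // Hypothetical: a List-valued supplier, assumed to render as ARRAY(1, 2, 3)
        Schema<Object, ?> arraySchema =
                Schema.of(field("array_column", () -> List.of(1, 2, 3)));

        // Hypothetical: a Map-valued supplier, assumed to render as MAP('key1', 'value1', 'key2', 'value2')
        Schema<Object, ?> mapSchema =
                Schema.of(field("map_column", () -> Map.of("key1", "value1", "key2", "value2")));

        // Hypothetical: a compositeField, assumed to render as NAMED_STRUCT('field1', 'value1', 'field2', 'value2')
        Schema<Object, ?> structSchema = Schema.of(
                compositeField("struct_column", new Field[] {
                        field("field1", () -> "value1"),
                        field("field2", () -> "value2")}));
    }
}
```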

So some notable design changes were needed.

I tested this generated SQL on the latest Databricks runtime and it works well 👍

```sql
INSERT INTO `MyTable` (`bytes`) VALUES (ARRAY(1, 0));
INSERT INTO `MyTable` (`booleans`) VALUES (ARRAY(true, false));
INSERT INTO `MyTable` (`ints`) VALUES (ARRAY(1, 2, 3));

INSERT INTO `MyTable` (`struct_array`) VALUES (NAMED_STRUCT('name1', '1', 'struct', NAMED_STRUCT('name', ARRAY(1, 2, 3))));
INSERT INTO `MyTable` (`struct_struct`) VALUES (NAMED_STRUCT('name1', '1', 'struct', NAMED_STRUCT('name', '2')));
INSERT INTO `MyTable` (`longs`) VALUES (ARRAY(23, 45));
INSERT INTO `MyTable` (`empty_map`) VALUES (MAP());
INSERT INTO `MyTable` (`maps`) VALUES (MAP('k1', MAP('k1', 'v1'), 'k2', MAP('k1', 'v1')));
```
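
For completeness, here is a minimal sketch of driving this from Java. It assumes the builder API from the current SQL transformer documentation, and that the new dialect is exposed as `SqlDialect.SPARKSQL`; the actual constant name comes from this PR and may differ, so check `SqlDialect.java` in the change.

```java
import net.datafaker.transformations.Schema;
import net.datafaker.transformations.sql.SqlDialect;
import net.datafaker.transformations.sql.SqlTransformer;

import java.util.List;

import static net.datafaker.transformations.Field.field;

public class SparkSqlInsertSketch {
    public static void main(String[] args) {
        // One collection-valued column; with the Spark dialect this should come out
        // as ARRAY(1, 2, 3) rather than a plain scalar literal.
        Schema<Object, ?> schema = Schema.of(field("ints", () -> List.of(1, 2, 3)));

        // SqlDialect.SPARKSQL is an assumed name for the constant added in this PR.
        SqlTransformer<Object> transformer = new SqlTransformer.SqlTransformerBuilder<Object>()
                .dialect(SqlDialect.SPARKSQL)
                .tableName("MyTable")
                .build();

        // Expected output, roughly: INSERT INTO `MyTable` (`ints`) VALUES (ARRAY(1, 2, 3));
        System.out.println(transformer.generate(schema, 1));
    }
}
```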
gatear commented 3 weeks ago

Thanks @snuyanzin for the review 🙏

I also have some questions ...

codecov-commenter commented 3 weeks ago

Codecov Report

Attention: Patch coverage is 95.23810% with 2 lines in your changes missing coverage. Please review.

Project coverage is 91.91%. Comparing base (b37c566) to head (57e83c6). Report is 166 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| .../net/datafaker/transformations/sql/SqlDialect.java | 84.61% | 2 Missing :warning: |


Additional details and impacted files

```diff
@@             Coverage Diff              @@
##               main    #1261      +/-   ##
============================================
- Coverage     92.35%   91.91%    -0.44%
- Complexity     2821     3079      +258
============================================
  Files           292      310       +18
  Lines          5609     6026      +417
  Branches        599      631       +32
============================================
+ Hits           5180     5539      +359
- Misses          275      325       +50
- Partials        154      162        +8
```


snuyanzin commented 3 weeks ago

Thanks for addressing the feedback and for the valuable contribution!

snuyanzin commented 3 weeks ago

> In order to port this change to version 1.x, do I also need to open a PR against the 1.x branch?

Yep, a backport PR is needed for that.

> I would also like to invest in some documentation. Where would you suggest is the best place to add it?

Depending on how large a documentation update you want to make, would it be OK to add it as a subsection of the SQL transformation docs? https://github.com/datafaker-net/datafaker/blame/f3fa54a4495fd8923e790701fab0570df8bd75b7/docs/documentation/schemas.md#L163

Or under a `### Advanced SQL types` heading at https://github.com/datafaker-net/datafaker/blame/981eaa714266b9246c127f3b84855270a3a00591/docs/documentation/schemas.md#L239