[ ] The commit message(s) follow the contribution guidelines?
[ ] Tests for the changes have been added (for bug fixes / features)?
[ ] Docs have been added / updated (for bug fixes / features)?
Current behavior:
With distribution mode none, Spark is not asked to perform any shuffle or sort automatically. Because Spark does no work automatically, the data must be manually sorted by partition value, either within each Spark task or globally across the entire dataset. A global sort minimizes the number of output files.
Also, when no manual sort is done, the logs show sort data being spilled to disk.
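
To make the effect of the two sort options concrete, here is a toy model in plain Python. It is only a sketch: it assumes a hypothetical writer that keeps one file open per task and starts a new file whenever the partition value changes, which is a simplification and not the actual writer implementation. The partition values, row count, and task count are made up for illustration.

```python
def count_output_files(tasks):
    """Count files written when each task starts a new file on every
    change of partition value (toy assumption, not the real writer)."""
    files = 0
    for rows in tasks:
        previous = object()  # sentinel that never equals a real key
        for key in rows:
            if key != previous:
                files += 1
                previous = key
    return files


def split_into_tasks(rows, num_tasks):
    """Split rows into num_tasks contiguous chunks of equal size."""
    size = -(-len(rows) // num_tasks)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


# Hypothetical data: 400 rows cycling through 4 partition values.
keys = ["2024-01", "2024-02", "2024-03", "2024-04"]
rows = [keys[i % len(keys)] for i in range(400)]
tasks = split_into_tasks(rows, 4)

# No sort: the partition value changes on every row.
unsorted_files = count_output_files(tasks)
# Sort within each task: one file per distinct value per task.
local_sort_files = count_output_files([sorted(t) for t in tasks])
# Global sort, then split: each task sees a single partition value.
global_sort_files = count_output_files(split_into_tasks(sorted(rows), 4))

print(unsorted_files, local_sort_files, global_sort_files)  # 400 16 4
```

In this model, sorting within each task already removes most of the file churn, and the global sort gives the fewest files, at the cost of a shuffle.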
New behavior:
This change tries sortWithinPartitions(), which returns a new Dataset with each partition sorted by the given expressions (here, the partition columns), before writing to the table. This addresses the requirement quoted above: "the data must be manually sorted by partition value. The data must be sorted either within each spark task, or globally within the entire dataset".
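
A minimal PySpark sketch of this approach follows. The helper name, table name, and column names are hypothetical and not taken from this change. It also assumes the table is written through the DataFrameWriterV2 API (writeTo(...).append()), which may differ from the write path this PR actually uses.

```python
def append_sorted_within_partitions(df, table_name, partition_cols):
    """Sort each Spark partition by the table's partition columns,
    then append the result to the table.

    df             -- a pyspark.sql.DataFrame
    table_name     -- target table identifier (hypothetical example below)
    partition_cols -- names of the table's partition columns
    """
    # Local sort only: rows are ordered inside each task, with no shuffle.
    sorted_df = df.sortWithinPartitions(*partition_cols)
    # Assumption: the table is written via the DataFrameWriterV2 API.
    sorted_df.writeTo(table_name).append()
    return sorted_df


# Usage with hypothetical names:
# append_sorted_within_partitions(events_df, "catalog.db.events", ["event_date"])
```

Because sortWithinPartitions() does not shuffle, it is cheaper than a global sort, but each task can still write one file per partition value it holds.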