MrPowers / quinn

pyspark methods to enhance developer productivity 📣 👯 🎉
https://mrpowers.github.io/quinn/
Apache License 2.0
602 stars 96 forks source link

Sort struct columns #139

Closed jeffbrennan closed 8 months ago

jeffbrennan commented 9 months ago

addresses #109

Summary I modified the existing sort_columns() function to accommodate optional nested sorting based on a new boolean flag sort_nested and an initial check for nested datatypes in the top-level columns. sort_nested defaults to False so (hopefully) no users of the previous sort_columns() will be affected.

I put together 10 tests to try and handle different types of nested cases. If anyone wants to add some more tests I would be happy to update the function logic to deal with any failures. This includes testing different data values and ensuring they enter and exit the function unchanged aside from their relative position.

Remaining issues to work through

  1. Haven't incorporated handling of column metadata
  2. Haven't handled MapTypes
  3. The initial solution that I pulled from https://stackoverflow.com/questions/57821538/how-to-sort-columns-of-nested-structs-alphabetically-in-pyspark didn't transfer nullability information for structs and arrays. I added a workaround to this that requires traversing the old and new schemas, and then outputs a new dataframe using
    final_df = output.sparkSession.createDataFrame(output.rdd, output.schema)

    I don't think this is ideal and might have some performance impacts on larger datasets.

Happy to answer any questions! This was a fun one to figure out.

MrPowers commented 9 months ago

@SemyonSinchenko - can you please take a look when you have a sec?

jeffbrennan commented 9 months ago

Thanks for the feedback! I can work on the refactor this Sunday.

jeffbrennan commented 9 months ago

@SemyonSinchenko, I updated the function based on your comments. Let me know if it is easier to read and if you have any other questions

SemyonSinchenko commented 9 months ago

@jeffbrennan Nice work! Thanks a lot for such a contribution! May you please carefully resolve all the merge conflicts by merging main branch to your fork? @MrPowers I suggest to merge it.

jeffbrennan commented 8 months ago

@SemyonSinchenko @MrPowers, I reviewed the merge conflicts and the changes look good to me. Could someone do a quick sanity check before I merge into main?