MrPowers / quinn

pyspark methods to enhance developer productivity 📣 👯 🎉
https://mrpowers.github.io/quinn/
Apache License 2.0

Add utils for working with Spark Plan #159

Closed SemyonSinchenko closed 5 months ago

SemyonSinchenko commented 8 months ago

Two new functions:

The function that returns the plan works like this: [image]

The difference from df.explain is that our function returns a string that can be parsed. It is a small function, but it can be used, for example, to generate a data lineage graph (when we are trying to get dependencies at the level of each column).
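To illustrate the parsing use case, here is a minimal sketch that pulls column references out of a plan string, one plan node at a time. The sample plan text and the helper name `attributes_per_node` are hypothetical and are not part of this PR; the sketch only relies on Spark printing attributes as `name#id`, and real plan output differs between Spark versions.

```python
import re

# Illustrative plan text only (an assumption); real Spark output varies by version.
SAMPLE_PLAN = """\
Project [id#0L, upper(name#1) AS name_upper#5]
+- Filter (age#2 > 18)
   +- Relation [id#0L,name#1,age#2] parquet
"""

# Attributes are printed as name#exprId, optionally followed by an "L" type suffix.
ATTRIBUTE = re.compile(r"\b([A-Za-z_]\w*)#(\d+)L?")


def attributes_per_node(plan: str) -> list:
    """Return a list of (operator name, referenced attributes), one per plan line."""
    nodes = []
    for line in plan.splitlines():
        # Drop the tree-drawing prefix such as "+- " or ":  ".
        stripped = line.lstrip(" :+-")
        if not stripped:
            continue
        operator = stripped.split(" ", 1)[0]
        attrs = [f"{name}#{expr_id}" for name, expr_id in ATTRIBUTE.findall(stripped)]
        nodes.append((operator, attrs))
    return nodes


for operator, attrs in attributes_per_node(SAMPLE_PLAN):
    print(operator, attrs)
```

Matching the `name#id` pairs between a node and its child is what would give column-level dependencies; a real lineage tool would need a more robust parser than this regex.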

The function that estimates the size in bytes works like this: [image]

This functionality is really tricky, and I do not know another way to estimate the size. It matters, for example, when we need to estimate the number of resulting partitions, or when we want to understand where we can apply broadcast hints, etc.
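A rough sketch of how such an estimate could feed those two decisions. The function names and default values here are assumptions for illustration, not part of this PR: 128 MiB is used as a target partition size and 10 MiB as a broadcast threshold, and both should be adjusted to the cluster's actual configuration.

```python
import math

MIB = 1024 * 1024


def suggest_partitions(size_in_bytes: int, target_partition_bytes: int = 128 * MIB) -> int:
    """Number of partitions needed so each holds roughly target_partition_bytes."""
    return max(1, math.ceil(size_in_bytes / target_partition_bytes))


def fits_broadcast(size_in_bytes: int, threshold_bytes: int = 10 * MIB) -> bool:
    """True when the estimated size is small enough to consider a broadcast hint."""
    return 0 <= size_in_bytes <= threshold_bytes


estimated = 1024 * MIB  # pretend the estimate came back as 1 GiB
print(suggest_partitions(estimated))
print(fits_broadcast(estimated))
```

The estimate comes from the optimizer's statistics, so results like these are a starting point for tuning rather than a guarantee.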

Because this is a completely new API, any feedback is welcome!

MrPowers commented 8 months ago

This is cool. I think we should add these APIs as "experimental". From what I've seen, these plans change arbitrarily over time. This code will likely break as time goes on. I don't think that's an issue if we have the experimental annotation in the docs.

I'm not sure if estimate_size_of_df should return -1 or None if the result is unknown. That's a TBD.
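One way to weigh the two options is to look at what happens when a caller forgets to check for the "unknown" case. The sketch below uses hypothetical helpers that add up size estimates for several DataFrames; it is an illustration of the trade-off, not a decision for this PR.

```python
from typing import Iterable, Optional


def total_size_sentinel(sizes: Iterable[int]) -> int:
    """-1 means unknown. Without an explicit check, the sum is silently wrong."""
    return sum(sizes)


def total_size_optional(sizes: Iterable[Optional[int]]) -> Optional[int]:
    """None means unknown. One unknown input makes the whole total unknown."""
    total = 0
    for size in sizes:
        if size is None:
            return None
        total += size
    return total


print(total_size_sentinel([100, -1, 200]))
print(total_size_optional([100, None, 200]))
```

With -1 the mistake passes unnoticed as a plausible number, while None either propagates or raises a TypeError as soon as it is used in arithmetic.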

Looks like we need a humanize_bytes function here too: https://github.com/MrPowers/mack#humanize-bytes
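For reference, a minimal sketch of what such a helper could look like. This is not mack's implementation; it assumes decimal units (1 KB = 1000 bytes) and two decimal places, both of which are choices made only for this example.

```python
def humanize_bytes(num_bytes: int) -> str:
    """Format a byte count as a short human-readable string (decimal units)."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value is below 1000, or at the largest unit.
        if abs(value) < 1000 or unit == units[-1]:
            if unit == "B":
                return f"{num_bytes} B"
            return f"{value:.2f} {unit}"
        value /= 1000
    return f"{num_bytes} B"  # not reached; keeps the return type explicit


print(humanize_bytes(1234567890))
```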

Cool work!!!

SemyonSinchenko commented 6 months ago

@MrPowers Kind reminder

SemyonSinchenko commented 5 months ago

@MrPowers Should we close it without merging?

SemyonSinchenko commented 5 months ago

Closed because the API is very unstable