Feature/dataframe generator

goncaloccastro commented 2 years ago

First, I just want to say how good this library is. Where I work, we have been using it for a few years to test our PySpark code.

This feature gives users more versatility when testing: it generates DataFrames according to a schema and then runs properties over those DataFrames to check that your functions behave as you expect.
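To illustrate the idea, here is a minimal sketch using only the standard library. It is not the code in this pull request: the schema format, `generate_rows`, `check_property`, and the function under test are all hypothetical names, and plain tuples stand in for DataFrame rows.

```python
import random

# Hypothetical stand-in for a Spark schema: (column name, type name) pairs.
SCHEMA = [("id", "int"), ("name", "string"), ("score", "double")]

# Illustrative default generators, one per type name (an assumption, not chispa API).
DEFAULT_GENERATORS = {
    "int": lambda rng: rng.randint(-1000, 1000),
    "string": lambda rng: "".join(rng.choice("abcdefghij") for _ in range(8)),
    "double": lambda rng: rng.uniform(-1e3, 1e3),
}


def generate_rows(schema, n, seed=0):
    """Return n tuples whose fields follow the (name, type) schema."""
    rng = random.Random(seed)
    return [
        tuple(DEFAULT_GENERATORS[type_name](rng) for _, type_name in schema)
        for _ in range(n)
    ]


def upper_names(rows):
    """Function under test: upper-case the name column (index 1)."""
    return [(row_id, name.upper(), score) for row_id, name, score in rows]


def check_property(prop, schema, runs=20, rows_per_run=10):
    """Evaluate prop on `runs` datasets, each generated from a different seed."""
    return all(
        prop(generate_rows(schema, rows_per_run, seed=s)) for s in range(runs)
    )


def preserves_shape(rows):
    """Property: the transformation keeps the row count and column count."""
    out = upper_names(rows)
    return len(out) == len(rows) and all(len(r) == len(SCHEMA) for r in out)


result = check_property(preserves_shape, SCHEMA)
```

With PySpark, the generated tuples could presumably be passed to `spark.createDataFrame` together with a real schema, and the property would then assert on the resulting DataFrame instead of on a list.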

This is very much based on the wonderful spark-testing-base Scala library from Holden Karau (https://github.com/holdenk/spark-testing-base), more specifically the DataFrameGenerator class (https://github.com/holdenk/spark-testing-base/wiki/DataFrameGenerator).

I think of it as a first step: it’s not perfect, but it can be a good starting point.

The major change is bumping the minimum Python version to 3.7, so that dataclasses (https://docs.python.org/3.7/library/dataclasses.html) can be used, along with Faker (https://github.com/joke2k/faker), which depends on Python 3.6. The currently defined minimum version, 3.5, is already 7 years old, so perhaps a bump is not such a bad idea. Faker is used to populate the DataFrames with fake data. There are defaults for many Spark data types, but users can customise the data for their DataFrames as much as they want.
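As a rough sketch of "defaults per data type, with per-column overrides", the example below uses a dataclass and the standard `random` module in place of Faker. `ColumnSpec`, `DEFAULTS`, and the type names are hypothetical and do not reflect the actual classes in this pull request.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, Optional

# Seeded generator so the sketch is reproducible.
_rng = random.Random(42)

# Illustrative defaults keyed by type name (an assumption for this sketch).
DEFAULTS = {
    "string": lambda: "".join(_rng.choice("abcdef") for _ in range(6)),
    "int": lambda: _rng.randint(0, 100),
    "boolean": lambda: _rng.random() < 0.5,
}


@dataclass(frozen=True)
class ColumnSpec:
    """Describes one column; `generator` overrides the per-type default."""

    name: str
    type_name: str
    generator: Optional[Callable[[], Any]] = None

    def generate(self) -> Any:
        # Fall back to the default for this type when no override is given.
        gen = self.generator or DEFAULTS[self.type_name]
        return gen()


columns = [
    ColumnSpec("id", "int"),
    ColumnSpec("active", "boolean"),
    # Custom data for one column; a Faker provider method could be supplied
    # here instead of this stand-in lambda.
    ColumnSpec("country", "string", generator=lambda: _rng.choice(["PT", "US", "BR"])),
]

rows = [tuple(col.generate() for col in columns) for _ in range(5)]
```

The override is just a zero-argument callable, so swapping the stand-in for a Faker or mimesis provider would not change the rest of the structure.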

Any feedback is appreciated. Thank you.

MrPowers commented 2 years ago

@goncaloccastro - thanks for opening this issue. I actually created a separate project called farsante to create DataFrames with fake data. It depends on mimesis which is apparently faster than faker. I'm reluctant to add the faker dependency to this project because I like to make these utility-type libraries dependency free. Let me know your thoughts!

goncaloccastro commented 2 years ago

@MrPowers

First of all, let me say I’m really glad to have your comment on this. As I said before, I’m a big fan of chispa and your work.

Apologies, I had no idea about that other project; it looks quite interesting. Indeed, from what I have read, mimesis is faster than Faker, sometimes by a lot. I chose Faker because it seemed much easier to use when calling providers dynamically, and because of the number of providers available.

Honestly, I share the same thought: I’m not a big fan of weighing down projects with extra dependencies. My reasoning was that chispa is meant for testing with Spark, and property-based testing is, at least for me, a big part of the testing I have done in all my jobs working with Spark. I added the dependency because generating random data with plain Python, while doable, wouldn’t be as versatile as using mimesis or Faker.

Reading your comment, do you think this endeavour would be better suited to something standalone: a small application that generates fake Spark DataFrames according to a Spark schema, for the sole purpose of property-based testing?

Thank you very much for your time.

MrPowers commented 2 years ago

Yeah, I think you should put this code in a standalone project, and then we can add a section to the chispa README telling people how to use your library for property-based testing. That'll give a great experience to folks who want to do property-based testing, while letting other users avoid the faker dependency. Does that sound like a good plan?

goncaloccastro commented 2 years ago

Yes, that sounds like a good plan indeed. Thank you again for your input; it’s really appreciated. I’ll close this pull request and open a new one with only the README changes once I’m done with the application.

MrPowers commented 2 years ago

@goncaloccastro - awesome, thank you, looking forward to seeing your project ;)

MrPowers / chispa

Feature/dataframe generator #42