capitalone / DataProfiler

What's in your data? Extract schema, statistics and entities from datasets

https://capitalone.github.io/DataProfiler

Apache License 2.0

1.42k stars, 158 forks

Support for PySpark #1055

Open gracemiguel opened 10 months ago

gracemiguel commented 10 months ago

Is your feature request related to a problem? Please describe.

Hello, I see that this package supports pandas, but does it support PySpark? I'd like to use it on large datasets, and pandas is insufficient for my use case.

Describe the outcome you'd like: I'd like to be able to run this on large datasets of 10k+ rows. Do you think this would be possible?

taylorfturner commented 10 months ago

It also depends on how many columns you are dealing with, but my first thought is that you should be fine with pandas at that data size, @gracemiguel. Thanks!
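
For a rough sense of how column count changes the picture at 10k rows, here is a minimal sketch that measures the in-memory size of a synthetic pandas DataFrame. It is not part of DataProfiler: the helper `frame_memory_mb` and the all-float64 synthetic data are assumptions made for illustration, and real datasets with string columns will typically use more memory.

```python
import numpy as np
import pandas as pd


def frame_memory_mb(n_rows: int, n_cols: int, seed: int = 0) -> float:
    """Build a synthetic float64 DataFrame and return its in-memory size in MiB.

    Hypothetical helper for illustration only; it is not a DataProfiler API.
    """
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(
        rng.normal(size=(n_rows, n_cols)),
        columns=[f"col_{i}" for i in range(n_cols)],
    )
    # deep=True also counts object (e.g. string) storage, if any were present.
    return df.memory_usage(deep=True).sum() / 1024**2


if __name__ == "__main__":
    # Compare a narrow, a medium, and a wide table at the 10k-row size
    # mentioned in the question.
    for n_cols in (10, 100, 1000):
        size = frame_memory_mb(10_000, n_cols)
        print(f"10,000 rows x {n_cols} columns: {size:.1f} MiB")
```

If the numbers look comfortable for your machine, the DataFrame can be handed to the profiler as shown in the project README (`dp.Profiler(data)`). If the data currently lives in a Spark DataFrame, PySpark's `toPandas()` is one way to collect it to the driver first, assuming it fits in driver memory.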

taylorfturner commented 10 months ago

@gracemiguel any additional questions on this? Any luck using it? Thanks!