capitalone / DataProfiler

What's in your data? Extract schema, statistics and entities from datasets
https://capitalone.github.io/DataProfiler
Apache License 2.0

Add argument to Profiler for samples ratio #1094

Open carlsonp opened 8 months ago

carlsonp commented 8 months ago

Today, there seem to be two settings for adjusting the sample size: samples_per_update and min_true_samples. If I want to profile the whole file, I can load it via Pandas and pass the number of rows. For example:

import pandas as pd
from dataprofiler import Profiler
pandas_df = pd.read_parquet("myfile.parquet")
profile = Profiler(pandas_df, samples_per_update=pandas_df.shape[0])

I was thinking it would be nice to add a flag like samples_ratio, a value between 0 and 1 giving the fraction of the data that you want to sample. That way you wouldn't have to essentially load the data twice: you could just say you want a given fraction loaded as samples and it would go from there.
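To make the idea concrete, here is a rough sketch of the conversion the profiler could do internally once it knows the row count. Both samples_from_ratio and samples_ratio are hypothetical names used for illustration; neither is part of DataProfiler's API today.

```python
import math


def samples_from_ratio(n_rows: int, samples_ratio: float) -> int:
    """Turn a hypothetical samples_ratio in (0, 1] into a row count."""
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")
    if not 0 < samples_ratio <= 1:
        raise ValueError("samples_ratio must be in the interval (0, 1]")
    # Round up so any non-empty dataset yields at least one sample,
    # and cap at n_rows in case of floating-point overshoot.
    return min(n_rows, math.ceil(n_rows * samples_ratio))
```

The returned count is the kind of value that gets passed as samples_per_update in the snippet above; with a ratio of 1 it is simply the full row count.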

taylorfturner commented 8 months ago

Hey @carlsonp! Thanks for opening the issue and the idea presented.

This makes a ton of sense and I think fits perfectly as a feature into the DataReaders class (as documented here).

There are two features, although not percentages, that exist for CSV and Parquet:

I think something like percentage sampling would be a nice addition to the readers: read in a sample as desired and pass the pre-sampled data to the profiler.
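As a rough illustration of what reader-side sampling could look like, the sketch below keeps each CSV row with probability equal to the ratio while streaming the file once. It uses only the Python standard library and is not DataProfiler code; sample_rows and read_csv_sampled are hypothetical helper names, and the real readers may be structured quite differently.

```python
import csv
import io
import random
from typing import IO, Iterable, Iterator, List, Optional, Tuple


def sample_rows(
    rows: Iterable[List[str]], ratio: float, seed: Optional[int] = None
) -> Iterator[List[str]]:
    """Yield each row independently with probability `ratio` (single pass)."""
    if not 0 <= ratio <= 1:
        raise ValueError("ratio must be between 0 and 1")
    rng = random.Random(seed)  # a fixed seed gives a reproducible sample
    for row in rows:
        # rng.random() is in [0, 1), so ratio=1 keeps every row
        # and ratio=0 keeps none.
        if rng.random() < ratio:
            yield row


def read_csv_sampled(
    handle: IO[str], ratio: float, seed: Optional[int] = None
) -> Tuple[List[str], List[List[str]]]:
    """Read a CSV stream, always keeping the header and sampling data rows."""
    reader = csv.reader(handle)
    header = next(reader, None)
    if header is None:  # empty input: no header, no rows
        return [], []
    return header, list(sample_rows(reader, ratio, seed))


if __name__ == "__main__":
    text = "a,b\n" + "".join(f"{i},{i * 2}\n" for i in range(100))
    header, rows = read_csv_sampled(io.StringIO(text), ratio=0.3, seed=7)
    print(header, len(rows))
```

Because each row is kept independently, the sample size is only approximately ratio times the row count. The upside is that the file never needs to be read twice or held fully in memory before sampling.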