Add argument to Profiler for samples ratio

capitalone / DataProfiler

What's in your data? Extract schema, statistics and entities from datasets

Apache License 2.0


Today there seem to be two settings for adjusting the sample size: samples_per_update and min_true_samples. If I want to profile the whole file, I can load it with Pandas first and pass the row count. For example:

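import pandas as pd
from dataprofiler import Profiler

# Load the file once just to learn how many rows it has
pandas_df = pd.read_parquet("myfile.parquet")
# Sample every row by setting samples_per_update to the row count
profile = Profiler(pandas_df, samples_per_update=pandas_df.shape[0])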

It would be nice to add a flag such as samples_ratio, a value between 0 and 1 giving the fraction of the data to sample. That way you would not have to load the data twice just to learn its size: you could say "sample X percent of the rows" and the Profiler would work out the count itself.
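As a rough illustration of the requested behavior, the ratio could be converted into a row count along these lines. This is only a sketch: samples_from_ratio is a hypothetical helper, not part of DataProfiler, and the rounding and validation rules are assumptions rather than anything the library defines.

```python
import math


def samples_from_ratio(n_rows: int, samples_ratio: float) -> int:
    """Convert a sampling ratio into a row count.

    Hypothetical helper for illustration; DataProfiler does not provide it.
    """
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")
    if not 0 < samples_ratio <= 1:
        raise ValueError("samples_ratio must be in the interval (0, 1]")
    # Round up so that any non-empty dataset yields at least one sample,
    # and cap at n_rows so the result never exceeds the dataset size.
    return min(n_rows, math.ceil(n_rows * samples_ratio))
```

With the current API, the result could be passed as Profiler(pandas_df, samples_per_update=samples_from_ratio(len(pandas_df), 0.25)). A built-in samples_ratio argument would perform the same conversion internally once the row count is known.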


Add argument to Profiler for samples ratio #1094