Open carlsonp opened 9 months ago
Hey @carlsonp! Thanks for opening the issue and the idea presented.
This makes a ton of sense and I think it fits perfectly as a feature in the DataReaders class (as documented here).
There are two features, although not percentages, that exist for CSV and Parquet:
I think something like percentage sampling would be a nice addition to the readers: read the data in sampled as desired, then pass the pre-sampled data to the profiler.
Today, there seem to be two settings for adjusting the sample size: `samples_per_update` and `min_true_samples`. I can load in my file via Pandas and get the number of rows if I want to profile the whole thing. For example:

I was just thinking it would be nice to add an additional flag like `samples_ratio`, a value between 0 and 1 denoting the fraction of the data that you want to sample. This would mean you wouldn't have to essentially load the data in twice; you could just say you want X percent loaded in as samples and it would go from there.
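To make the idea concrete, here is a rough sketch of ratio-based sampling at read time using only pandas. The helper name `read_csv_sampled` is hypothetical and not part of DataProfiler or its readers; `samples_ratio` is just the flag name proposed above. It relies on `pandas.read_csv` accepting a callable for `skiprows`, so the file is only read once and the result is an approximate (per-row random) sample rather than an exact count.

```python
import io
import random

import pandas as pd


def read_csv_sampled(source, samples_ratio, seed=None):
    """Hypothetical helper: read roughly `samples_ratio` of the data rows in one pass."""
    if not 0 < samples_ratio <= 1:
        raise ValueError("samples_ratio must be in (0, 1]")
    rng = random.Random(seed)
    # Row 0 is the header and is always kept; every other row is skipped
    # with probability 1 - samples_ratio.
    return pd.read_csv(
        source,
        skiprows=lambda i: i > 0 and rng.random() >= samples_ratio,
    )


# Small in-memory CSV standing in for a real file path.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))
sample = read_csv_sampled(io.StringIO(csv_text), samples_ratio=0.1, seed=0)
print(len(sample), "rows sampled out of 1000")
```

The resulting DataFrame could then be handed to the profiler as pre-sampled data, which is roughly the flow suggested in the reply above. Because each row is kept independently, the sampled row count varies around `samples_ratio` times the total rather than matching it exactly.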