capitalone / DataProfiler

What's in your data? Extract schema, statistics and entities from datasets
https://capitalone.github.io/DataProfiler
Apache License 2.0

Help with understanding DataProfiler options and performance #1098

Open carlsonp opened 6 months ago

carlsonp commented 6 months ago

This is a bit of a question and a bit of a feature request, I think.

I'm trying to understand why profiling some tables is slow. Calculating statistics seems to take a long time, even for relatively small sample sizes (fewer than 1,000,000 rows). I started looking into the profiling setup and came across the Profile Options.

I started to go through and turn off calculations I don't need. For example:

from dataprofiler import Profiler, ProfilerOptions

# Turn off the data labeler and correlation calculations, keep multiprocessing on
profile_options = ProfilerOptions()
profile_options.set({
    "structured_options.data_labeler.is_enabled": False,
    "structured_options.correlation.is_enabled": False,
    "structured_options.multiprocess.is_enabled": True
})

# `data` and `sample_rows` are defined earlier in my script
profile = Profiler(data, options=profile_options, samples_per_update=sample_rows)

Is there a way to print out ALL the profile options, including the defaults? This would help me debug and understand what is being calculated. From a feature standpoint, perhaps more of the option objects should expose friendly printing via __str__ methods?
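As a rough workaround in the meantime (this isn't a built-in DataProfiler API, just plain Python introspection), something like this prints the whole option tree, defaults included:

from dataprofiler import ProfilerOptions

def dump_options(obj, indent=0):
    # Recursively print every attribute of an options object; nested
    # option groups (e.g. structured_options.category) are indented.
    for name, value in sorted(vars(obj).items()):
        if hasattr(value, "__dict__") and vars(value):
            print(" " * indent + f"{name}:")
            dump_options(value, indent + 2)
        else:
            print(" " * indent + f"{name} = {value!r}")

dump_options(ProfilerOptions())  # or pass an already-configured profile_options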

Thanks

carlsonp commented 6 months ago

I think I've made some more progress in debugging the slowness I'm seeing. Looking at the output, some columns are being marked as categorical even though they have an extremely large number of unique values.

For example, loading a ~350MB parquet file and profiling the run with SnakeViz, I'm seeing a lot of time spent in perform_chi_squared_test_for_homogeneity. That makes sense, since it runs on the categorical columns (see the sketch after the output below for why this test gets expensive with many categories). As can be seen below, one column is marked as categorical yet has 115,207 unique values! It seems like there should be a cutoff: to me, categorical means something like fewer than 500 unique values, and anything above that really shouldn't be marked as categorical.

It took about 800 seconds to run the profile and write the JSON results to a file. The example dataset has 40 columns.

{
            "column_name": "mynumber",
            "data_type": "int",
            "categorical": true,
            "order": "random",
            "samples": "['50125', '138752', '147769', '222048', '141623']",
            "statistics": {
                "min": 1.0,
                "max": 998974.0,
                "mode": "[500.4865]",
                "median": 145109.1429,
                "sum": 220569076580.0,
                "mean": 126742.2764,
                "variance": 5985159591.3955,
                "stddev": 77363.8132,
                "skewness": -0.2028,
                "kurtosis": 0.8711,
                "histogram": {
                    "bin_edges": "[1.00000000e+00, 2.04807582e+03, ... , 996926.92418033, 998974.        ]",
                    "bin_counts": "[145575.,  19955.,  25882., ... ,   0.,   0., 229.]"
                },
                "quantiles": {
                    "0": 67537.8152,
                    "1": 145109.1429,
                    "2": 190766.3096
                },
                "median_abs_deviation": 60642.4089,
                "num_zeros": 0,
                "num_negatives": 0,
                "times": {
                    "min": 0.0033,
                    "max": 0.0026,
                    "sum": 0.0018,
                    "variance": 0.0115,
                    "skewness": 0.4415,
                    "kurtosis": 0.3778,
                    "histogram_and_quantiles": 0.3035,
                    "num_zeros": 0.0023,
                    "num_negatives": 0.0025
                },
                "unique_count": 115207,
                "unique_ratio": 0.0662,
                "categories": "['188991', '10000', '215', ... , '104165', '194483', '226577']",
                "gini_impurity": 0.9988,
                "unalikeability": 0.9988,
                "categorical_count": {
                    "188991": 22211,
                    "10000": 17795,
                    "215": 12911,
                    "95645": 12336,
                    "185": 10517,
                    "160835": 10334,
                    "208784": 10123,
                    "92093": 9906,
                    "151451": 8560,
                    "186": 8320,
                    ...
                    and on and on and on...
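For intuition on why those huge category sets are so costly: a chi-squared homogeneity test compares category counts between samples via a contingency table with one column per unique category, so the work grows with the number of distinct values. A rough sketch with scipy (not DataProfiler's exact implementation):

from collections import Counter

import numpy as np
from scipy.stats import chi2_contingency

def homogeneity_test(sample_a, sample_b):
    # Build a 2 x k contingency table, one column per unique category;
    # with 115k+ categories this table (and the test) gets large.
    counts_a, counts_b = Counter(sample_a), Counter(sample_b)
    categories = sorted(set(counts_a) | set(counts_b))
    table = np.array([[counts_a[c] for c in categories],
                      [counts_b[c] for c in categories]])
    chi2, p_value, _, _ = chi2_contingency(table)
    return chi2, p_value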

However, if I remove 5 of the columns marked as categorical that have particularly large numbers of unique values and re-run the same test with the remaining 35 columns, the runtime drops to about 400 seconds, cutting my time in half.

Is it possible to override this via the options so it doesn't hang? I can turn off chi2_homogeneity through the options. What are your thoughts on putting some kind of ceiling on when a column is considered categorical?
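To illustrate the kind of ceiling I mean, here's a hypothetical post-hoc check over the saved report (assuming the report layout shown above, where each entry in the top-level data_stats list has column_name, categorical, and statistics.unique_count):

import json

CATEGORICAL_CEILING = 500  # example cutoff, not a DataProfiler setting

with open("report.json") as f:
    report = json.load(f)

# Flag columns the profiler marked categorical despite a huge unique count
for col in report["data_stats"]:
    stats = col.get("statistics", {})
    if col.get("categorical") and stats.get("unique_count", 0) > CATEGORICAL_CEILING:
        print(f"{col['column_name']}: {stats['unique_count']} unique values, "
              "probably shouldn't be treated as categorical")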

Thank you

carlsonp commented 6 months ago

And it looks like you already have this via max_sample_size_to_check_stop_condition and stop_condition_unique_value_ratio.

profile_options.set({
    "structured_options.data_labeler.is_enabled": False,
    "unstructured_options.data_labeler.is_enabled": False,
    "structured_options.correlation.is_enabled": False,
    "structured_options.multiprocess.is_enabled": True,
    "structured_options.chi2_homogeneity.is_enabled": False,
    "structured_options.category.max_sample_size_to_check_stop_condition": 1,
    "structured_options.category.stop_condition_unique_value_ratio": 0.001,
    "structured_options.sampling_ratio": 0.3,
    "structured_options.null_replication_metrics.is_enabled": False
})

According to the code, there don't appear to be defaults set for these two options. It seems like it would be prudent to put defaults in here?
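For reference, here's roughly how I'm running it end to end with those options (a minimal sketch; the parquet path is a placeholder for my actual file):

import json

import dataprofiler as dp

data = dp.Data("my_table.parquet")  # placeholder path
profile = dp.Profiler(data, options=profile_options)

# "pretty" pre-formats long arrays (samples, histogram edges) for readability
report = profile.report(report_options={"output_format": "pretty"})
with open("report.json", "w") as f:
    f.write(json.dumps(report, indent=4))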

taylorfturner commented 6 months ago

Hey @carlsonp -- first off, thanks for the detailed notes and documentation here.