capitalone / DataProfiler

What's in your data? Extract schema, statistics and entities from datasets
https://capitalone.github.io/DataProfiler
Apache License 2.0

Help with understanding DataProfiler options and performance #1098

Open carlsonp opened 6 months ago

carlsonp commented 6 months ago

This is a bit of a question and a bit of a feature request, I think.

I'm trying to understand why profiling some tables is slow. Calculating statistics seems to take a long time, even for relatively small sample sizes (fewer than 1,000,000 rows). I started looking into the profiling setup and came across the Profile Options.

I started to go through and turn off calculations I don't need. For example:

from dataprofiler import Profiler, ProfilerOptions

# Turn off the data labeler and correlation calculations, keep multiprocessing on
profile_options = ProfilerOptions()
profile_options.set({
    "structured_options.data_labeler.is_enabled": False,
    "structured_options.correlation.is_enabled": False,
    "structured_options.multiprocess.is_enabled": True
})

# `data` and `sample_rows` are defined earlier in my script
profile = Profiler(data, options=profile_options, samples_per_update=sample_rows)

Is there a way to print out ALL the profile options, including the defaults? This would help me debug and understand what is being calculated. From a feature standpoint, perhaps more of the option objects should expose friendly printing via __str__ methods?
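As a rough workaround in the meantime (this isn't a built-in DataProfiler API, just plain Python introspection), something like this prints the whole option tree, defaults included:

from dataprofiler import ProfilerOptions

def dump_options(obj, indent=0):
    # Recursively print every attribute of an options object; nested
    # option groups (e.g. structured_options.category) are indented.
    for name, value in sorted(vars(obj).items()):
        if hasattr(value, "__dict__") and vars(value):
            print(" " * indent + f"{name}:")
            dump_options(value, indent + 2)
        else:
            print(" " * indent + f"{name} = {value!r}")

dump_options(ProfilerOptions())  # or pass an already-configured profile_options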

Thanks

carlsonp commented 6 months ago

I think I've made some more progress in debugging the slowness I'm seeing. Looking at the output, some columns are being marked as categorical even though they have an extremely large number of unique values.

For example, loading a ~350MB parquet file and profiling the run with SnakeViz, I'm seeing a lot of time spent in perform_chi_squared_test_for_homogeneity. That makes sense, since it runs on the categorical columns (see the sketch after the output below for why this test gets expensive with many categories). As can be seen below, one column is marked as categorical yet has 115,207 unique values! It seems like there should be a cutoff: to me, categorical means something like fewer than 500 unique values, and anything above that really shouldn't be marked as categorical.

It took about 800 seconds to run the profile and write the JSON results to a file. The example dataset has 40 columns.

{
            "column_name": "mynumber",
            "data_type": "int",
            "categorical": true,
            "order": "random",
            "samples": "['50125', '138752', '147769', '222048', '141623']",
            "statistics": {
                "min": 1.0,
                "max": 998974.0,
                "mode": "[500.4865]",
                "median": 145109.1429,
                "sum": 220569076580.0,
                "mean": 126742.2764,
                "variance": 5985159591.3955,
                "stddev": 77363.8132,
                "skewness": -0.2028,
                "kurtosis": 0.8711,
                "histogram": {
                    "bin_edges": "[1.00000000e+00, 2.04807582e+03, ... , 996926.92418033, 998974.        ]",
                    "bin_counts": "[145575.,  19955.,  25882., ... ,   0.,   0., 229.]"
                },
                "quantiles": {
                    "0": 67537.8152,
                    "1": 145109.1429,
                    "2": 190766.3096
                },
                "median_abs_deviation": 60642.4089,
                "num_zeros": 0,
                "num_negatives": 0,
                "times": {
                    "min": 0.0033,
                    "max": 0.0026,
                    "sum": 0.0018,
                    "variance": 0.0115,
                    "skewness": 0.4415,
                    "kurtosis": 0.3778,
                    "histogram_and_quantiles": 0.3035,
                    "num_zeros": 0.0023,
                    "num_negatives": 0.0025
                },
                "unique_count": 115207,
                "unique_ratio": 0.0662,
                "categories": "['188991', '10000', '215', ... , '104165', '194483', '226577']",
                "gini_impurity": 0.9988,
                "unalikeability": 0.9988,
                "categorical_count": {
                    "188991": 22211,
                    "10000": 17795,
                    "215": 12911,
                    "95645": 12336,
                    "185": 10517,
                    "160835": 10334,
                    "208784": 10123,
                    "92093": 9906,
                    "151451": 8560,
                    "186": 8320,
                    ...
                    and on and on and on...
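For intuition on why those huge category sets are so costly: a chi-squared homogeneity test compares category counts between samples via a contingency table with one column per unique category, so the work grows with the number of distinct values. A rough sketch with scipy (not DataProfiler's exact implementation):

from collections import Counter

import numpy as np
from scipy.stats import chi2_contingency

def homogeneity_test(sample_a, sample_b):
    # Build a 2 x k contingency table, one column per unique category;
    # with 115k+ categories this table (and the test) gets large.
    counts_a, counts_b = Counter(sample_a), Counter(sample_b)
    categories = sorted(set(counts_a) | set(counts_b))
    table = np.array([[counts_a[c] for c in categories],
                      [counts_b[c] for c in categories]])
    chi2, p_value, _, _ = chi2_contingency(table)
    return chi2, p_value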

However, if I remove 5 of the columns marked as categorical that have particularly large numbers of unique values and re-run the same test with the remaining 35 columns, the runtime drops to about 400 seconds, cutting my time in half.

Is it possible to override this via the options so it doesn't hang? I can turn off chi2_homogeneity through the options. What are your thoughts on putting some kind of ceiling on when a column is considered categorical?
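To illustrate the kind of ceiling I mean, here's a hypothetical post-hoc check over the saved report (assuming the report layout shown above, where each entry in the top-level data_stats list has column_name, categorical, and statistics.unique_count):

import json

CATEGORICAL_CEILING = 500  # example cutoff, not a DataProfiler setting

with open("report.json") as f:
    report = json.load(f)

# Flag columns the profiler marked categorical despite a huge unique count
for col in report["data_stats"]:
    stats = col.get("statistics", {})
    if col.get("categorical") and stats.get("unique_count", 0) > CATEGORICAL_CEILING:
        print(f"{col['column_name']}: {stats['unique_count']} unique values, "
              "probably shouldn't be treated as categorical")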

Thank you

carlsonp commented 6 months ago

And it looks like you already have this via max_sample_size_to_check_stop_condition and stop_condition_unique_value_ratio.

profile_options.set({
    "structured_options.data_labeler.is_enabled": False,
    "unstructured_options.data_labeler.is_enabled": False,
    "structured_options.correlation.is_enabled": False,
    "structured_options.multiprocess.is_enabled": True,
    "structured_options.chi2_homogeneity.is_enabled": False,
    "structured_options.category.max_sample_size_to_check_stop_condition": 1,
    "structured_options.category.stop_condition_unique_value_ratio": 0.001,
    "structured_options.sampling_ratio": 0.3,
    "structured_options.null_replication_metrics.is_enabled": False
})

According to the code, there don't appear to be defaults set for these two options. It seems like it would be prudent to put defaults in here?
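For reference, here's roughly how I'm running it end to end with those options (a minimal sketch; the parquet path is a placeholder for my actual file):

import json

import dataprofiler as dp

data = dp.Data("my_table.parquet")  # placeholder path
profile = dp.Profiler(data, options=profile_options)

# "pretty" pre-formats long arrays (samples, histogram edges) for readability
report = profile.report(report_options={"output_format": "pretty"})
with open("report.json", "w") as f:
    f.write(json.dumps(report, indent=4))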

taylorfturner commented 6 months ago

Hey @carlsonp -- first off, thanks for the detailed notes and documentation here.