capitalone / DataProfiler

What's in your data? Extract schema, statistics and entities from datasets
https://capitalone.github.io/DataProfiler
Apache License 2.0

Multiprocessing and Performance Improvements #1117

Open carlsonp opened 5 months ago

carlsonp commented 5 months ago

This is related to some of the discussion in #1098

In my testing I'm profiling a single dataset inside a Docker container, using the following settings:

import json
import time

import dataprofiler as dp

filename = "myfile.parquet"

def profile_test(filename):
    data = dp.Data(filename)
    profile_options = dp.ProfilerOptions()

    # Disable the heavier computations and sample aggressively.
    profile_options.set({
        "structured_options.data_labeler.is_enabled": False,
        "unstructured_options.data_labeler.is_enabled": False,
        "structured_options.correlation.is_enabled": False,
        "structured_options.multiprocess.is_enabled": True,
        "structured_options.chi2_homogeneity.is_enabled": False,
        "structured_options.category.max_sample_size_to_check_stop_condition": 1,
        "structured_options.category.stop_condition_unique_value_ratio": 0.001,
        "structured_options.sampling_ratio": 0.3,
        "structured_options.null_replication_metrics.is_enabled": False
    })

    print(profile_options)

    profile = dp.Profiler(data, options=profile_options)

    human_readable_report = profile.report(report_options={"output_format": "pretty"})

    with open("reportfile.json", "w") as outfile:
        outfile.write(json.dumps(human_readable_report, indent=4))

start_time = time.time()
profile_test(filename)
end_time = time.time()

print(f"Profile runtime for {filename}: {end_time - start_time} seconds")

When DataProfiler reaches the first tqdm loop and displays "Finding the Null values in the columns..." it's pretty quick. It also lists 19 processes, corresponding to the pool_size available to the Python multiprocessing pool. This works fine.
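For context, a default Python multiprocessing pool sizes itself from the machine's CPU count; a quick generic way to see that ceiling (not DataProfiler code):

import multiprocessing as mp

# A Pool created with no explicit processes argument uses this count.
print(mp.cpu_count())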

Then when it gets to the second tqdm loop and displays "Calculating the statistics..." I noticed that it was only using 4 processes, and when I looked at what was running, only a single core was in use. Looking at the code, profile_builder.py has 4 hard-coded as the pool size, which doesn't seem right. There's a utility function, profiler_utils.suggest_pool_size, that returns a suggested pool size but, as far as I can tell, isn't used anywhere in the codebase. So I swapped it in (a sketch of the change follows). Now the run shows 19 processes instead of 4, which seems better; at least we're not leaving potential performance on the table with a hard-coded value.
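For illustration, a minimal sketch of that swap; I'm assuming suggest_pool_size takes optional data_size and cols arguments, and the real call site in profile_builder.py will look different:

from dataprofiler.profilers import profiler_utils

# Illustrative values only; in profile_builder.py these would come from
# the dataset being profiled (assumed signature -- check your version).
data_size = 1_000_000
cols = 50

# Replace the hard-coded pool size of 4 with a machine-derived suggestion.
pool_size = profiler_utils.suggest_pool_size(data_size, cols)
print(pool_size)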

However, I'm still seeing only a single core being used. After reading some comments on Stack Overflow, I also checked the CPU affinity; it looks reasonable to me.

print(f"Affinity: {os.sched_getaffinity(os.getpid())}")

I'm going to try profiling a bit more and see if I can figure out where it's hanging. Calculating all the statistics is taxing, but it still seems like this should be faster, particularly on a multi-core machine.
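For anyone following along, a minimal way to capture a profile that snakeviz can read (the output filename here is arbitrary, and this assumes the script above is already loaded):

import cProfile

# Write profiling stats to a file, then run `snakeviz profile_stats.prof`
# from a shell to explore the results interactively.
cProfile.run("profile_test(filename)", "profile_stats.prof")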

taylorfturner commented 5 months ago

> I'm going to try profiling a bit more and see if I can figure out where it's hanging.

Testing is welcome @carlsonp -- I'll keep an eye on the comments here and in #1098.

carlsonp commented 5 months ago

I made a bit more progress in understanding what's going on. No solutions yet though. Maybe someone will have suggestions. The profiling via snakeviz looks like this:

[snakeviz profile visualization]

It's iterating through the profile_types and adding work to the pool via apply_async. This is fine, but then it immediately goes into a for loop that blocks on get and waits for those jobs to finish. The result is that small batches of 2-4 jobs kick off, then everything waits until they're all done before the next set of 2-4 jobs starts (a sketch of the pattern follows). I can sort of see now why the pool size of 4 was hard-coded. If I comment that out and let all the jobs run at once, only waiting at the end for the work to complete, I get incomplete and bad results. Maybe this really can't be parallelized any further and this is the best we can get, but you all know the codebase much better than I do. Any ideas?
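For illustration, a minimal sketch of that submit-then-block pattern; compute_stat, columns, and profile_types here are made-up stand-ins, not the actual profile_builder.py code:

from multiprocessing import Pool

def compute_stat(task):
    # Stand-in for one profile-type computation on one column.
    return task

if __name__ == "__main__":
    columns = range(10)
    profile_types = ["int", "float", "text"]  # illustrative only

    with Pool(processes=4) as pool:
        for col in columns:
            # Submit a small batch: one job per profile type...
            batch = [pool.apply_async(compute_stat, ((col, pt),))
                     for pt in profile_types]
            # ...then block until the whole batch finishes before the
            # next column's jobs are submitted, so at most
            # len(profile_types) jobs are ever in flight.
            results = [job.get() for job in batch]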

taylorfturner commented 2 months ago

you'll want a rebase onto dev here