DataResponsibly / DataSynthesizer

MIT License

Time issues with DataSynthesizer #31

Open mahmoudibrahim98 opened 3 years ago

mahmoudibrahim98 commented 3 years ago

Description

Hello, I am using DataSynthesizer to generate synthetic data for research purposes. I've been using this package for months, and it works perfectly with small datasets. However, with larger datasets, especially ones with more columns, run time becomes a problem. A single dataset (71236 instances and 52 columns) took more than 18 hours to synthesize on a 64-core machine (with degree_of_bayesian_network=0). I also tried setting degree_of_bayesian_network to 2 instead of the default 0. The run time decreases, at the cost of lower quality in the synthesized data, but it still takes too long. What do you suggest? Is there a better way to approach larger datasets?

mahmoudibrahim98 commented 3 years ago

Also, could you explain the effect of using k=3 compared with k=2?

haoyueping commented 3 years ago

Please try k=1. k is the maximum number of parents per node in the constructed Bayesian network. The running time and complexity of DataSynthesizer increase dramatically with k.

When k=0, the value of k is determined automatically, and it can be very large.
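
To give a rough sense of why the cost grows so quickly with k, here is a minimal sketch that counts how many (child, parent set) candidates a greedy structure search would have to score for a 52-column dataset. It assumes a PrivBayes-style greedy search in which, at each step, every remaining attribute is paired with every size-min(k, i) subset of the i attributes already placed in the network. The function name `count_candidate_pairs` is hypothetical and is not part of DataSynthesizer; the real implementation may prune or parallelize differently, so treat the numbers as an order-of-magnitude illustration only.

```python
from math import comb


def count_candidate_pairs(num_attributes: int, k: int) -> int:
    """Count (child, parent set) candidates scored by an assumed greedy search.

    Hypothetical helper for illustration; expects k >= 1.
    """
    total = 0
    # num_chosen = attributes already placed in the network at this step.
    for num_chosen in range(1, num_attributes):
        num_remaining = num_attributes - num_chosen
        # A node cannot have more parents than there are placed attributes.
        num_parents = min(k, num_chosen)
        total += num_remaining * comb(num_chosen, num_parents)
    return total


if __name__ == "__main__":
    # 52 columns, as in the dataset described in this issue.
    for k in (1, 2, 3, 4):
        print(f"k={k}: {count_candidate_pairs(52, k)} candidates")
```

Each candidate also needs a joint distribution over the child and its parents, whose size grows with the product of the attributes' domain sizes, so the actual slowdown from increasing k can be steeper than the candidate count alone suggests.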