bd2kccd / r-causal

R Wrapper for Tetrad Library

loadDiscreteData function is extremely slow when the 'alarm' dataset (20,000 samples) is used #106

Closed · yasu-sh closed this 1 year ago

yasu-sh commented 2 years ago

I am using the 'alarm' dataset from bnlearn, synthetic discrete data with 20,000 samples. Loading the data takes about 60 seconds, while causal discovery itself takes only about 2-3 seconds.


I would like to improve this situation. My ideas (a rough sketch follows the list):

  1. Avoid applying sapply across the entire sample; the factor function can encode a whole column in one vectorized call.
  2. Reduce per-cell memory reads and writes; use a matrix object for batch data conversion.
  3. Avoid the '$' operator and use the .jcall function instead, since the reflection-based dispatch behind '$' is time consuming.
  4. Disable console output during the conversion.

Together these changes should give a good result.
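To make ideas 1-3 concrete, here is a minimal sketch, not the actual pull request; `discretize_fast` and `dataBox` are hypothetical names used only for illustration:

```r
library(rJava)

# Ideas 1 + 2: encode all discrete columns as zero-based integer codes
# with vectorized factor() calls, then pack them into a single matrix
# so the Java side can be filled in one batch instead of cell by cell.
discretize_fast <- function(df) {
  codes <- lapply(df, function(col) as.integer(factor(col)) - 1L)
  do.call(cbind, codes)
}

# Idea 3: call Java methods with .jcall() instead of the `$` operator.
# `$` resolves the method by reflection on every call, which is slow
# inside a loop over 20000 rows. (dataBox is a hypothetical jobjRef
# with a void set(int row, int col, int value) method.)
# Slow:  dataBox$set(i, j, v)
# Fast:  .jcall(dataBox, "V", "set", i, j, v)
```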

jdramsey commented 1 year ago

BTW I'm pretty sure this will be much faster using py-tetrad in R...

yasu-sh commented 1 year ago

@jdramsey Yes! This issue was already solved by my pull request last year.

The choice of R implementation matters a lot for performance. R has lazy evaluation, so we need to profile to find where the hot spots are (a sketch is below). https://github.com/bd2kccd/r-causal/blob/fc370245e938d5a2cb6e6abd4548bf7107fdd1dc/R/tetrad_utils.R#L336
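For example, base R's Rprof can locate the hot spots. A minimal sketch, assuming loadDiscreteData is called as in the issue title; check the package for the exact signature:

```r
# Profile the slow call and list where the time is spent, by function.
Rprof("load_profile.out")
data <- loadDiscreteData(df)   # df: the 20,000-sample discrete data frame
Rprof(NULL)
summaryRprof("load_profile.out")$by.self
```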

jdramsey commented 1 year ago

Ah, thanks!