FeatureBaseDB / tools

Tools for development and ops
BSD 3-Clause "New" or "Revised" License

imagine: performant sparse data #83

Closed jaffee closed 5 years ago

jaffee commented 5 years ago

Currently, imagine struggles to generate data efficiently when there are many rows (1000+), even if bits are set very sparsely within those rows. It should be possible to do this quite efficiently using a different generation strategy.

Rather than trying to choose this strategy automatically based on the data distribution, I think a good first step would be a flag marking which tasks should use this generator instead of the default one.

seebs commented 5 years ago

When I was looking at this a while back, someone suggested using a Poisson distribution to decide how many bits should be set in some larger region, then picking those specific bits some other way. The permutation functionality should be fine for picking a handful of bits in a range -- the first M values from a permutation of N items is a reasonable way to represent "M out of N bits set", for instance. The permutation approach is fairly slow if you're generating all the bits, but if you only need the first few, it should be extremely fast compared to evaluating every bit at a very low probability.
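For illustration, here is a minimal standalone sketch of that idea in Go. It does not use imagine's own permutation generator; the names `poisson`, `pickM`, and `sparseBits` are hypothetical, and it assumes a plain `math/rand` RNG rather than imagine's seeded, repeatable generators. A Poisson draw decides how many bits land in a region, and a sparse Fisher-Yates shuffle (equivalent to taking the first M values of a permutation of N) picks which positions, without ever evaluating the other positions.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// poisson draws from a Poisson distribution with mean lambda using
// Knuth's multiplication method (adequate for small means).
func poisson(rng *rand.Rand, lambda float64) int {
	limit := math.Exp(-lambda)
	k, p := 0, 1.0
	for {
		p *= rng.Float64()
		if p <= limit {
			return k
		}
		k++
	}
}

// pickM returns m distinct positions in [0, n) -- the first m values of a
// random permutation of n items -- in O(m) time, using a sparse
// Fisher-Yates shuffle backed by a map instead of a full slice.
func pickM(rng *rand.Rand, n, m int) []int {
	get := func(s map[int]int, k int) int {
		if v, ok := s[k]; ok {
			return v
		}
		return k
	}
	swapped := make(map[int]int)
	out := make([]int, 0, m)
	for i := 0; i < m; i++ {
		j := i + rng.Intn(n-i)
		out = append(out, get(swapped, j))
		swapped[j] = get(swapped, i)
	}
	return out
}

// sparseBits chooses the bit positions to set within a region: a Poisson
// draw decides how many, pickM decides which.
func sparseBits(rng *rand.Rand, regionSize int, density float64) []int {
	m := poisson(rng, density*float64(regionSize))
	if m > regionSize {
		m = regionSize
	}
	return pickM(rng, regionSize, m)
}

func main() {
	rng := rand.New(rand.NewSource(1))
	// A region of 2^20 columns at density 1e-5: expect ~10 bits set,
	// found without evaluating the other ~million positions.
	fmt.Println(sparseBits(rng, 1<<20, 1e-5))
}
```

The work per region is proportional to the number of bits actually set rather than the number of candidate positions, which is the point of the approach for very sparse rows.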

travisturner commented 5 years ago

Fixed by #88