aehrc / VariantSpark

machine learning for genomic variants
http://bioinformatics.csiro.au/variantspark

optimise VariantSpark for large sample size (n>50K) #204

Open natwine opened 3 years ago

natwine commented 3 years ago

VariantSpark is currently optimised for reasonably small sample sizes (n = 100-5000) and large numbers of variants (e.g. 42 million), i.e. 'wide' datasets. Working on phenotypes in UKBB, e.g. CAD, we have sample sizes of ~50K at our disposal, and VariantSpark has a long run time (~3 days) when dealing with samples at that scale. As we expect genomic cohorts to keep growing, it is worth considering how to optimise VariantSpark for larger sample sizes (50K plus).

DavidB-XI commented 2 years ago

This work is inspiring, great method and deployment model!

You could always summarise the full set of samples into a reduced dimension, fit the model in the reduced sample space, and then apply the learned parameters to predict the original variable.

This could prove useful for VariantSpark, which handles millions of features well but slows down once there are tens of thousands of samples.

If you like this idea, I've implemented a method that finds an encoding of the sample space, reduces the number of samples enough to carry out a faster and more efficient regression, and then unfolds the prediction so that the result is as though the model ran on the full sample space.

You can find this method here: https://github.com/AskExplain/summary_sampling_via_folding/blob/main/prediction_using_fold_sampling.pdf
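For anyone skimming this thread, a minimal sketch of the general fold/unfold idea in Python may help. To be clear, this is not the method from the PDF above (which defines its own encoding of the sample space): the k-means folding, the ridge regression, and the synthetic genotype data below are all illustrative stand-ins.

```python
# Illustrative sketch only: fold n samples into k pseudo-samples, fit a model
# on the folded data, then "unfold" by predicting on the original samples.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_samples, n_variants = 5_000, 200  # toy scale; UKBB-style data is far larger
X = rng.integers(0, 3, size=(n_samples, n_variants)).astype(float)  # dosages 0/1/2
y = X[:, :10].sum(axis=1) + rng.normal(scale=0.5, size=n_samples)   # synthetic phenotype

# Fold: summarise the sample space into k representative pseudo-samples.
k = 100
km = KMeans(n_clusters=k, n_init=3, random_state=0).fit(X)
X_fold = km.cluster_centers_                                   # (k, n_variants)
y_fold = np.array([y[km.labels_ == c].mean() for c in range(k)])

# Fit in the reduced sample space: the expensive step sees k rows, not n_samples.
model = Ridge(alpha=1.0).fit(X_fold, y_fold)

# Unfold: the learned parameters apply directly to the original samples.
y_hat = model.predict(X)
print(f"corr(y, y_hat) = {np.corrcoef(y, y_hat)[0, 1]:.3f}")
```

The point is only that the costly fit runs on k rows instead of n, while prediction still covers the full sample space; the linked PDF should be consulted for the actual encoding and unfolding steps.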

I've tried to run it on the 1000 Genomes sample dataset, but ran into errors when installing the library on my local machine, so unfortunately I can't apply this idea with VariantSpark myself.

If you need help translating the code to CSV / VCF files, let me know in this issue thread. If it works, let me know here too; it would be great to work on this with the team!
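In case it helps get started on that translation, here is a rough sketch of flattening a VCF into a dosage CSV. It assumes a small, biallelic, GT-only VCF; the function name and the CHROM_POS variant naming are hypothetical, and a real pipeline would be better served by a proper parser such as cyvcf2 or pysam.

```python
# Hypothetical helper (illustrative only): flatten a small, biallelic, GT-only
# VCF into a samples-by-variants dosage CSV for simple regression code.
import csv
import gzip

def vcf_to_dosage_csv(vcf_path, csv_path):
    opener = gzip.open if vcf_path.endswith(".gz") else open
    samples, variant_ids, genotype_rows = [], [], []
    with opener(vcf_path, "rt") as vcf:
        for line in vcf:
            if line.startswith("##"):
                continue  # skip meta-information lines
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#CHROM"):
                samples = fields[9:]  # sample IDs follow the 9 fixed columns
                continue
            variant_ids.append(f"{fields[0]}_{fields[1]}")  # CHROM_POS as variant name
            # GT is the first sub-field of each sample column, e.g. "0|1:..."
            genotype_rows.append([col.split(":")[0].replace("|", "/") for col in fields[9:]])
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["sample"] + variant_ids)
        for i, sample in enumerate(samples):
            # Dosage = count of non-reference alleles; missing "." counts as 0 here.
            dosages = [
                sum(allele not in ("0", ".") for allele in row[i].split("/"))
                for row in genotype_rows
            ]
            writer.writerow([sample] + dosages)
```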