aehrc / VariantSpark

machine learning for genomic variants
http://bioinformatics.csiro.au/variantspark

optimise VariantSpark for large sample size (n>50K) #204

Open natwine opened 3 years ago

natwine commented 3 years ago

VariantSpark is currently optimised for reasonably small sample sizes (n = 100-5000) and large numbers of variants (e.g. 42 million), i.e. 'wide' datasets. Working on phenotypes in UKBB, e.g. CAD, we have sample sizes of ~50K at our disposal, and VariantSpark has a long run time (~3 days) when dealing with samples at that scale. As we expect genomic cohorts to keep growing, it is worth considering how to optimise VariantSpark for larger sample sizes (50K plus).

DavidB-XI commented 2 years ago

This work is inspiring, great method and deployment model!

You could always summarise the full set of samples into a reduced dimension, fit the model in the reduced sample space, and then apply the learned parameters to predict the original variable.

This could prove useful for VariantSpark, which handles millions of features well but slows down once there are tens of thousands of samples.

If you like this idea, I've implemented a method that finds an encoding of the sample space, reduces the number of samples enough to carry out a faster and more efficient regression, and then unfolds the prediction so that the result is as though the model ran on the full sample space.

You can find this method here: https://github.com/AskExplain/summary_sampling_via_folding/blob/main/prediction_using_fold_sampling.pdf
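For anyone skimming this thread, a minimal sketch of the general fold/unfold idea in Python may help. To be clear, this is not the method from the PDF above (which defines its own encoding of the sample space): the k-means folding, the ridge regression, and the synthetic genotype data below are all illustrative stand-ins.

```python
# Illustrative sketch only: fold n samples into k pseudo-samples, fit a model
# on the folded data, then "unfold" by predicting on the original samples.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_samples, n_variants = 5_000, 200  # toy scale; UKBB-style data is far larger
X = rng.integers(0, 3, size=(n_samples, n_variants)).astype(float)  # dosages 0/1/2
y = X[:, :10].sum(axis=1) + rng.normal(scale=0.5, size=n_samples)   # synthetic phenotype

# Fold: summarise the sample space into k representative pseudo-samples.
k = 100
km = KMeans(n_clusters=k, n_init=3, random_state=0).fit(X)
X_fold = km.cluster_centers_                                   # (k, n_variants)
y_fold = np.array([y[km.labels_ == c].mean() for c in range(k)])

# Fit in the reduced sample space: the expensive step sees k rows, not n_samples.
model = Ridge(alpha=1.0).fit(X_fold, y_fold)

# Unfold: the learned parameters apply directly to the original samples.
y_hat = model.predict(X)
print(f"corr(y, y_hat) = {np.corrcoef(y, y_hat)[0, 1]:.3f}")
```

The point is only that the costly fit runs on k rows instead of n, while prediction still covers the full sample space; the linked PDF should be consulted for the actual encoding and unfolding steps.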

I've tried to run it on the 1000 Genomes sample dataset, but ran into errors when installing the library on my local machine, so unfortunately I can't apply this idea with VariantSpark myself.

If you need help translating the code to CSV / VCF files, let me know in this issue thread. If it works, let me know here too; it would be great to work on this with the team!
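In case it helps get started on that translation, here is a rough sketch of flattening a VCF into a dosage CSV. It assumes a small, biallelic, GT-only VCF; the function name and the CHROM_POS variant naming are hypothetical, and a real pipeline would be better served by a proper parser such as cyvcf2 or pysam.

```python
# Hypothetical helper (illustrative only): flatten a small, biallelic, GT-only
# VCF into a samples-by-variants dosage CSV for simple regression code.
import csv
import gzip

def vcf_to_dosage_csv(vcf_path, csv_path):
    opener = gzip.open if vcf_path.endswith(".gz") else open
    samples, variant_ids, genotype_rows = [], [], []
    with opener(vcf_path, "rt") as vcf:
        for line in vcf:
            if line.startswith("##"):
                continue  # skip meta-information lines
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#CHROM"):
                samples = fields[9:]  # sample IDs follow the 9 fixed columns
                continue
            variant_ids.append(f"{fields[0]}_{fields[1]}")  # CHROM_POS as variant name
            # GT is the first sub-field of each sample column, e.g. "0|1:..."
            genotype_rows.append([col.split(":")[0].replace("|", "/") for col in fields[9:]])
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["sample"] + variant_ids)
        for i, sample in enumerate(samples):
            # Dosage = count of non-reference alleles; missing "." counts as 0 here.
            dosages = [
                sum(allele not in ("0", ".") for allele in row[i].split("/"))
                for row in genotype_rows
            ]
            writer.writerow([sample] + dosages)
```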