aehrc / VariantSpark

machine learning for genomic variants
http://bioinformatics.csiro.au/variantspark

run on single chromosomes? #224

Closed sshanshans closed 2 years ago

sshanshans commented 2 years ago

My collaborator and I would like to use VariantSpark to identify genetic variants associated with colorectal cancer risk. Our dataset covers about 350,000 individuals from UK Biobank and more than 10 million genetic variants. Since this is a very large dataset, we were wondering whether you have any recommendations on how to set up the data file and configure the computing cluster so that the program scales.

More precisely, we can currently run VariantSpark for the 350,000 individuals with the genetic variants from chromosome 21, which has the fewest genetic variants. If we want to include variants from all chromosomes, should we make a single file that contains all 10 million genetic variants, or could we run the program on each chromosome and combine the analyses later? Which would you recommend?

rocreguant commented 2 years ago

Hi Shan, you don't need to create a single file; you can merge the per-chromosome files at run time using Hail commands before running VariantSpark (VS).
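As a rough sketch of what the run-time merge could look like, assuming the data is stored as one bgzipped VCF per chromosome with identical samples in every file: the path template, the restriction to autosomes, and the `GRCh37` reference are all assumptions made for illustration, not something taken from your setup. Hail's `import_vcf` accepts a list of paths, so no merged file has to be written to disk.

```python
from typing import List, Sequence

# Autosomes only; this restriction is an assumption to keep the sketch small.
CHROMOSOMES = tuple(str(c) for c in range(1, 23))


def chromosome_paths(template: str, chromosomes: Sequence[str] = CHROMOSOMES) -> List[str]:
    """Expand a hypothetical path template such as 'data/chr{chrom}.vcf.bgz'."""
    return [template.format(chrom=c) for c in chromosomes]


def load_all_chromosomes(template: str, reference_genome: str = "GRCh37"):
    """Read every per-chromosome VCF into a single Hail MatrixTable."""
    import hail as hl  # imported lazily so the helper above works without Hail

    # import_vcf takes a list of paths; all files must contain the same samples.
    return hl.import_vcf(chromosome_paths(template), reference_genome=reference_genome)
```

The resulting MatrixTable can then be handed to VS in the same way as the single-chromosome one you already have working.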

From your description, that is probably going to require a lot of RAM, and the more (fast) CPUs the better.

If your cluster allows it, running everything at once would be the most beneficial. By running it all together, VS will be able to extract extra information, such as cross-chromosome epistatic interactions, that it otherwise could not. However, computers have physical limitations, so the second-best way is to run it chromosome by chromosome and aggregate the analyses.

Another way would be to prune the dataset. You could remove highly correlated mutations using linkage disequilibrium; this way you'd have a "cleaner" dataset. You could also use VS in a two-step process: first, run each chromosome individually to remove all variants that have little to no importance, and then run the complete dataset with the variants that passed the importance threshold.
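Both ideas could be sketched roughly as follows. The `r2` and window values, the threshold, and the assumption that the importance scores are already loaded into a plain dictionary of variant ID to score are all illustrative; pick values that suit your data. `hl.ld_prune` is Hail's LD pruning method and expects biallelic variants.

```python
from typing import Dict, List


def select_important(importances: Dict[str, float], threshold: float) -> List[str]:
    """Return variant IDs scoring strictly above the threshold, highest first."""
    kept = [variant for variant, score in importances.items() if score > threshold]
    return sorted(kept, key=lambda variant: importances[variant], reverse=True)


def ld_prune_matrix_table(mt, r2: float = 0.2, bp_window_size: int = 500_000):
    """Keep only variants that survive LD pruning (biallelic variants assumed)."""
    import hail as hl  # imported lazily so select_important works without Hail

    # ld_prune returns a Table of the variants to keep, keyed by locus and alleles.
    pruned = hl.ld_prune(mt.GT, r2=r2, bp_window_size=bp_window_size)
    return mt.filter_rows(hl.is_defined(pruned[mt.row_key]))
```

For the two-step process, `select_important` would be applied to each chromosome's scores, and the surviving IDs from all chromosomes would then define the reduced dataset for the final run.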

I hope that helps :)