fastlmm / FaST-LMM

Python version of Factored Spectrally Transformed Linear Mixed Models
https://fastlmm.github.io/
Apache License 2.0

Using FaST-LMM single_snp() in a cluster server #45

Closed Jorge-Hernansanz closed 3 months ago

Jorge-Hernansanz commented 4 months ago

Hi,

I want to use the FaST-LMM tool for associations with lipidomics data. Even though the number of individuals (~300) and lipids (~100) is relatively small, we want to test against a large number of SNPs (~1,000,000).

Does the single_snp() function parallelize the workload when running on a cluster with several processors available for the job?

Best regards, Jorge

CarlKCarlK commented 4 months ago

Jorge,

Thanks for using FaST-LMM!

I have some questions and then (I think) an answer.

FaST-LMM has several ways to approach cluster runs. However, because your similarity matrix will be only 300 x 300, we can do something simple: divide up the test SNPs using PySnpTools (a sister package to FaST-LMM).

Here is a test I did on 300 individuals x 333,333 SNPs and 100 phenotypes. I was able to process 1/50th of the data in 14 seconds.

import numpy as np
from pysnptools.snpreader import Bed
from pysnptools.snpreader import SnpData
from fastlmm.association import single_snp

# Trim a bed file to 300 x 333,333
some_bed = "mydata.bed"  # hypothetical path; point this at your own PLINK .bed file
bed = Bed(some_bed, count_A1=True)[:300, :333333]
print(bed.shape)  # prints (300, 333333)

# create a 300 x 100 random phenotype values - no values are missing
pheno = SnpData(iid=bed.iid,
                sid=["pheno_{0}".format(i) for i in range(100)],
                val=np.random.randn(300, 100))
print(pheno.shape) # prints (300, 100)

# divide the test SNPs into 50 chunks and run chunk 22
chunk_count = 50
chunk_index = 22

chunk_size = bed.sid_count // chunk_count
chunk_start = chunk_size * chunk_index
# let the last chunk absorb any remainder so no SNPs are skipped
chunk_end = bed.sid_count if chunk_index == chunk_count - 1 else chunk_start + chunk_size
test_snps = bed[:, chunk_start:chunk_end]

results_df = single_snp(test_snps=test_snps, K0=bed, pheno=pheno,
                        count_A1=True,  # match the setting used when reading the Bed file
                        output_file_name=f"results_{chunk_index}_of_{chunk_count}.txt")
results_df
# runs in about 14 seconds on my machine
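
To fan the 50 chunks out across a cluster, the chunk index can come from the scheduler instead of being hard-coded. Here is a minimal sketch assuming a SLURM array job submitted with sbatch --array=0-49 (other schedulers expose a similar task-index variable):

import os

# Hypothetical SLURM setup: each array task handles one chunk.
chunk_count = 50
chunk_index = int(os.environ["SLURM_ARRAY_TASK_ID"])  # 0..49

# ... then run the chunking and single_snp code above unchanged.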

Let me know if something like this works for you. If not, there are several more steps that are possible.
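
By the way, once all the chunks have finished, the per-chunk result files can be stitched back into one table. A sketch, assuming tab-delimited text output files and pandas installed:

import pandas as pd

chunk_count = 50

# Read each chunk's results file and concatenate them.
all_results = pd.concat(
    (pd.read_csv(f"results_{i}_of_{chunk_count}.txt", sep="\t")
     for i in range(chunk_count)),
    ignore_index=True)

# Sort by p-value across all SNPs and phenotypes, if desired.
all_results = all_results.sort_values("PValue")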

-- Carl

Carl Kadie, Ph.D.
FaST-LMM & PySnpTools Team (Microsoft Research, retired)
https://www.linkedin.com/in/carlk/
Join the FaST-LMM user discussion and announcement list via email (or use the web sign-up).

Jorge-Hernansanz commented 4 months ago

Hi Carl,

Sorry for not replying sooner.

Thanks for the suggestion. I ran your code on my data and one chunk took 22 seconds. When running on all the SNPs it took only around 50 minutes, so I am surprised it is that fast.

I obtained these times excluding the part where I write the results to a file. When I include the write (either with the built-in output_file_name parameter you use or with pandas' to_csv()), runtimes increase significantly: for 1/50 of the SNPs it takes around 38 seconds; for 1/10 of the SNPs, 51 seconds without writing versus 120 seconds with writing; and for the whole set of SNPs, the runtime goes up to 90 minutes.
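
For reference, the kind of with/without-writing comparison I mean, sketched with Python's time module and the variables from your example:

import time

# Scan without writing any text output.
start = time.time()
results_df = single_snp(test_snps=test_snps, K0=bed, pheno=pheno, count_A1=True)
print(f"without writing: {time.time() - start:.1f} s")

# The same scan, but also writing the results as text.
start = time.time()
results_df = single_snp(test_snps=test_snps, K0=bed, pheno=pheno, count_A1=True,
                        output_file_name="results.txt")
print(f"with writing: {time.time() - start:.1f} s")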

Is there an explanation for why this happens?

CarlKCarlK commented 4 months ago

Great! I appreciate that your first note included the size of your data so that I could estimate the timing.

Yes, my guess is that it's just the time it takes to write that much text to disk. One idea for improvement: since single_snp returns a pandas DataFrame, you could skip output_file_name and save the results in a compact binary format yourself, as sketched below.
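
A sketch (to_parquet needs pyarrow or fastparquet installed; any binary format, such as HDF5, would work similarly):

# Run the association scan without writing text output ...
results_df = single_snp(test_snps=test_snps, K0=bed, pheno=pheno, count_A1=True)

# ... and save the returned DataFrame in a binary format instead.
results_df.to_parquet(f"results_{chunk_index}_of_{chunk_count}.parquet")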

What is your computing setup? Is 90 minutes tolerable?

-- Carl

Jorge-Hernansanz commented 3 months ago

Yeah, 90 minutes is more than okay.

Thanks for all the help!

CarlKCarlK commented 3 months ago

You're welcome. Thanks for using FaST-LMM!