fastlmm / FaST-LMM

Python version of Factored Spectrally Transformed Linear Mixed Models
https://fastlmm.github.io/
Apache License 2.0

single_snp with large number of SNPs #51

Open snowformatics opened 4 months ago

snowformatics commented 4 months ago

Hi Carl,

I have some datasets with > 5 million SNPs (but < 500 samples and 1 phenotype). A run with single_snp takes more than 8 hours. Is there any way to speed things up?

Thanks

CarlKCarlK commented 4 months ago

Greetings,

Yes (probably, kind of).

After I get more information, I can give you more details, but the general idea is to split the work into parts and run the parts in parallel. What compute resources (for example, a cluster) do you have access to?

snowformatics commented 4 months ago

Hi Carl,

Thanks for the reply, here are more details:

I have access to GPU and a Slurm cluster.

Thanks

CarlKCarlK commented 4 months ago

Below is some Python code that will let you run single_snp in parts. Put the code in a file such as run_in_parts.py. Change the names of the input files (and add a covariate file if you use one).

To divide the work into 1000 parts and run part index 0, you do:

python run_in_parts.py 0 1000

This will produce the output file result.0of1000.tsv. I'd suggest doing this for several indexes on your local machine to check that it works and that a part runs in a reasonable amount of time.

I don't know that much about Slurm, but the idea is to run this on the cluster. Set the number of parts to something reasonable for the cluster (maybe 10, 20, 100, or 1000; it really depends on the cluster's policies). Be sure the input and output files will work on the cluster (by putting them in some shared space). You also need a way to have fastlmm installed on the cluster for your job. Then have Slurm run the jobs. For example, if you want to run in 10 parts, you'd want 10 Slurm tasks (see the job-array sketch after this list):

python run_in_parts.py 0 10
python run_in_parts.py 1 10
python run_in_parts.py 2 10
python run_in_parts.py 3 10
python run_in_parts.py 4 10
python run_in_parts.py 5 10
python run_in_parts.py 6 10
python run_in_parts.py 7 10
python run_in_parts.py 8 10
python run_in_parts.py 9 10
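
I haven't tested this, but if your cluster supports job arrays, a single submission can launch all the parts. Here is a minimal sketch of an sbatch script; the resource limits and the file name run_parts.sbatch are placeholder assumptions to adjust for your cluster:

#!/bin/bash
#SBATCH --job-name=fastlmm_parts
#SBATCH --array=0-9        # one array task per part index
#SBATCH --time=02:00:00    # placeholder time limit
#SBATCH --mem=16G          # placeholder memory limit

# Slurm sets SLURM_ARRAY_TASK_ID to the 0-based index of this task.
python run_in_parts.py "$SLURM_ARRAY_TASK_ID" 10

You would submit it with sbatch run_parts.sbatch.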

And the output will be 10 tab-separated output files that you can merge and sort (with Pandas or other tools) to see the results.
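
For the merge step, here is a minimal sketch with Pandas (assuming 10 parts, and that the p-value column is named PValue as in single_snp's output):

import glob
import pandas as pd

# Read each per-part result file and stack them into one table.
frames = [pd.read_csv(f, sep="\t") for f in sorted(glob.glob("result.*of10.tsv"))]
results = pd.concat(frames, ignore_index=True)

# Sort so the smallest p-values (strongest associations) come first, then save.
results = results.sort_values("PValue")
results.to_csv("result.all.tsv", sep="\t", index=False)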

import argparse
from fastlmm.association import single_snp
from pysnptools.snpreader import Bed

def main(part_count, part_index):
    # File paths
    bed_file = r"O:\programs\pysnptools\pysnptools\examples\toydata.bed"
    pheno_file = r"O:\programs\pysnptools\pysnptools\examples\toydata.phe"

    # Load the BED and phenotype data
    bed = Bed(bed_file, count_A1=True)

    # Calculate start and end indexes for the current part
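    # The integer arithmetic keeps the parts contiguous and non-overlapping even
    # when sid_count is not divisible by part_count (for example, with
    # 5,000,000 SNPs and 1,000 parts, part 0 covers SNPs [0, 5000)).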
    snp_start = bed.sid_count * part_index // part_count
    snp_end = bed.sid_count * (part_index + 1) // part_count

    # Slice the BED file to get the SNPs for the current part
    test_snps = bed[:, snp_start:snp_end]

    # Perform single SNP association test and save results to a file
    output_file_name = f"result.{part_index}of{part_count}.tsv"
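    # K0=bed builds the genetic similarity (kinship) matrix from all the SNPs,
    # so every part is tested against the same background model.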
    single_snp(test_snps=test_snps, pheno=pheno_file, K0=bed, output_file_name=output_file_name)
    print(f"Results saved to {output_file_name}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Run single SNP association test on a partition of SNPs.')
    parser.add_argument('part_index', type=int, help='Index of the current part (0-based).')
    parser.add_argument('part_count', type=int, help='Total number of parts to divide the SNP data into.')

    args = parser.parse_args()

    main(args.part_count, args.part_index)

CarlKCarlK commented 4 months ago

I wrote a reply but accidentally sent it unfinished. So, be sure to read it on GitHub for the final version.

snowformatics commented 4 months ago

Thanks a lot Carl, I will test it!