snowformatics opened this issue 4 months ago

Hi Carl,

I have some datasets with > 5 million SNPs (but < 500 samples and 1 phenotype). The run with single_snp takes more than 8 hours. Is there any way to speed things up?

Thanks
Greetings,
Yes, (probably, kind of)
After I get more information, I can give you more details, but the general idea is:
- (optional) Precompute the similarity matrix for each leave-one-out chromosome (see the sketch after this list).
- Run different parts of the test SNPs on different processors on one machine.
- Run different parts of the test SNPs on different machines.
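For the optional first step, here is a minimal sketch of how the precomputation might look. It assumes pysnptools's SnpKernel/KernelNpz readers and single_snp's leave_out_one_chrom flag behave as in current releases; the kernel file names and the toydata paths are placeholders:

```python
# Hedged sketch: precompute a leave-one-out similarity (kernel) matrix per
# chromosome, save it, and reuse it when testing that chromosome's SNPs.
# Paths and kernel file names are placeholders.
from fastlmm.association import single_snp
from pysnptools.kernelreader import KernelNpz, SnpKernel
from pysnptools.snpreader import Bed
from pysnptools.standardizer import Unit

bed = Bed(r"O:\programs\pysnptools\pysnptools\examples\toydata.bed", count_A1=True)
pheno_file = r"O:\programs\pysnptools\pysnptools\examples\toydata.phe"

for chrom in set(bed.pos[:, 0]):  # column 0 of .pos is the chromosome
    # Build the kernel from all SNPs *not* on this chromosome (leave-one-out)
    loo_snps = bed[:, bed.pos[:, 0] != chrom]
    kernel_data = SnpKernel(loo_snps, Unit()).read()
    KernelNpz.write(f"kernel.chrom{int(chrom)}.npz", kernel_data)

    # Later (possibly on another machine): test this chromosome's SNPs against
    # the precomputed kernel. leave_out_one_chrom=False because the chromosome
    # was already excluded when the kernel was built.
    single_snp(
        test_snps=bed[:, bed.pos[:, 0] == chrom],
        pheno=pheno_file,
        K0=KernelNpz(f"kernel.chrom{int(chrom)}.npz"),
        leave_out_one_chrom=False,
        output_file_name=f"result.chrom{int(chrom)}.tsv",
    )
```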
Carl
Hi Carl,
thanks for the reply, here are more details:
I have access to GPU and a Slurm cluster.
Thanks
Below is some Python code that will let you run single_snp in parts. Put the code in a file such as run_in_parts.py. Change the names of the input files (and add a covariate file if you use one).
To divide the work into 1000 parts and run part index 0, you do:
```
python run_in_parts.py 0 1000
```
This will produce the output file result.0of1000.tsv. I'd suggest doing this for several indexes on your local machine to check that it works and runs a part in a reasonable amount of time.
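If you want to exercise several processors on one machine while you test (the second bullet in my earlier list), a stdlib-only sketch like this could launch a few part indexes as separate processes; the indexes and the part count of 1000 are just the values from the example above:

```python
# Hedged sketch: run a few part indexes of run_in_parts.py in parallel
# as separate processes on one machine (stdlib only).
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

PART_COUNT = 1000          # total number of parts, as in the example above
TEST_INDEXES = [0, 1, 2]   # a few indexes to smoke-test locally

def run_part(part_index):
    # Each part runs as an independent process, so parts use separate cores.
    return subprocess.run(
        [sys.executable, "run_in_parts.py", str(part_index), str(PART_COUNT)],
        check=True,
    )

with ThreadPoolExecutor(max_workers=len(TEST_INDEXES)) as pool:
    list(pool.map(run_part, TEST_INDEXES))
```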
I don't know that much about Slurm, but the idea is to run this on the cluster. Set the number of parts to something reasonable for the cluster (maybe 10, 20, 100, or 1000; it really depends on the cluster's policies). Be sure the input and output files will work on the cluster (by putting them in some shared space). You also need fastlmm installed on the cluster for your job. Then you have Slurm run the jobs. For example, if you want to run in 10 parts, you'd want 10 Slurm tasks (one way to express them as an array job is sketched after this list):
```
python run_in_parts.py 0 10
python run_in_parts.py 1 10
python run_in_parts.py 2 10
python run_in_parts.py 3 10
python run_in_parts.py 4 10
python run_in_parts.py 5 10
python run_in_parts.py 6 10
python run_in_parts.py 7 10
python run_in_parts.py 8 10
python run_in_parts.py 9 10
```
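One common way to express those ten tasks is a Slurm array job. This sbatch file is only a sketch under assumptions about your cluster; you will likely need site-specific #SBATCH options, module loads, or a virtualenv activation:

```bash
#!/bin/bash
# Hedged sketch of a Slurm array job: one task per part index (0..9).
# Adjust paths, environment setup, and site-specific options for your cluster.
#SBATCH --job-name=single_snp_parts
#SBATCH --array=0-9

# Each array task runs one part; SLURM_ARRAY_TASK_ID takes the values 0..9.
python run_in_parts.py "$SLURM_ARRAY_TASK_ID" 10
```

You would submit it with sbatch run_parts.sbatch (the file name is arbitrary).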
The output will be 10 tab-separated result files that you can merge and sort to see the results (with pandas or other tools); a merge sketch follows the script below.
```python
import argparse

from fastlmm.association import single_snp
from pysnptools.snpreader import Bed


def main(part_count, part_index):
    # File paths (change these to your own data)
    bed_file = r"O:\programs\pysnptools\pysnptools\examples\toydata.bed"
    pheno_file = r"O:\programs\pysnptools\pysnptools\examples\toydata.phe"

    # Open the BED data (metadata only; SNP values are read lazily)
    bed = Bed(bed_file, count_A1=True)

    # Calculate start and end indexes for the current part
    snp_start = bed.sid_count * part_index // part_count
    snp_end = bed.sid_count * (part_index + 1) // part_count

    # Slice the BED file to get the SNPs for the current part
    test_snps = bed[:, snp_start:snp_end]

    # Perform the single-SNP association test and save results to a file
    output_file_name = f"result.{part_index}of{part_count}.tsv"
    single_snp(test_snps=test_snps, pheno=pheno_file, K0=bed,
               output_file_name=output_file_name)
    print(f"Results saved to {output_file_name}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Run a single-SNP association test on a partition of SNPs.")
    parser.add_argument("part_index", type=int,
                        help="Index of the current part (0-based).")
    parser.add_argument("part_count", type=int,
                        help="Total number of parts to divide the SNP data into.")
    args = parser.parse_args()
    main(args.part_count, args.part_index)
```
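And for the merge-and-sort step mentioned above, a small pandas sketch; it assumes 10 parts and relies on the PValue column that single_snp writes:

```python
# Hedged sketch: merge the per-part TSV results and sort by p-value.
import pandas as pd

# Read each per-part result file and combine them
part_count = 10
frames = [
    pd.read_csv(f"result.{i}of{part_count}.tsv", sep="\t")
    for i in range(part_count)
]

# Sort the combined results by p-value and write one merged file
all_results = pd.concat(frames, ignore_index=True).sort_values("PValue")
all_results.to_csv("result.all.tsv", sep="\t", index=False)
```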
I wrote a reply but accidentally sent it unfinished. So, be sure to read it on GitHub for the final version.
Thanks a lot Carl, I will test it!