fastlmm / FaST-LMM

Python version of Factored Spectrally Transformed Linear Mixed Models
https://fastlmm.github.io/
Apache License 2.0
47 stars 11 forks source link

GenDist is NaN after running single_snp analysis #11

Closed lucasmiranda42 closed 3 years ago

lucasmiranda42 commented 3 years ago

Hi! First of all, thank you for such a great tool! After using it a few times and getting the expected results, I now stumbled upon a dataset from which I get NaN values for all genetic distances. I'm running the code below, independently, for each chromosome of an imputed dataset. As K0 I'm passing the full unimputed genome for the same set of samples. The code is part of a SnakeMake pipeline; input.imp_bed and input.dst_bed are strings with the names of the mentioned testing and distance bed files, respectively. I checked all inputs for missing data, malformations, etc. and nothing I came up with seemed to be the case. I don't think it's a bug, but rather a mistake on my side. Is there any obvious hint? Thanks a lot!

snpdata = Bed(input.imp_bed)[:,::2].read() snpdist = Bed(input.dst_bed)[:,::2].read() results_df = single_snp(test_snps=snpdata, pheno=input.phenos, K0=snpdist, count_A1=False, leave_out_one_chrom=True) results_df.to_csv(output.gwas, sep="\t")

CarlKCarlK commented 3 years ago

Lucas,

Thanks for using FaST-LMM and PySnpTools. (I see and appreciate that you’re using the slicing features of PySnpTools, e.g. “[:,::2]”)

Summary:

It may just mean that your Bed file has some genetic distances set to unknown.

Details:

We recently changed the Bed reader to be consistent with the other SnpReaders (https://fastlmm.github.io/PySnpTools/#snpreader-snpreader). I suspect that change may be giving you unexpected results. Specifically, in a BED file if a genetic distance is unknown, it should be set to “0”. However, other SnpReaders don’t follow that standard, so now the Bed reader changes any “0”’s to NaN.

Please check your Bed file for missing genetic distance values. Some example code is below. (If your *.bim file is not too big, you can also open it in a text editor and look for “0”’s in the 3rd column.)

import numpy as np from pysnptools.snpreader import Bed from pysnptools.util import example_file # Download and return local file name

bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.",".bed")

bed = Bed(bedfile) nans = np.isnan(bed.pos)

for i,what in enumerate(["chromosome","genetic distance", "basepair distance"]): print(f"{nans[:,i].sum()} missing values of {nans.shape[0]} in {what}") output: 0 missing values of 1015 in chromosome 2 missing values of 1015 in genetic distance 17 missing values of 1015 in basepair distance

Please, let me know if this takes care of the problem.

Yours, Carl

Carl Kadie, Ph.D. FaST-LMM & PySnpTools Teamhttps://fastlmm.github.io/ (Microsoft Research, retired) https://www.linkedin.com/in/carlk/

Join the FaST-LMM user discussion and announcement list via emailmailto:fastlmm-user-join@python.org?subject=Subscribe (or use web sign uphttps://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fmail.python.org%2Fmailman3%2Flists%2Ffastlmm-user.python.org&data=02%7C01%7C%7C13a5c33d7cd84cad5cdf08d7bba56e20%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637184191498409587&sdata=2CQWjQEwOpQol2rQ1eoyVTgY8WvInV8UH31Wtl68FzY%3D&reserved=0)

From: Lucas Miranda notifications@github.com Sent: Saturday, December 12, 2020 12:48 PM To: fastlmm/FaST-LMM FaST-LMM@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [fastlmm/FaST-LMM] GenDist is NaN after running single_snp analysis (#11)

Hi! First of all, thank you for such a great tool! After using it a few times and getting the expected results, I now stumbled upon a dataset from which I get NaN values for all genetic distances. I'm running the code below, independently, for each chromosome of an imputed dataset. As K0 I'm passing the full unimputed genome for the same set of samples. The code is part of a SnakeMake pipeline; input.imp_bed and input.dst_bed are strings with the names of the mentioned testing and distance bed files, respectively. I checked all inputs for missing data, malformations, etc. and nothing I came up with seemed to be the case. I don't think it's a bug, but rather a mistake on my side. Is there any obvious hint? Thanks a lot!

snpdata = Bed(input.imp_bed)[:,::2].read() snpdist = Bed(input.dst_bed)[:,::2].read() results_df = single_snp(test_snps=snpdata, pheno=input.phenos, K0=snpdist, count_A1=False, leave_out_one_chrom=True) results_df.to_csv(output.gwas, sep="\t")

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Ffastlmm%2FFaST-LMM%2Fissues%2F11&data=04%7C01%7C%7C127207636d354e8963f808d89edf4752%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637434029090252591%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=1osSdDMQCAo%2FuPn2BR1aReZ0BXQherqzueN3EQDpHBU%3D&reserved=0, or unsubscribehttps://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FABR65P44OSFRYKYJ6DFLIM3SUPJJXANCNFSM4UYT367Q&data=04%7C01%7C%7C127207636d354e8963f808d89edf4752%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637434029090262582%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=MhYUz10Wdm7%2FkKlaOLBwm1aFhZD5pYBGBqbvobep1m8%3D&reserved=0.

lucasmiranda42 commented 3 years ago

Dear Carl, Thank you very much for your quick answer! I'll try out your suggestions this afternoon and let you know. Best and have a nice day, Lucas

lucasmiranda42 commented 3 years ago

Dear Carl,

I checked and all .bim files in the dataset indeed had a dummy value of '0' in the 3rd column. Adding genetic distance annotations with plink fixed the issue.

Thank you very much for your help! Lucas

CarlKCarlK commented 3 years ago

Lucas,

Sadly, the single_snp function just reports the values it finds in the *.bim file.

I wonder if you could look up the genetic distance from the SNP name. For example, if you know you have (human) rs6313, there seem to be on-line references (e.g. rs6313 - Wikipediahttps://en.wikipedia.org/wiki/Rs6313, rs6313 - SNPediahttps://www.snpedia.com/index.php/Rs6313) that give at least the position and I imagine genetic distance.

p.s. Also see these threads. Conversion of physical position (bp) of SNPs to genetic postion (cM) (researchgate.net)https://www.researchgate.net/post/Conversion-of-physical-position-bp-of-SNPs-to-genetic-postion-cM

From: Lucas Miranda notifications@github.com Sent: Monday, December 14, 2020 10:17 PM To: fastlmm/FaST-LMM FaST-LMM@noreply.github.com Cc: Carl Kadie carlk@msn.com; Comment comment@noreply.github.com Subject: Re: [fastlmm/FaST-LMM] GenDist is NaN after running single_snp analysis (#11)

Dear Carl,

I checked and absolutely all .bim files in the dataset have a dummy value of '0' in the 3rd columns of the .bim file. Does this prevent me of computing genetic distance during the single_snp analysis? Is there any workaround?

Thank you very much! Lucas

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Ffastlmm%2FFaST-LMM%2Fissues%2F11%23issuecomment-745078809&data=04%7C01%7C%7Cff316aa57d254786786a08d8a0c110fd%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637436098353266391%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=wYXtYb3sf2OPnOKbtnc%2FwnSaZprbWWOA64smdzI%2Bw5k%3D&reserved=0, or unsubscribehttps://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FABR65P55VQZWQFYVURYF5MLSU35OTANCNFSM4UYT367Q&data=04%7C01%7C%7Cff316aa57d254786786a08d8a0c110fd%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637436098353276383%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=fD7J65KcADPvzEFNlE08xsKrpeypykCbC2gCUI5iMQg%3D&reserved=0.