Closed snowformatics closed 7 months ago
Yes, just PValue, no transformations.
From: snowformatics @.> Sent: Thursday, April 11, 2024 10:19 PM To: fastlmm/FaST-LMM @.> Cc: Subscribed @.***> Subject: [fastlmm/FaST-LMM] PValue Output Format (Issue #39) Importance: High
Hi,
I have a quick question regarding the output format for the PValue from single_snp and single_snp_linreg. To confirm, the output provided is the raw PValue itself, not its -log10 transformation, correct?
Thanks Stefanie
- Reply to this email directly, view it on GitHubhttps://github.com/fastlmm/FaST-LMM/issues/39, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABR65PY6P2HZKKVZQ4HSDTDY45VENAVCNFSM6AAAAABGDPQBPOVHI2DSMVQWIX3LMV43ASLTON2WKOZSGIZTSMJQGYYTSOI. You are receiving this because you are subscribed to this thread.Message ID: @.**@.>>
@CarlKCarlK I seem to be getting negative Pvalues ... any idea how can that happen ? Thanks!
@asifzubair When using single_snp or something else? Numbers like -1 or -2 or more "random" like -0.00045?
Yes, I'm using single_snp
. I get something like -45 .. could it be underflow/overflow ? I can find the exact output and paste it here later.
Yes, the more info the better. Also, please tell me (remind me) of the approximate number of individuals, SNPs, & phenotypes (usually 1) and covariates.
Hi @CarlKCarlK , here is the output of the single_snp
:
> results_df[["Chr", "ChrPos", "PValue"]]
Chr | ChrPos | PValue
-- | -- | --
4.0 | 0.000039 | -47.907086
3.0 | 0.000040 | 47.885116
4.0 | 0.000051 | -46.647437
4.0 | 0.000051 | -46.647437
4.0 | 0.000065 | -46.566270
... | ... | ...
8.0 | 0.999922 | 0.001081
4.0 | 0.999939 | 0.000921
3.0 | 0.999946 | 0.000777
3.0 | 0.999951 | -0.000716
13.0 | 0.999987 | -0.000197
I don't have covariates, but here are other variables:
PhenotypeName bwt
SampleSize 513
SNPCount 79569
Asif,
Thanks for the additional info!
I did a run on the same size of data:
import numpy as np
from fastlmm.association import single_snp
from fastlmm.util import example_file # Download and return local file name
from pysnptools.snpreader import Bed
# set up data
##############################
from fastlmm.util import example_file # Download and return local file name
bed_fn = # redacted
pheno_fn = # redacted
bed = Bed(bed_fn,count_A1=False)[::28,::5] #this sampling makes my data 534 x 71289
# display(bed.iid_count,bed.sid_count)
# run gwas
###################################################################
results_df = single_snp(bed, pheno_fn, count_A1=False)
print(results_df[["Chr", "ChrPos", "PValue"]])
30 seconds of processing on a face machine produces
Chr ChrPos PValue
0 3.0 3197918.0 0.000005
1 10.0 72900091.0 0.000028
2 1.0 199945630.0 0.000038
3 5.0 86013053.0 0.000044
4 4.0 113240529.0 0.000046
... ... ... ...
71284 5.0 61832737.0 0.999965
71285 12.0 109750626.0 0.999965
71286 22.0 37902926.0 0.999969
71287 6.0 89914187.0 0.999974
71288 3.0 35578596.0 0.999976
Two things I notice. I expect the ChrPos to be integers. I expect the PValue to be sorted. Any idea why this is not so in your output?
One more thing: I used a beta version FastLmm that works better with the latest version of Numpy and SciPyi: pip install fastlmm==0.6.8b1
Hi @CarlKCarlK , thank you for taking a look at this. Yes, I was confused about the ChrPos
as well. I thought it had something to do with how fastlmm
ingests data, but perhaps there is something up with my input data. Does ChrPos
matter when we construct the Kinship ? Or it is only used for marker identification ?
Great, I'll try to install the beta version... I have been having some installation errors, but hopefully won't see in beta version 👍
ChrPos is not used but it's weird value maybe a sign of other problems.
Does ChrPos matter when we construct the Kinship By default, we leave out the whole chromosome on which we are testing, so the chrom number matters, but not the position within the chromosome.
Yes, I tried with the beta version and got the same results. I was wondering where the chromosome positions are picked up from. This is how the bim
file looks like:
% head plink_assoc.bim
1 SSUPER_SCAFFOLD_1000192_18306 0 18306 A G
1 SSUPER_SCAFFOLD_1000192_18402 0 18402 T A
1 SSUPER_SCAFFOLD_1000192_18433 0 18433 C G
1 SSUPER_SCAFFOLD_1000192_18437 0 18437 G T
1 SSUPER_SCAFFOLD_1000192_18468 0 18468 C G
1 SSUPER_SCAFFOLD_1000192_18498 0 18498 G A
1 SSUPER_SCAFFOLD_1000192_35325 0 35325 C A
1 SSUPER_SCAFFOLD_1000192_35334 0 35334 T G
1 SSUPER_SCAFFOLD_1000192_35392 0 35392 G A
1 SSUPER_SCAFFOLD_1000192_48497 0 48497 T C
This is total of my calling script:
import fastlmm.association as association
BED_FP = "../analysis_filtered_genotypes_2/plink_assoc_geno_0.1_maf_0.01.bed"
phenos = ["bno", "bwt", "abwt"]
for p in phenos:
PHENO_FP = "../analysis_filtered_genotypes_2/max_pheno_normalised_" + p + ".pheno"
results_df = association.single_snp(BED_FP, PHENO_FP, count_A1=True)
results_df.to_csv("fastlmm_assoc_max_pheno_" + p + ".assoc", sep="\t", index=False, header=True)
I'm just confused how p-values can be negative ... shouldn't these be constrained to be in [0, 1] ?
You are correct. The pValue should be betwen 0 and 1.
Please try adding a print statement to check if the original result_df is off before writing to disk. (Also, please check that the number of result lines is about 70,000) Something like:
import fastlmm.association as association
BED_FP = "plink_assoc_geno_0.1_maf_0.01.bed"
phenos = ["bno", "bwt", "abwt"]
for p in phenos:
PHENO_FP = "max_pheno_normalised_" + p + ".pheno"
results_df = association.single_snp(BED_FP, PHENO_FP, count_A1=True)
results_df.to_csv("fastlmm_assoc_max_pheno_" + p + ".assoc", sep="\t", index=False, header=True)
print(results_df[["Chr", "ChrPos", "PValue"]])
To check that FaST-LMM is correctly reading your BIM file, please run this:
from pysnptools.snpreader import Bed
import numpy as np
np.set_printoptions(suppress=True, precision=4)
BED_FP = "plink_assoc_geno_0.1_maf_0.01.bed"
bed = Bed(BED_FP,count_A1=True)
print(bed.pos)
It should print something like this (except for your data), where the first column is chromosome number, etc.
[[ 1. 0.8309 752566. ]
[ 1. 1.254 1130727. ]
[ 1. 1.5168 1365570. ]
...
[ 22. 74.9714 51016098. ]
[ 22. 75.289 51099432. ]
[ 22. 75.5082 51156933. ]]
Oh, wow, yes, @CarlKCarlK , it had something to do with the pandas
. When I print the results, this is what I see:
Chr ChrPos PValue
0 5.0 127185949.0 1.143824e-07
1 5.0 127186127.0 1.143824e-07
2 5.0 127185971.0 1.143824e-07
3 1.0 92490592.0 2.698665e-07
4 1.0 93606892.0 6.357632e-07
...
79564 11.0 109297703.0 0.999996
79565 11.0 109297753.0 0.999996
79566 11.0 109297757.0 0.999996
79567 11.0 109297715.0 0.999996
79568 1.0 78206372.0 0.999998
Also, I checked that input to the bim
file seems to be correct:
> bed = Bed(BED_FP,count_A1=True)
> print(bed.pos)
[[ 1. nan 18306.]
[ 1. nan 18402.]
[ 1. nan 18433.]
...
[ 16. nan 47825132.]
[ 16. nan 47825520.]
[ 16. nan 47852030.]]
I suppose for now I'll try pickling the dataframe and see if that works better for storage. I'm sorry, I hadn't encountered something like this before and appreciate your helpful suggestion. Thanks, again.
Excellent!
I'm sure you can get Panda saving to TSV working now that you know what to look for. (I worry about Pickle for long term storage because in the future you might not be able to get data back out.)
From: Asif Zubair @.> Sent: Monday, April 29, 2024 8:16 PM To: fastlmm/FaST-LMM @.> Cc: Carl Kadie @.>; Mention @.> Subject: Re: [fastlmm/FaST-LMM] PValue Output Format (Issue #39) Importance: High
Oh, wow, yes, @CarlKCarlKhttps://github.com/CarlKCarlK , it had something to do with the pandas. When I print the results, this is what I see:
Chr ChrPos PValue
0 5.0 127185949.0 1.143824e-07
1 5.0 127186127.0 1.143824e-07
2 5.0 127185971.0 1.143824e-07
3 1.0 92490592.0 2.698665e-07
4 1.0 93606892.0 6.357632e-07
...
79564 11.0 109297703.0 0.999996
79565 11.0 109297753.0 0.999996
79566 11.0 109297757.0 0.999996
79567 11.0 109297715.0 0.999996
79568 1.0 78206372.0 0.999998
Also, I checked that input to the bim file seems to be correct:
bed = Bed(BED_FP,count_A1=True)
print(bed.pos)
[[ 1. nan 18306.]
[ 1. nan 18402.]
[ 1. nan 18433.]
...
[ 16. nan 47825132.]
[ 16. nan 47825520.]
[ 16. nan 47852030.]]
I suppose for now I'll try pickling the dataframe and see if that works better for storage. I'm sorry, I hadn't encountered something like this before and appreciate your helpful suggestion. Thanks, again.
- Reply to this email directly, view it on GitHubhttps://github.com/fastlmm/FaST-LMM/issues/39#issuecomment-2084296675, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABR65P4A6WXYUAMOJ4JJSL3Y74EGHAVCNFSM6AAAAABGDPQBPOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOBUGI4TMNRXGU. You are receiving this because you were mentioned.Message ID: @.**@.>>
Hi,
I have a quick question regarding the output format for the PValue from single_snp and single_snp_linreg. To confirm, the output provided is the raw PValue itself, not its -log10 transformation, correct?
Thanks Stefanie