fastlmm / FaST-LMM

Python version of Factored Spectrally Transformed Linear Mixed Models
https://fastlmm.github.io/
Apache License 2.0
47 stars 11 forks source link

PValue Output Format #39

Closed snowformatics closed 7 months ago

snowformatics commented 7 months ago

Hi,

I have a quick question regarding the output format for the PValue from single_snp and single_snp_linreg. To confirm, the output provided is the raw PValue itself, not its -log10 transformation, correct?

Thanks Stefanie

CarlKCarlK commented 7 months ago

Yes, just PValue, no transformations.

From: snowformatics @.> Sent: Thursday, April 11, 2024 10:19 PM To: fastlmm/FaST-LMM @.> Cc: Subscribed @.***> Subject: [fastlmm/FaST-LMM] PValue Output Format (Issue #39) Importance: High

Hi,

I have a quick question regarding the output format for the PValue from single_snp and single_snp_linreg. To confirm, the output provided is the raw PValue itself, not its -log10 transformation, correct?

Thanks Stefanie

- Reply to this email directly, view it on GitHubhttps://github.com/fastlmm/FaST-LMM/issues/39, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABR65PY6P2HZKKVZQ4HSDTDY45VENAVCNFSM6AAAAABGDPQBPOVHI2DSMVQWIX3LMV43ASLTON2WKOZSGIZTSMJQGYYTSOI. You are receiving this because you are subscribed to this thread.Message ID: @.**@.>>

asifzubair commented 7 months ago

@CarlKCarlK I seem to be getting negative Pvalues ... any idea how can that happen ? Thanks!

CarlKCarlK commented 7 months ago

@asifzubair When using single_snp or something else? Numbers like -1 or -2 or more "random" like -0.00045?

asifzubair commented 7 months ago

Yes, I'm using single_snp. I get something like -45 .. could it be underflow/overflow ? I can find the exact output and paste it here later.

CarlKCarlK commented 7 months ago

Yes, the more info the better. Also, please tell me (remind me) of the approximate number of individuals, SNPs, & phenotypes (usually 1) and covariates.

asifzubair commented 7 months ago

Hi @CarlKCarlK , here is the output of the single_snp:

> results_df[["Chr", "ChrPos", "PValue"]]
Chr | ChrPos | PValue
-- | -- | --
4.0 | 0.000039 | -47.907086
3.0 | 0.000040 | 47.885116
4.0 | 0.000051 | -46.647437
4.0 | 0.000051 | -46.647437
4.0 | 0.000065 | -46.566270
... | ... | ...
8.0 | 0.999922 | 0.001081
4.0 | 0.999939 | 0.000921
3.0 | 0.999946 | 0.000777
3.0 | 0.999951 | -0.000716
13.0 | 0.999987 | -0.000197

I don't have covariates, but here are other variables:

PhenotypeName   bwt
SampleSize  513
SNPCount    79569
CarlKCarlK commented 7 months ago

Asif,

Thanks for the additional info!

I did a run on the same size of data:

import numpy as np
from fastlmm.association import single_snp
from fastlmm.util import example_file # Download and return local file name
from pysnptools.snpreader import Bed

# set up data
##############################
from fastlmm.util import example_file # Download and return local file name
bed_fn = # redacted
pheno_fn = # redacted
bed = Bed(bed_fn,count_A1=False)[::28,::5] #this sampling makes my data 534 x 71289
# display(bed.iid_count,bed.sid_count)

# run gwas
###################################################################
results_df = single_snp(bed, pheno_fn, count_A1=False)
print(results_df[["Chr", "ChrPos", "PValue"]])

30 seconds of processing on a face machine produces

        Chr       ChrPos    PValue
0       3.0    3197918.0  0.000005
1      10.0   72900091.0  0.000028
2       1.0  199945630.0  0.000038
3       5.0   86013053.0  0.000044
4       4.0  113240529.0  0.000046
...     ...          ...       ...
71284   5.0   61832737.0  0.999965
71285  12.0  109750626.0  0.999965
71286  22.0   37902926.0  0.999969
71287   6.0   89914187.0  0.999974
71288   3.0   35578596.0  0.999976

Two things I notice. I expect the ChrPos to be integers. I expect the PValue to be sorted. Any idea why this is not so in your output?

CarlKCarlK commented 7 months ago

One more thing: I used a beta version FastLmm that works better with the latest version of Numpy and SciPyi: pip install fastlmm==0.6.8b1

asifzubair commented 7 months ago

Hi @CarlKCarlK , thank you for taking a look at this. Yes, I was confused about the ChrPos as well. I thought it had something to do with how fastlmm ingests data, but perhaps there is something up with my input data. Does ChrPos matter when we construct the Kinship ? Or it is only used for marker identification ?

Great, I'll try to install the beta version... I have been having some installation errors, but hopefully won't see in beta version 👍

CarlKCarlK commented 7 months ago

ChrPos is not used but it's weird value maybe a sign of other problems.

Does ChrPos matter when we construct the Kinship By default, we leave out the whole chromosome on which we are testing, so the chrom number matters, but not the position within the chromosome.

asifzubair commented 7 months ago

Yes, I tried with the beta version and got the same results. I was wondering where the chromosome positions are picked up from. This is how the bim file looks like:

% head plink_assoc.bim
1       SSUPER_SCAFFOLD_1000192_18306   0       18306   A       G
1       SSUPER_SCAFFOLD_1000192_18402   0       18402   T       A
1       SSUPER_SCAFFOLD_1000192_18433   0       18433   C       G
1       SSUPER_SCAFFOLD_1000192_18437   0       18437   G       T
1       SSUPER_SCAFFOLD_1000192_18468   0       18468   C       G
1       SSUPER_SCAFFOLD_1000192_18498   0       18498   G       A
1       SSUPER_SCAFFOLD_1000192_35325   0       35325   C       A
1       SSUPER_SCAFFOLD_1000192_35334   0       35334   T       G
1       SSUPER_SCAFFOLD_1000192_35392   0       35392   G       A
1       SSUPER_SCAFFOLD_1000192_48497   0       48497   T       C
asifzubair commented 7 months ago

This is total of my calling script:

import fastlmm.association as association

BED_FP = "../analysis_filtered_genotypes_2/plink_assoc_geno_0.1_maf_0.01.bed"
phenos = ["bno", "bwt", "abwt"]
for p in phenos:
    PHENO_FP  = "../analysis_filtered_genotypes_2/max_pheno_normalised_" + p + ".pheno"
    results_df = association.single_snp(BED_FP, PHENO_FP, count_A1=True)
    results_df.to_csv("fastlmm_assoc_max_pheno_" + p + ".assoc", sep="\t", index=False, header=True)
asifzubair commented 7 months ago

I'm just confused how p-values can be negative ... shouldn't these be constrained to be in [0, 1] ?

CarlKCarlK commented 7 months ago

You are correct. The pValue should be betwen 0 and 1.

Please try adding a print statement to check if the original result_df is off before writing to disk. (Also, please check that the number of result lines is about 70,000) Something like:

import fastlmm.association as association

BED_FP = "plink_assoc_geno_0.1_maf_0.01.bed"
phenos = ["bno", "bwt", "abwt"]
for p in phenos:
    PHENO_FP  = "max_pheno_normalised_" + p + ".pheno"
    results_df = association.single_snp(BED_FP, PHENO_FP, count_A1=True)
    results_df.to_csv("fastlmm_assoc_max_pheno_" + p + ".assoc", sep="\t", index=False, header=True)
    print(results_df[["Chr", "ChrPos", "PValue"]])

To check that FaST-LMM is correctly reading your BIM file, please run this:

  from pysnptools.snpreader import Bed
import numpy as np
np.set_printoptions(suppress=True, precision=4)

BED_FP = "plink_assoc_geno_0.1_maf_0.01.bed"
bed = Bed(BED_FP,count_A1=True)
print(bed.pos)

It should print something like this (except for your data), where the first column is chromosome number, etc.

[[       1.            0.8309   752566.    ]
 [       1.            1.254   1130727.    ]
 [       1.            1.5168  1365570.    ]
 ...
 [      22.           74.9714 51016098.    ]
 [      22.           75.289  51099432.    ]
 [      22.           75.5082 51156933.    ]]
asifzubair commented 7 months ago

Oh, wow, yes, @CarlKCarlK , it had something to do with the pandas. When I print the results, this is what I see:

   Chr       ChrPos        PValue
0  5.0  127185949.0  1.143824e-07
1  5.0  127186127.0  1.143824e-07
2  5.0  127185971.0  1.143824e-07
3  1.0   92490592.0  2.698665e-07
4  1.0   93606892.0  6.357632e-07
 ...
79564  11.0  109297703.0  0.999996
79565  11.0  109297753.0  0.999996
79566  11.0  109297757.0  0.999996
79567  11.0  109297715.0  0.999996
79568   1.0   78206372.0  0.999998

Also, I checked that input to the bim file seems to be correct:

> bed = Bed(BED_FP,count_A1=True)
> print(bed.pos)
[[       1.       nan    18306.]
 [       1.       nan    18402.]
 [       1.       nan    18433.]
 ...
 [      16.       nan 47825132.]
 [      16.       nan 47825520.]
 [      16.       nan 47852030.]]

I suppose for now I'll try pickling the dataframe and see if that works better for storage. I'm sorry, I hadn't encountered something like this before and appreciate your helpful suggestion. Thanks, again.

CarlKCarlK commented 7 months ago

Excellent!

I'm sure you can get Panda saving to TSV working now that you know what to look for. (I worry about Pickle for long term storage because in the future you might not be able to get data back out.)

From: Asif Zubair @.> Sent: Monday, April 29, 2024 8:16 PM To: fastlmm/FaST-LMM @.> Cc: Carl Kadie @.>; Mention @.> Subject: Re: [fastlmm/FaST-LMM] PValue Output Format (Issue #39) Importance: High

Oh, wow, yes, @CarlKCarlKhttps://github.com/CarlKCarlK , it had something to do with the pandas. When I print the results, this is what I see:

Chr ChrPos PValue

0 5.0 127185949.0 1.143824e-07

1 5.0 127186127.0 1.143824e-07

2 5.0 127185971.0 1.143824e-07

3 1.0 92490592.0 2.698665e-07

4 1.0 93606892.0 6.357632e-07

...

79564 11.0 109297703.0 0.999996

79565 11.0 109297753.0 0.999996

79566 11.0 109297757.0 0.999996

79567 11.0 109297715.0 0.999996

79568 1.0 78206372.0 0.999998

Also, I checked that input to the bim file seems to be correct:

bed = Bed(BED_FP,count_A1=True)

print(bed.pos)

[[ 1. nan 18306.]

[ 1. nan 18402.]

[ 1. nan 18433.]

...

[ 16. nan 47825132.]

[ 16. nan 47825520.]

[ 16. nan 47852030.]]

I suppose for now I'll try pickling the dataframe and see if that works better for storage. I'm sorry, I hadn't encountered something like this before and appreciate your helpful suggestion. Thanks, again.

- Reply to this email directly, view it on GitHubhttps://github.com/fastlmm/FaST-LMM/issues/39#issuecomment-2084296675, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABR65P4A6WXYUAMOJ4JJSL3Y74EGHAVCNFSM6AAAAABGDPQBPOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOBUGI4TMNRXGU. You are receiving this because you were mentioned.Message ID: @.**@.>>