Bad statistics - Githubissues

biomguler commented 7 months ago

Thank you for this great tool! I couldn't understand why it removed all variants because of bad statistics. Do you have any idea?

mysumstats.basiccheck() Fri Nov 17 22:24:41 2023 Start to check IDs... Fri Nov 17 22:24:41 2023 -Current Dataframe shape : 11169426 x 15 Fri Nov 17 22:24:41 2023 -Checking if SNPID is chr:pos:ref:alt...(separator: - ,: , ) Fri Nov 17 22:25:34 2023 Finished checking IDs successfully! Fri Nov 17 22:25:34 2023 Start to fix chromosome notation... Fri Nov 17 22:25:34 2023 -Current Dataframe shape : 11169426 x 15 Fri Nov 17 22:25:34 2023 -Checking CHR data type... Fri Nov 17 22:25:47 2023 -Variants with standardized chromosome notation: 11169426 Fri Nov 17 22:25:51 2023 -All CHR are already fixed... Fri Nov 17 22:26:17 2023 -Sanity check for CHR... Fri Nov 17 22:26:21 2023 -Removed 0 varaints with CHR < 1... Fri Nov 17 22:27:01 2023 Finished fixing chromosome notation successfully! Fri Nov 17 22:27:01 2023 Start to fix basepair positions... Fri Nov 17 22:27:01 2023 -Current Dataframe shape : 11169426 x 15 Fri Nov 17 22:27:01 2023 -Converting to Int64 data type ... Fri Nov 17 22:27:43 2023 -Position upper_bound is: 250,000,000 Fri Nov 17 22:27:47 2023 -Remove outliers: 0 Fri Nov 17 22:27:50 2023 -Converted all position to datatype Int64. Fri Nov 17 22:27:50 2023 Finished fixing basepair position successfully! Fri Nov 17 22:27:51 2023 Start to fix alleles... Fri Nov 17 22:27:51 2023 -Current Dataframe shape : 11169426 x 15 Fri Nov 17 22:27:51 2023 -Converted all bases to string datatype and UPPERCASE. Fri Nov 17 22:27:58 2023 -Detected 0 variants with alleles that contain bases other than A/C/T/G . Fri Nov 17 22:28:57 2023 Finished fixing allele successfully! Fri Nov 17 22:28:57 2023 Start sanity check for statistics ... Fri Nov 17 22:28:57 2023 -Current Dataframe shape : 11169426 x 15 Fri Nov 17 22:28:58 2023 -Checking if 0 <=N<= 2147483647 ... Fri Nov 17 22:29:02 2023 -Removed 0 variants with bad N. Fri Nov 17 22:29:02 2023 -Checking if 0 <=N_CASE<= 2147483647 ... Fri Nov 17 22:29:05 2023 -Removed 0 variants with bad N_CASE. Fri Nov 17 22:29:05 2023 -Checking if 0 <=N_CONTROL<= 2147483647 ... Fri Nov 17 22:29:08 2023 -Removed 0 variants with bad N_CONTROL. Fri Nov 17 22:29:08 2023 -Checking if N = N_CASE + N_CONTROL ... Fri Nov 17 22:29:08 2023 -Removed 11169426 variants with N != N_CASE + N_CONTROL. Fri Nov 17 22:29:08 2023 -Checking if 0 <EAF< 1 ... Fri Nov 17 22:29:08 2023 -Removed 0 variants with bad EAF. Fri Nov 17 22:29:08 2023 -Checking if 0 <=MAC<= 2147483647 ... Fri Nov 17 22:29:08 2023 -Removed 0 variants with bad MAC. Fri Nov 17 22:29:08 2023 -Checking if 0 <CHISQ< inf ... Fri Nov 17 22:29:08 2023 -Removed 0 variants with bad CHISQ. Fri Nov 17 22:29:08 2023 -Checking if 5e-300 < P < 1.000001 ... Fri Nov 17 22:29:08 2023 -Removed 0 variants with bad P. Fri Nov 17 22:29:08 2023 -Checking if -10 <BETA< 10 ... Fri Nov 17 22:29:08 2023 -Removed 0 variants with bad BETA. Fri Nov 17 22:29:08 2023 -Checking if 0 <SE< inf ... Fri Nov 17 22:29:08 2023 -Removed 0 variants with bad SE. Fri Nov 17 22:29:08 2023 -Checking if 0 <INFO< 1.000001 ... Fri Nov 17 22:29:08 2023 -Removed 0 variants with bad INFO. Fri Nov 17 22:29:08 2023 -Checking STATUS and converting STATUS to categories.... Fri Nov 17 22:29:09 2023 -Dropping 0 variants with NAs in the checked columns... Fri Nov 17 22:29:09 2023 -Removed 11169426 variants with bad statistics in total. Fri Nov 17 22:29:09 2023 -Data types for each column: Fri Nov 17 22:29:09 2023 -Column: SNPID CHR POS EA NEA EAF BETA SE CHISQ P INFO N N_CASE N_CONTROL STATUS Fri Nov 17 22:29:09 2023 -DType : object Int64 Int64 category category float32 float32 float32 float32 float64 float32 Int32 Int32 Int32 category Fri Nov 17 22:29:09 2023 Finished sanity check successfully! Fri Nov 17 22:29:09 2023 Start to normalize variants... Fri Nov 17 22:29:09 2023 -Current Dataframe shape : 0 x 15 Fri Nov 17 22:29:10 2023 -No available variants to normalize.. Fri Nov 17 22:29:10 2023 Finished normalizing variants successfully! Fri Nov 17 22:29:10 2023 Start to sort the genome coordinates... Fri Nov 17 22:29:10 2023 -Current Dataframe shape : 0 x 15 Fri Nov 17 22:29:10 2023 -Sorting genome coordinates... Fri Nov 17 22:29:10 2023 Finished sorting genome coordinates successfully! Fri Nov 17 22:29:10 2023 Start to reorder the columns... Fri Nov 17 22:29:10 2023 -Current Dataframe shape : 0 x 15 Fri Nov 17 22:29:10 2023 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,CHISQ,P,INFO,N,N_CASE,N_CONTROL,STATUS Fri Nov 17 22:29:10 2023 Finished sorting columns successfully!

biomguler commented 7 months ago

I found it! Removed 11169426 variants with N != N_CASE + N_CONTROL.

Thanks!

Cell_B.basiccheck() Fri Nov 17 22:24:41 2023 Start to check IDs... Fri Nov 17 22:24:41 2023 -Current Dataframe shape : 11169426 x 15 Fri Nov 17 22:24:41 2023 -Checking if SNPID is chr:pos:ref:alt...(separator: - ,: , ) Fri Nov 17 22:25:34 2023 Finished checking IDs successfully! Fri Nov 17 22:25:34 2023 Start to fix chromosome notation... Fri Nov 17 22:25:34 2023 -Current Dataframe shape : 11169426 x 15 Fri Nov 17 22:25:34 2023 -Checking CHR data type... Fri Nov 17 22:25:47 2023 -Variants with standardized chromosome notation: 11169426 Fri Nov 17 22:25:51 2023 -All CHR are already fixed... Fri Nov 17 22:26:17 2023 -Sanity check for CHR... Fri Nov 17 22:26:21 2023 -Removed 0 varaints with CHR < 1... Fri Nov 17 22:27:01 2023 Finished fixing chromosome notation successfully! Fri Nov 17 22:27:01 2023 Start to fix basepair positions... Fri Nov 17 22:27:01 2023 -Current Dataframe shape : 11169426 x 15 Fri Nov 17 22:27:01 2023 -Converting to Int64 data type ... Fri Nov 17 22:27:43 2023 -Position upper_bound is: 250,000,000 Fri Nov 17 22:27:47 2023 -Remove outliers: 0 Fri Nov 17 22:27:50 2023 -Converted all position to datatype Int64. Fri Nov 17 22:27:50 2023 Finished fixing basepair position successfully! Fri Nov 17 22:27:51 2023 Start to fix alleles... Fri Nov 17 22:27:51 2023 -Current Dataframe shape : 11169426 x 15 Fri Nov 17 22:27:51 2023 -Converted all bases to string datatype and UPPERCASE. Fri Nov 17 22:27:58 2023 -Detected 0 variants with alleles that contain bases other than A/C/T/G . Fri Nov 17 22:28:57 2023 Finished fixing allele successfully! Fri Nov 17 22:28:57 2023 Start sanity check for statistics ... Fri Nov 17 22:28:57 2023 -Current Dataframe shape : 11169426 x 15 Fri Nov 17 22:28:58 2023 -Checking if 0 <=N<= 2147483647 ... Fri Nov 17 22:29:02 2023 -Removed 0 variants with bad N. Fri Nov 17 22:29:02 2023 -Checking if 0 <=N_CASE<= 2147483647 ... Fri Nov 17 22:29:05 2023 -Removed 0 variants with bad N_CASE. Fri Nov 17 22:29:05 2023 -Checking if 0 <=N_CONTROL<= 2147483647 ... Fri Nov 17 22:29:08 2023 -Removed 0 variants with bad N_CONTROL. Fri Nov 17 22:29:08 2023 -Checking if N = N_CASE + N_CONTROL ... Fri Nov 17 22:29:08 2023 -Removed 11169426 variants with N != N_CASE + N_CONTROL. Fri Nov 17 22:29:08 2023 -Checking if 0 <EAF< 1 ... Fri Nov 17 22:29:08 2023 -Removed 0 variants with bad EAF. Fri Nov 17 22:29:08 2023 -Checking if 0 <=MAC<= 2147483647 ... Fri Nov 17 22:29:08 2023 -Removed 0 variants with bad MAC. Fri Nov 17 22:29:08 2023 -Checking if 0 <CHISQ< inf ... Fri Nov 17 22:29:08 2023 -Removed 0 variants with bad CHISQ. Fri Nov 17 22:29:08 2023 -Checking if 5e-300 < P < 1.000001 ... Fri Nov 17 22:29:08 2023 -Removed 0 variants with bad P. Fri Nov 17 22:29:08 2023 -Checking if -10 <BETA< 10 ... Fri Nov 17 22:29:08 2023 -Removed 0 variants with bad BETA. Fri Nov 17 22:29:08 2023 -Checking if 0 <SE< inf ... Fri Nov 17 22:29:08 2023 -Removed 0 variants with bad SE. Fri Nov 17 22:29:08 2023 -Checking if 0 <INFO< 1.000001 ... Fri Nov 17 22:29:08 2023 -Removed 0 variants with bad INFO. Fri Nov 17 22:29:08 2023 -Checking STATUS and converting STATUS to categories.... Fri Nov 17 22:29:09 2023 -Dropping 0 variants with NAs in the checked columns... Fri Nov 17 22:29:09 2023 -Removed 11169426 variants with bad statistics in total. Fri Nov 17 22:29:09 2023 -Data types for each column: Fri Nov 17 22:29:09 2023 -Column: SNPID CHR POS EA NEA EAF BETA SE CHISQ P INFO N N_CASE N_CONTROL STATUS Fri Nov 17 22:29:09 2023 -DType : object Int64 Int64 category category float32 float32 float32 float32 float64 float32 Int32 Int32 Int32 category Fri Nov 17 22:29:09 2023 Finished sanity check successfully! Fri Nov 17 22:29:09 2023 Start to normalize variants... Fri Nov 17 22:29:09 2023 -Current Dataframe shape : 0 x 15 Fri Nov 17 22:29:10 2023 -No available variants to normalize.. Fri Nov 17 22:29:10 2023 Finished normalizing variants successfully! Fri Nov 17 22:29:10 2023 Start to sort the genome coordinates... Fri Nov 17 22:29:10 2023 -Current Dataframe shape : 0 x 15 Fri Nov 17 22:29:10 2023 -Sorting genome coordinates... Fri Nov 17 22:29:10 2023 Finished sorting genome coordinates successfully! Fri Nov 17 22:29:10 2023 Start to reorder the columns... Fri Nov 17 22:29:10 2023 -Current Dataframe shape : 0 x 15 Fri Nov 17 22:29:10 2023 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,CHISQ,P,INFO,N,N_CASE,N_CONTROL,STATUS Fri Nov 17 22:29:10 2023 Finished sorting columns successfully!

biomguler commented 7 months ago

I think still this should be open, because in my case, N == N_CASE + N_CONTROL. Is there any bug? A few lines from my sumstat:

CHROM | GENPOS | ID | ALLELE0 | ALLELE1 | A1FREQ | A1FREQ_CASES | A1FREQ_CONTROLS | INFO | N | N_CASES | N_CONTROLS | TEST | BETA | SE | CHISQ | LOG10P | EXTRA | SNP | P -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- 10 | 80124 | rs61839046 | A | G | 0.356468 | 0.356804 | 0.356465 | 0.907747 | 279620 | 2124 | 277496 | ADD | 0.004602 | 0.03425 | 0.018053 | 0.049092 | NA | 10:80124:A:G | 0.893116 10 | 86778 | rs28886014 | G | T | 0.400289 | 0.400641 | 0.400286 | 0.919689 | 279620 | 2124 | 277496 | ADD | 4.97E-05 | 0.033134 | 2.25E-06 | 0.00052 | NA | 10:86778:G:T | 0.998803 10 | 86782 | rs28765175 | A | T | 0.395905 | 0.396808 | 0.395898 | 0.91051 | 279620 | 2124 | 277496 | ADD | 0.001996 | 0.033413 | 0.003567 | 0.021192 | NA | 10:86782:A:T | 0.952376 10 | 88725 | rs10903893 | A | G | 0.798733 | 0.805857 | 0.798679 | 0.901452 | 279620 | 2124 | 277496 | ADD | 0.048423 | 0.040977 | 1.39643 | 0.624663 | NA | 10:88725:A:G | 0.237321

biomguler commented 7 months ago

I think I found the bug. I changed the code as below, and it is working without any problems. Can you please change it? fixdata.py line 788 matched_n = sumstats.loc[:,"N"] == sumstats.loc[:,"N_CASE"] + sumstats.loc[:,"N_CASE"]

it should be like matched_n = sumstats.loc[:,"N"] == sumstats.loc[:,"N_CASE"] + sumstats.loc[:,"N_CONTROL"]

Cloufield commented 7 months ago

Oops, sorry for the bug! I will fix it in the next version. Thanks a lot for your feedback.

Maube for now, you can skip N_CASE and N_CONTROL, or load them like other=["N_CASE ","N_CONTROL"].

Cloufield commented 7 months ago

Actually you could also specify which column to check using coltocheck using the current version like .basic_check(sanitycheckstats_args={"coltocheck":["MLOG10P","INFO","Z","BETA","SE","EAF","CHISQ","F","N","STATUS"]})

Cloufield commented 7 months ago

Hi, This is fixed in v3.4.32.

biomguler commented 7 months ago

Thanks, i just changed my local py and it worked. Thank you for this great tool!

Cloufield / gwaslab

Bad statistics #69