p-values > 1

bschilder commented 2 years ago

After munging a bunch of summary stats from OpenGWAS and trying to convert them to MAGMA files, I noticed MAGMA was throwing this error for a number of datasets, indicating that some values in the "P" column are >1 even after munging. Do we have any checks in place for this situation?

Sources

Possible reasons why this might occur in these cases:

"P" column means something other than p-value.
A bug in their p-value calculation that doesn't prevent the upper limit from being >1.
Some transformation we did in the course of munging (e.g. logging/anti-logging when we shouldn't have).
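One way to probe the logging/anti-logging hypothesis is to check whether the column behaves like -log10(p): such a column is never negative but exceeds 1 whenever p < 0.1. The sketch below is illustrative Python, not MungeSumstats code, and the helper names are hypothetical. The heuristic cannot prove the column is log-transformed; it only shows that the observed range is compatible with that explanation.

```python
import math

def to_neg_log10(p_values):
    """Transform p-values to the -log10 scale."""
    return [-math.log10(p) for p in p_values]

def from_neg_log10(values):
    """Invert the -log10 transform to recover p-values."""
    return [10 ** -v for v in values]

def looks_like_neg_log10(values):
    """Heuristic only: non-negative everywhere and above 1 somewhere."""
    return bool(values) and min(values) >= 0 and max(values) > 1

true_p = [0.5, 0.03, 1e-8]
logged = to_neg_log10(true_p)           # roughly [0.30, 1.52, 8.0]
print(looks_like_neg_log10(logged))     # the logged column passes the heuristic
print(looks_like_neg_log10(true_p))     # genuine p-values do not
```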

Here's one of the problematic files: https://gwas.mrcieu.ac.uk/datasets/ubm-a-103/

I read the munged file in and got some more info on this P column. Based on this distribution, it definitely does seem like something is awry.

[Image: histogram of the P column in the munged file]

Solutions

Check that all p-values are between 0 and 1, and if they aren't, provide a warning to the user with some info on this (e.g. min/max values in this column, the number of rows with abnormal p-values) so that they can be aware of any larger issues with how their sumstats files were created to begin with.
If they meet some criterion (e.g. no negative values) we can automatically cut off the "p-values" at 1.
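The two proposals above could be sketched roughly as follows. This is illustrative Python rather than the package's implementation; the function name `check_p_range` and the `clip_large` parameter are hypothetical. It warns with the min, max, and counts of out-of-range values, and clips at 1 only when no negative values are present.

```python
import warnings

def check_p_range(p_values, clip_large=True):
    """Warn about p-values outside [0, 1]; optionally clip values above 1."""
    n_large = sum(1 for p in p_values if p > 1)
    n_neg = sum(1 for p in p_values if p < 0)
    if n_large or n_neg:
        warnings.warn(
            f"P column has {n_large} value(s) > 1 and {n_neg} value(s) < 0 "
            f"(min={min(p_values)}, max={max(p_values)})"
        )
    # Clip only when the criterion from the proposal holds: no negatives.
    if clip_large and n_large and n_neg == 0:
        return [min(p, 1.0) for p in p_values]
    return list(p_values)

clipped = check_p_range([0.2, 1.7, 3.0])   # warns, then clips 1.7 and 3.0 to 1.0
```

Leaving the column untouched when negatives are present is a deliberate choice in this sketch: negative values suggest the column is not a p-value at all, so silently clipping would hide the problem.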

Al-Murphy commented 2 years ago

No, this isn't something I thought to add, but it definitely sounds reasonable. This check should come after the small p-value check though: https://github.com/neurogenomics/MungeSumstats/blob/master/R/check_small_p_val.R as in certain instances the p column could be read in as a character field up to this point.
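To illustrate why the ordering matters, here is a small Python sketch (not the package's code; `parse_p_column` is a hypothetical helper) of what happens when a character column is converted to doubles. Values smaller than the smallest representable double silently become 0, so a numeric range check is only meaningful after this conversion has been handled.

```python
from decimal import Decimal

def parse_p_column(raw):
    """Convert string p-values to floats and list the ones that underflowed to 0."""
    values = [float(s) for s in raw]
    # A string that is non-zero as an exact decimal but 0.0 as a double has underflowed.
    underflowed = [s for s, v in zip(raw, values) if v == 0.0 and Decimal(s) != 0]
    return values, underflowed

values, underflowed = parse_p_column(["0.05", "1e-400", "0", "1.2"])
print(values)        # "1e-400" is now indistinguishable from a true zero
print(underflowed)   # only "1e-400" is reported
```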

Do you want to add this to the branch you were working on? You should be able to copy the template from check_small_p_val() (make sure to include the imputation indicator, lines 49-51) and then add the warnings and a parameter to remove p-values >1 and <0, with a default of TRUE?

bschilder commented 2 years ago

Sounds good, I was actually thinking the same thing. I'll work on that today.

bschilder commented 2 years ago

Added the changes to the NEWS, but here's the part relevant to this Issue:

Added checks for p-values >1 or <0 via args convert_large_p and convert_neg_p, respectively. These are both handled by the new internal function check_range_p_val, which also reports the number of SNPs found meeting these criteria to the console/logs.
check_small_p_val records which SNPs were imputed in a more robust way, by recording which SNPs met the criteria before making the changes (as opposed to inferring this info from which columns are 0 after making the changes). This function now only handles non-negative p-values, so that rows with negative p-values can be recorded/reported separately in the check_range_p_val step.
check_small_p_val now reports the number of SNPs <= 5e-324 to console/logs.
Unit tests have been added for both check_range_p_val and check_small_p_val.
parse_logs can now extract information reported by check_range_p_val and check_small_p_val.
New internal function logs_example provides easy access to log file stored in inst/extdata, and includes documentation on how it was created.
Both check_range_p_val and check_small_p_val now use #' @inheritParams format_sumstats to improve consistency of documentation.
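The "record before changing" point can be illustrated with a short Python sketch. This is not the R implementation: the argument names convert_large_p and convert_neg_p are borrowed from the NEWS entry, but the function `flag_then_convert` and the replacement values (1 for large, 0 for negative) are assumptions made for the example.

```python
SMALLEST_DOUBLE = 5e-324  # smallest positive double, the threshold named above

def flag_then_convert(p_values, convert_large_p=True, convert_neg_p=True):
    """Record which rows meet each criterion first, then apply the conversions."""
    flags = {
        "small": [0 <= p <= SMALLEST_DOUBLE for p in p_values],
        "large": [p > 1 for p in p_values],
        "negative": [p < 0 for p in p_values],
    }
    converted = []
    for p in p_values:
        if p > 1 and convert_large_p:
            p = 1.0  # assumed replacement value
        elif p < 0 and convert_neg_p:
            p = 0.0  # assumed replacement value
        converted.append(p)
    return converted, flags

converted, flags = flag_then_convert([0.0, -0.3, 2.5, 0.04, 5e-324])
zeros_afterwards = [p == 0 for p in converted]
print(flags["small"])     # rows 1 and 5 were small before any change
print(zeros_afterwards)   # rows 1 and 2 are zero afterwards, a different set
```

Under these assumptions, inferring the small-p rows from zeros after conversion both picks up the converted negative value and misses the row equal to 5e-324, which is why the flags are computed up front.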

Al-Murphy / MungeSumstats

p-values > 1 #76
