Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
75 stars 16 forks source link

format_sumstats() does not check column type #161

Closed rmgpanw closed 1 year ago

rmgpanw commented 1 year ago

1. Bug description

I have recently tried using format_sumstats() on a file where the header was repeated in a handful of rows. This resulted in all columns being read into R as type character. The logs showed that the majority of rows had been removed due to BP being greater than the maximum expected value for each chromosome.

Expected behaviour

I eventually worked out the issue with my summary stats file, but I think it would be helpful if format_sumstats() could check column types are as expected, and either coerce to type numeric or raise an error. Perhaps before this line in check_bp_range()?

Thanks for considering.

2. Reproducible example

# Reprex - format
library(MungeSumstats)
library(data.table)

# Example data - presence of row of 'X's means these will all be read into R as
# type character
file_path <- tempfile()

df <- tibble::tribble(
                  ~CHR,    ~BP, ~A1, ~A2,  ~FRQ, ~BETA,    ~SE,     ~P,    ~N,
                   "1",    "1", "C", "G", "0.1", "0.1", "0.01", "0.05", "100",
                   "1", "1000", "G", "C", "0.1", "0.1", "0.01", "0.05", "100",
                   "1", "2000", "G", "C", "0.1", "0.1", "0.01", "0.05", "100",
                   "1",    "3", "A", "T", "0.1", "0.1", "0.01", "0.05", "100",
                   "1",    "4", "G", "C", "0.1", "0.1", "0.01", "0.05", "100",
                   "X",    "X", "X", "X",   "X",   "X",    "X",    "X",   "X"
                  )

fwrite(df, file_path)

# Note SNPs that will be removed as they are greater than chromosome length (as
# character)
chr_max_bp <- 130000
df$BP > chr_max_bp
#> [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE

# Read
result <- format_sumstats(file_path, ref_genome = "GRCh37")
#> 
#> 
#> ******::NOTE::******
#>  - Formatted results will be saved to `tempdir()` by default.
#>  - This means all formatted summary stats will be deleted upon ending the R session.
#>  - To keep formatted summary stats, change `save_path`  ( e.g. `save_path=file.path('./formatted',basename(path))` ),   or make sure to copy files elsewhere after processing  ( e.g. `file.copy(save_path, './formatted/' )`.
#>  ********************
#> Formatted summary statistics will be saved to ==>  /tmp/Rtmpm83GD3/fileb5f4667511bc.tsv.gz
#> Warning: replacing previous import 'utils::findMatches' by
#> 'S4Vectors::findMatches' when loading 'SNPlocs.Hsapiens.dbSNP155.GRCh37'
#> Reading header.
#> Tabular format detected.
#> Importing tabular file: /tmp/Rtmpm83GD3/fileb5f419a9331
#> Checking for empty columns.
#> Standardising column headers.
#> First line of summary statistics file:
#> CHR  BP  A1  A2  FRQ BETA    SE  P   N   
#> Summary statistics report:
#>    - 6 rows
#>    - 5 genome-wide significant variants (P<5e-8)
#>    - 2 chromosomes
#> Checking for multi-GWAS.
#> Checking for multiple RSIDs on one row.
#> Checking for merged allele column.
#> Checking A1 is uppercase
#> Checking A2 is uppercase
#> Checking for incorrect base-pair positions
#> 3 SNPs have been removed as their BP column is not in the range of 1 to the length of the chromosome
#> Loading SNPlocs data.
#> There is no SNP column found within the data. It must be inferred from other column information.
#> Ensuring all SNPs are on the reference genome.
#> Loading SNPlocs data.
#> Loading reference genome data.
#> Preprocessing RSIDs.
#> Validating RSIDs of 0 SNPs using BSgenome::snpsById...
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
#>     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
#>     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
#>     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#>     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
#>     table, tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#> 
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:data.table':
#> 
#>     first, second
#> The following object is masked from 'package:utils':
#> 
#>     findMatches
#> The following objects are masked from 'package:base':
#> 
#>     expand.grid, I, unname
#> 
#> Attaching package: 'IRanges'
#> The following object is masked from 'package:data.table':
#> 
#>     shift
#> BSgenome::snpsById done in 1 seconds.
#> Checking for correct direction of A1 (reference) and A2 (alternative allele).
#> Checking for missing data.
#> Checking for duplicate columns.
#> Warning in max(as.numeric(gsub(".*-", "", sumstats_dt$P)), na.rm = TRUE): no
#> non-missing arguments to max; returning -Inf
#> Ensuring that the N column is all integers.
#> The sumstats N column is not all integers, this could effect downstream analysis. These will be converted to integers.
#> Checking for duplicate SNPs from SNP ID.
#> Checking for SNPs with duplicated base-pair positions.
#> INFO column not available. Skipping INFO score filtering step.
#> Filtering SNPs, ensuring SE>0.
#> Ensuring all SNPs have N<5 std dev above mean.
#> Removing 'chr' prefix from CHR.
#> Making X/Y/MT CHR uppercase.
#> Checking for bi-allelic SNPs.
#> N already exists within sumstats_dt.
#> Sorting coordinates with 'data.table'.
#> Writing in tabular format ==> /tmp/Rtmpm83GD3/fileb5f4667511bc.tsv.gz
#> Summary statistics report:
#>    - 0 rows (0% of original 6 rows)
#>    - 0 unique variants
#>    - 0 genome-wide significant variants (P<5e-8)
#>    - 0 chromosomes
#> Done munging in 0.976 minutes.
#> Successfully finished preparing sumstats file, preview:
#> Reading header.
#> Warning in lapply(.SD, as.numeric): NAs introduced by coercion
#> Warning in lapply(.SD, as.numeric): NAs introduced by coercion

#> Warning in lapply(.SD, as.numeric): NAs introduced by coercion

#> Warning in lapply(.SD, as.numeric): NAs introduced by coercion
#>    SNP CHR BP A1 A2 FRQ BETA SE  P N
#> 1: SNP CHR BP A1 A2  NA   NA NA NA N
#> Returning path to saved data.
Al-Murphy commented 1 year ago

Hey! So a lot of the checks for correct data types are done throughout MSS so it isn't necessary to add a check data type function at the start. Like you mention however, there is an issue in check_bp_range() where the BP column, if not an integer, will cause a fail. I have updated this and pushed the change in v1.9.13. This seems to allow the example you gave to run as expected so I don't think any other alterations are needed but let me know if you find some other anomalies and I'll amend these too (and reopen this issue if you do find some). See below for the example, I've added an extra SNP which can be found in the reference file so something will be returned:

library(data.table)
# Example data - presence of row of 'X's means these will all be read into R as
# type character
file_path <- tempfile()

df <- tibble::tribble(
  ~CHR,    ~BP, ~A1, ~A2,  ~FRQ, ~BETA,    ~SE,     ~P,    ~N,
  "1",    "1", "C", "G", "0.1", "0.1", "0.01", "0.05", "100",
  "1", "1000", "G", "C", "0.1", "0.1", "0.01", "0.05", "100",
  "1", "2000", "G", "C", "0.1", "0.1", "0.01", "0.05", "100",
  "1",    "3", "A", "T", "0.1", "0.1", "0.01", "0.05", "100",
  "1",    "4", "G", "C", "0.1", "0.1", "0.01", "0.05", "100",
  "1","8490603", "T","C","0.17910","0.019","0.003","0.05","100",      
  "X",    "X", "X", "X",   "X",   "X",    "X",    "X",   "X"
)

fwrite(df, file_path)
fread(file_path)

result <- format_sumstats(file_path, ref_genome = "GRCh37",dbSNP=144)

******::NOTE::******
 - Formatted results will be saved to `tempdir()` by default.
 - This means all formatted summary stats will be deleted upon ending the R session.
 - To keep formatted summary stats, change `save_path`  ( e.g. `save_path=file.path('./formatted',basename(path))` ),   or make sure to copy files elsewhere after processing  ( e.g. `file.copy(save_path, './formatted/' )`.
 ******************** 

Formatted summary statistics will be saved to ==>  /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmpq79fRN/fileec0316f4f0f0.tsv.gz
Reading header.
Tabular format detected.
Importing tabular file: /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmpq79fRN/fileec031694d3d6
Checking for empty columns.
Standardising column headers.
First line of summary statistics file: 
CHR BP  A1  A2  FRQ BETA    SE  P   N   
Summary statistics report:
   - 7 rows
   - 6 genome-wide significant variants (P<5e-8)
   - 2 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Checking for incorrect base-pair positions
Coercing BP column to numeric.
1 SNPs have been removed as their BP column is not in the range of 1 to the length of the chromosome
Loading SNPlocs data.
There is no SNP column found within the data. It must be inferred from other column information.
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 2 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
Checking for missing data.
Checking for duplicate columns.
Ensuring that the N column is all integers.
The sumstats N column is not all integers, this could effect downstream analysis. These will be converted to integers.
Checking for duplicate SNPs from SNP ID.
Checking for SNPs with duplicated base-pair positions.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Checking for bi-allelic SNPs.
N already exists within sumstats_dt.
Sorting coordinates with 'data.table'.
Writing in tabular format ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmpq79fRN/fileec0316f4f0f0.tsv.gz
Summary statistics report:
   - 1 rows (14.3% of original 7 rows)
   - 1 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 1 chromosomes
Done munging in 0.059 minutes.
Successfully finished preparing sumstats file, preview:
Reading header.
        SNP CHR      BP A1 A2    FRQ  BETA    SE    P   N
1: rs301800   1 8490603  T  C 0.1791 0.019 0.003 0.05 100
Returning path to saved data.
Warning message:
In eval(jsub, SDenv, parent.frame()) : NAs introduced by coercion
rmgpanw commented 1 year ago

That's great, thank you! V small point, but the output still shows 6 genome-wide significant variants..., so perhaps coercing to type numeric could occur before that message?

Al-Murphy commented 1 year ago

Ah yes, missed that - updated in v1.9.14