Closed rmgpanw closed 1 year ago
Hey! A lot of the data type checks are done throughout MSS, so it isn't necessary to add a data type check function at the start. As you mention, however, there is an issue in check_bp_range() where the BP column, if not an integer, causes a failure. I have fixed this and pushed the change in v1.9.13. This seems to allow the example you gave to run as expected, so I don't think any other alterations are needed, but let me know if you find any other anomalies and I'll amend those too (and reopen this issue if you do). See below for the example; I've added an extra SNP that can be found in the reference file so that something will be returned:
library(data.table)
library(MungeSumstats)
# Example data - presence of row of 'X's means these will all be read into R as
# type character
file_path <- tempfile()
df <- tibble::tribble(
~CHR, ~BP, ~A1, ~A2, ~FRQ, ~BETA, ~SE, ~P, ~N,
"1", "1", "C", "G", "0.1", "0.1", "0.01", "0.05", "100",
"1", "1000", "G", "C", "0.1", "0.1", "0.01", "0.05", "100",
"1", "2000", "G", "C", "0.1", "0.1", "0.01", "0.05", "100",
"1", "3", "A", "T", "0.1", "0.1", "0.01", "0.05", "100",
"1", "4", "G", "C", "0.1", "0.1", "0.01", "0.05", "100",
"1", "8490603", "T", "C", "0.17910", "0.019", "0.003", "0.05", "100",
"X", "X", "X", "X", "X", "X", "X", "X", "X"
)
fwrite(df, file_path)
fread(file_path)
result <- format_sumstats(file_path, ref_genome = "GRCh37", dbSNP = 144)
******::NOTE::******
- Formatted results will be saved to `tempdir()` by default.
- This means all formatted summary stats will be deleted upon ending the R session.
- To keep formatted summary stats, change `save_path` ( e.g. `save_path=file.path('./formatted',basename(path))` ), or make sure to copy files elsewhere after processing ( e.g. `file.copy(save_path, './formatted/' )`.
********************
Formatted summary statistics will be saved to ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmpq79fRN/fileec0316f4f0f0.tsv.gz
Reading header.
Tabular format detected.
Importing tabular file: /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmpq79fRN/fileec031694d3d6
Checking for empty columns.
Standardising column headers.
First line of summary statistics file:
CHR BP A1 A2 FRQ BETA SE P N
Summary statistics report:
- 7 rows
- 6 genome-wide significant variants (P<5e-8)
- 2 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Checking for incorrect base-pair positions
Coercing BP column to numeric.
1 SNPs have been removed as their BP column is not in the range of 1 to the length of the chromosome
Loading SNPlocs data.
There is no SNP column found within the data. It must be inferred from other column information.
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 2 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
Checking for missing data.
Checking for duplicate columns.
Ensuring that the N column is all integers.
The sumstats N column is not all integers, this could effect downstream analysis. These will be converted to integers.
Checking for duplicate SNPs from SNP ID.
Checking for SNPs with duplicated base-pair positions.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Checking for bi-allelic SNPs.
N already exists within sumstats_dt.
Sorting coordinates with 'data.table'.
Writing in tabular format ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmpq79fRN/fileec0316f4f0f0.tsv.gz
Summary statistics report:
- 1 rows (14.3% of original 7 rows)
- 1 unique variants
- 0 genome-wide significant variants (P<5e-8)
- 1 chromosomes
Done munging in 0.059 minutes.
Successfully finished preparing sumstats file, preview:
Reading header.
SNP CHR BP A1 A2 FRQ BETA SE P N
1: rs301800 1 8490603 T C 0.1791 0.019 0.003 0.05 100
Returning path to saved data.
Warning message:
In eval(jsub, SDenv, parent.frame()) : NAs introduced by coercion
That's great, thank you! Very small point, but the output still shows 6 genome-wide significant variants..., so perhaps coercing to type numeric could occur before that message?
Ah yes, missed that - updated in v1.9.14
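For anyone reading later, here is a minimal base R sketch of why the count was probably off before coercion. It is an assumption about the mechanism, not a copy of the MungeSumstats code: when P is character, `P < 5e-8` becomes a string comparison, so "0.05" sorts before "5e-08" and is counted as significant. The variable names are illustrative only.

```r
# P values as they would arrive if every column were read as character
# (six real rows plus the row of 'X's from the example above)
p_chr <- c(rep("0.05", 6), "X")

# Comparing character with numeric makes R coerce the number to a string,
# so this is a string comparison against "5e-08", not a numeric one
n_sig_chr <- sum(p_chr < 5e-8)

# Coercing first gives a numeric comparison; "X" becomes NA and is ignored
p_num <- suppressWarnings(as.numeric(p_chr))
n_sig_num <- sum(p_num < 5e-8, na.rm = TRUE)

print(c(character = n_sig_chr, numeric = n_sig_num))
```

Coercing before the summary report is generated avoids the misleading count, which matches the ordering suggested above.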
1. Bug description
I have recently tried using format_sumstats() on a file where the header was repeated in a handful of rows. This resulted in all columns being read into R as type character. The logs showed that the majority of rows had been removed due to BP being greater than the maximum expected value for each chromosome.
Expected behaviour
I eventually worked out the issue with my summary stats file, but I think it would be helpful if format_sumstats() could check that column types are as expected, and either coerce to type numeric or raise an error. Perhaps before this line in check_bp_range()?
Thanks for considering.
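To illustrate the kind of check I have in mind, here is a rough sketch using data.table. `coerce_numeric_cols()` is a hypothetical helper and not part of MungeSumstats; the default column list and the `on_fail` argument are my own assumptions. It coerces the columns that should be numeric and then either drops the rows that fail to convert or raises an error.

```r
library(data.table)

# Hypothetical helper, not part of MungeSumstats: coerce columns that should
# be numeric and either drop or reject rows that cannot be converted.
coerce_numeric_cols <- function(dt,
                                cols = c("BP", "FRQ", "BETA", "SE", "P", "N"),
                                on_fail = c("drop", "error")) {
  on_fail <- match.arg(on_fail)
  dt <- copy(dt)                      # avoid modifying the caller's table
  cols <- intersect(cols, names(dt))  # only touch columns that are present
  bad <- rep(FALSE, nrow(dt))
  for (col in cols) {
    if (!is.numeric(dt[[col]])) {
      converted <- suppressWarnings(as.numeric(dt[[col]]))
      # A row is bad if it held a value that could not be parsed as a number
      bad <- bad | (is.na(converted) & !is.na(dt[[col]]))
      set(dt, j = col, value = converted)
    }
  }
  if (any(bad)) {
    if (on_fail == "error") {
      stop(sum(bad), " row(s) contain non-numeric values in: ",
           paste(cols, collapse = ", "))
    }
    message(sum(bad), " row(s) removed due to non-numeric values.")
    dt <- dt[which(!bad)]
  }
  dt
}

# Toy input with a repeated header row, so every column is character
sumstats <- data.table(
  CHR = c("1", "1", "CHR"),
  BP  = c("1000", "8490603", "BP"),
  P   = c("0.05", "1e-9", "P")
)
cleaned <- coerce_numeric_cols(sumstats)
str(cleaned)
```

With `on_fail = "error"` the same input stops with a message instead of silently dropping the repeated header row, which would have pointed me at the problem in my file much sooner.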
2. Reproducible example