Closed mykmal closed 1 year ago
Hey, thanks for digging into this - so there is other cases checked for in check_no_rs_snp
that would cause the IMPUTATION_SNP
column to be added to sumstats_dt
so I don't want to fully remove the step of adding it to miss_rs_chr_bp
. So I think it's best just to add a check to see if the column is present in sumstats_dt
first. See below for running your example data with the updated version:
> MungeSumstats::format_sumstats(path = "~/Downloads/sample.txt",
+ ref_genome = "GRCH37",
+ imputation_ind = TRUE,
+ return_data = TRUE)
******::NOTE::******
- Formatted results will be saved to `tempdir()` by default.
- This means all formatted summary stats will be deleted upon ending the R session.
- To keep formatted summary stats, change `save_path` ( e.g. `save_path=file.path('./formatted',basename(path))` ), or make sure to copy files elsewhere after processing ( e.g. `file.copy(save_path, './formatted/' )`.
********************
Formatted summary statistics will be saved to ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmpq79fRN/fileec037f235f2d.tsv.gz
Importing tabular file: ~/Downloads/sample.txt
Checking for empty columns.
Standardising column headers.
First line of summary statistics file:
SNP CHR BP A2 A1 FRQ FRQSE FRQMIN FRQMAX BETA SE P DIRECTION N
Summary statistics report:
- 10 rows
- 10 unique variants
- 0 genome-wide significant variants (P<5e-8)
- 4 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
5 SNP IDs appear to be made up of chr:bp, these will be replaced by their SNP ID from the reference genome
Loading SNPlocs data.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Checking for incorrect base-pair positions
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 10 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 75 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 3 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
Checking for SNPs with duplicated base-pair positions.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Checking for bi-allelic SNPs.
4 SNPs are non-biallelic. These will be removed.
1 SNPs (16.7%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, AF_EFF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Sorting coordinates with 'data.table'.
Writing in tabular format ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmpq79fRN/fileec037f235f2d.tsv.gz
Summary statistics report:
- 6 rows (60% of original 10 rows)
- 6 unique variants
- 0 genome-wide significant variants (P<5e-8)
- 4 chromosomes
Done munging in 1.269 minutes.
Successfully finished preparing sumstats file, preview:
Reading header.
SNP CHR BP A1 A2 FRQ FRQSE FRQMIN FRQMAX BETA SE P DIRECTION N IMPUTATION_SNP flipped
1: rs1419981255 1 106365417 C T 0.0062 0.0019 0.0032 0.0074 0.0451 0.1491 0.7623 ++ 291 TRUE NA
2: rs1301730354 2 52080491 G A 0.0119 0.0009 0.0110 0.0129 -0.0160 0.1026 0.8760 -+ 291 TRUE NA
3: rs1231570724 3 81770380 A T 0.0034 0.0002 0.9963 0.9968 -0.2256 0.2415 0.3501 ++ 291 TRUE TRUE
4: rs10000078 4 81654981 A G 0.2148 0.0020 0.7831 0.7871 0.0224 0.0276 0.4184 -- 291 NA TRUE
Returning data directly.
SNP CHR BP A1 A2 FRQ FRQSE FRQMIN FRQMAX BETA SE P DIRECTION N IMPUTATION_SNP flipped
1: rs1419981255 1 106365417 C T 0.0062 0.0019 0.0032 0.0074 0.0451 0.1491 0.7623 ++ 291 TRUE NA
2: rs1301730354 2 52080491 G A 0.0119 0.0009 0.0110 0.0129 -0.0160 0.1026 0.8760 -+ 291 TRUE NA
3: rs1231570724 3 81770380 A T 0.0034 0.0002 0.9963 0.9968 -0.2256 0.2415 0.3501 ++ 291 TRUE TRUE
4: rs10000078 4 81654981 A G 0.2148 0.0020 0.7831 0.7871 0.0224 0.0276 0.4184 -- 291 NA TRUE
5: rs10000082 4 167310192 C T 0.0418 0.0027 0.0387 0.0441 -0.0067 0.0568 0.9060 +- 291 NA NA
6: rs10000076 4 183288360 T G 0.9759 0.0019 0.0221 0.0258 -0.0024 0.0761 0.9750 +- 291 NA TRUE
This is available in v1.9.15, let me know if there are any other issues!
1. Bug description
The
format_sumstats()
function crashes when theimputation_ind
parameter is set to TRUE and the GWAS summary statistics file contains some SNP identifiers coded as chr:bp.I narrowed down the issue to lines 275-276 in
check_no_rs_snp.R
and insertedbrowser()
right before that to investigate the cause. It turns out that at this point in the code, themiss_rs_chr_bp
data table has anIMPUTATION_SNP
column while thesumstats_dt
data table does not have that column. Thus, the two data tables cannot be concatenated throughrbindlist()
.This should be an easy fix: either adding the
IMPUTATION_SNP
column tosumstats_dt
or removing it frommiss_rs_chr_bp
. Since theIMPUTATION_SNP
column gets added tomiss_rs_chr_bp
right before the function errors out and only contains NAs at this point (see lines 269-272), I feel that perhaps it should not be added in the first place. However, I am not familiar with all the logic incheck_no_rs_snp()
so I decided to open a bug report and let you identify the best solution.2. Reproducible example
Code
Here is an example demonstrating the error message:
This shows that the issue is caused by the
IMPUTATION_SNP
column being present inmiss_rs_chr_bp
but not insumstats_dt
:Data
Here is the
sample.txt
file used in my example above, which I created by extracting 10 lines from a real GWAS summary statistics file:3. Session info