abigailterkuile closed this issue 2 years ago
Hi,
Thanks for posting the issue. MSS should be able to handle a mixture of RS IDs and CHR:BP:A2:A1.
Can you upload a small summary statistics file that replicates this issue (or a link to the full summary statistics)? I can't replicate the issue with the example SNPs you list:
CHR BP SNP A1 A2 P beta
1 701203 chr1:701203:G:T T G 0.8 1.2
1 710225 rs185127847 A T 0.65 .001
1 722408 chr1:722408:G:C C G 0.45 .01233
It runs fine:
> sumstats <- data.table::fread("~/Downloads/test.txt")
> reformatted <-
+ MungeSumstats::format_sumstats(sumstats,
+ ref_genome="GRCh38")
******::NOTE::******
- Formatted results will be saved to `tempdir()` by default.
- This means all formatted summary stats will be deleted upon ending the R session.
- To keep formatted summary stats, change `save_path` ( e.g. `save_path=file.path('./formatted',basename(path))` ), or make sure to copy files elsewhere after processing ( e.g. `file.copy(save_path, './formatted/' )`.
********************
Formatted summary statistics will be saved to ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmp6XstOa/file2cb295c2601.tsv.gz
Standardising column headers.
First line of summary statistics file:
CHR BP SNP A1 A2 P beta
Summary statistics report:
- 3 rows
- 3 unique variants
- 0 genome-wide significant variants (P<5e-8)
- 1 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
2 SNP IDs are not correctly formatted. These will be corrected from the reference genome.
Loading SNPlocs data.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 2 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 12 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 2 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
Checking for SNPs with duplicated base-pair positions.
INFO column not available. Skipping INFO score filtering step.
SE is not present but can be imputed with BETA & P. Set impute_se=TRUE and rerun to do this.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y CHR uppercase.
Checking for bi-allelic SNPs.
Warning: When method is an integer, must be >0.
Could not recognize genome build of:
- target_genome
These will be inferred from the data.
Sorting coordinates.
Writing in tabular format ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmp6XstOa/file2cb295c2601.tsv.gz
Summary statistics report:
- 2 rows (66.7% of original 3 rows)
- 2 unique variants
- 0 genome-wide significant variants (P<5e-8)
- 1 chromosomes
Successfully finished preparing sumstats file, preview:
Reading header.
SNP CHR BP A1 A2 P BETA
1: rs185127847 1 710225 T A 0.65 -0.00100
2: rs760310201 1 722408 G C 0.45 -0.01233
Returning path to saved data.
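The log above reports that SNPs whose A1 does not match the reference genome "will be flipped with their effect columns". As a rough illustration of what such a flip involves, here is a minimal base R sketch using the rs185127847 row from the example. `flip_to_reference` is a hypothetical helper, not a MungeSumstats function, and the reference allele `T` is an assumption read off the flipped output rather than looked up in a reference genome.

```r
# One row from the example input: A1 = A, A2 = T, beta = 0.001.
sumstats <- data.frame(
  SNP = "rs185127847", CHR = 1L, BP = 710225L,
  A1 = "A", A2 = "T", P = 0.65, BETA = 0.001,
  stringsAsFactors = FALSE
)

# Assumed reference allele for this SNP (inferred from the formatted output).
ref_allele <- "T"

# Hypothetical helper: where A2 (not A1) equals the reference allele,
# swap A1 and A2 and negate the effect size.
flip_to_reference <- function(dt, ref) {
  flip <- dt$A1 != ref & dt$A2 == ref
  old_a1 <- dt$A1
  dt$A1[flip] <- dt$A2[flip]
  dt$A2[flip] <- old_a1[flip]
  dt$BETA[flip] <- -dt$BETA[flip]
  dt
}

flipped <- flip_to_reference(sumstats, ref_allele)
print(flipped)
```

Under these assumptions the row ends up with A1 = T, A2 = A and a negated beta, which is consistent with the preview in the log. The package itself may flip additional columns, which this sketch does not attempt.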
Thanks for the quick response. Here's a small version of the summary statistics file: GAD_1000_snps.txt
Thanks for that. However, it still runs without issue for me. Could you try installing the latest version of MSS from GitHub to see if that helps?
Here is my run output:
> sumstats <- data.table::fread("~/Downloads/GAD_1000_snps.txt")
> reformatted <-
+ MungeSumstats::format_sumstats(sumstats,
+ ref_genome="GRCh38")
******::NOTE::******
- Formatted results will be saved to `tempdir()` by default.
- This means all formatted summary stats will be deleted upon ending the R session.
- To keep formatted summary stats, change `save_path` ( e.g. `save_path=file.path('./formatted',basename(path))` ), or make sure to copy files elsewhere after processing ( e.g. `file.copy(save_path, './formatted/' )`.
********************
Formatted summary statistics will be saved to ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmp6XstOa/file2cb2aa2becc.tsv.gz
Standardising column headers.
First line of summary statistics file:
CHR BP SNP A1 A2 A1FREQ INFO N BETA SE P
Summary statistics report:
- 999 rows
- 999 unique variants
- 0 genome-wide significant variants (P<5e-8)
- 1 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
273 SNP IDs are not correctly formatted. These will be corrected from the reference genome.
Loading SNPlocs data.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 986 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 12 seconds.
49 SNPs are not on the reference genome. These will be corrected from the reference genome.
Loading SNPlocs data.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 939 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 12 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 940 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
1 RSIDs are duplicated in the sumstats file. These duplicates will be removed
Checking for SNPs with duplicated base-pair positions.
Filtering SNPs based on INFO score.
702 SNPs are below the INFO threshold of 0.9 and will be removed.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y CHR uppercase.
Checking for bi-allelic SNPs.
11 SNPs are non-biallelic. These will be removed.
N already exists within sumstats_dt.
195 SNPs (86.3%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Could not recognize genome build of:
- target_genome
These will be inferred from the data.
Sorting coordinates.
Writing in tabular format ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmp6XstOa/file2cb2aa2becc.tsv.gz
Summary statistics report:
- 226 rows (22.6% of original 999 rows)
- 226 unique variants
- 0 genome-wide significant variants (P<5e-8)
- 1 chromosomes
Successfully finished preparing sumstats file, preview:
Reading header.
SNP CHR BP A1 A2 FRQ INFO N BETA SE P
1: rs138388092 1 802843 T C 0.9828785 0.903520 20452 -0.1761780 0.1699330 0.2998513
2: rs12562034 1 833068 G A 0.8941450 0.999894 20452 0.0160558 0.0694009 0.8170437
3: rs151160018 1 841176 C T 0.9918606 0.903191 20452 -0.1062280 0.2545640 0.6764629
4: rs112618790 1 841852 C T 0.9103400 0.939107 20452 0.0328076 0.0775737 0.6723525
Returning path to saved data.
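To make a few of the numbers in this log easier to follow, here is a small base R sketch that applies the same kind of INFO cut-off and FRQ check to the four preview rows. It only illustrates the arithmetic; it is not how MungeSumstats implements these steps, and the variable names are made up for the example.

```r
# The four preview rows from the formatted output (SNP, FRQ and INFO only).
preview <- data.frame(
  SNP  = c("rs138388092", "rs12562034", "rs151160018", "rs112618790"),
  FRQ  = c(0.9828785, 0.8941450, 0.9918606, 0.9103400),
  INFO = c(0.903520, 0.999894, 0.903191, 0.939107),
  stringsAsFactors = FALSE
)

# Rows below the INFO threshold reported in the log (0.9) would be dropped.
info_threshold <- 0.9
kept <- preview[preview$INFO >= info_threshold, ]

# Count rows whose FRQ exceeds 0.5, the condition the log flags.
n_frq_above_half <- sum(kept$FRQ > 0.5)

# Reproduce the percentages printed in the log from its own counts.
pct_rows_retained  <- round(100 * 226 / 999, 1)  # rows kept out of 999
pct_frq_above_half <- round(100 * 195 / 226, 1)  # FRQ > 0.5 among kept rows

print(kept)
```

All four preview rows pass the INFO cut-off, as expected for rows that survived filtering, and all four have FRQ above 0.5, which is why the `frq_is_maf` note appears.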
Updating the package to the latest version from GitHub resolved the issue. Thanks very much, Alan!
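For anyone landing here with the same error, a quick way to see which version is installed before updating might look like the sketch below. `is_outdated` is a hypothetical helper, the version strings passed to it are placeholders rather than the versions involved in this issue, and the repository owner in the install comment is left for you to fill in.

```r
# Hypothetical helper: TRUE when `installed` is older than `minimum`.
is_outdated <- function(installed, minimum) {
  utils::compareVersion(as.character(installed), as.character(minimum)) < 0
}

# Report the installed version, if the package is available at all.
if (requireNamespace("MungeSumstats", quietly = TRUE)) {
  message("Installed MungeSumstats version: ",
          as.character(utils::packageVersion("MungeSumstats")))
} else {
  message("MungeSumstats is not installed.")
}

# Placeholder version strings, only to show how the comparison behaves.
is_outdated("1.6.0", "1.8.0")

# To install the development version from GitHub (repository owner not stated
# in this thread, so it is left as a placeholder):
# remotes::install_github("<owner>/MungeSumstats")
```

Restart the R session after updating so the new version is the one that gets loaded.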
Error in split out chr:bp (check_no_rs_snp)
Hi there,
I have GWAS summary statistics with a mixture of rsIDs and CHR:BP:A2:A1 IDs, e.g.:
CHR BP SNP A1 A2
1 701203 chr1:701203:G:T T G
1 710225 rs185127847 A T
1 722408 chr1:722408:G:C C G
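For reference, the two ID styles in this SNP column can be told apart and split with a few lines of base R. This is only a sketch of the idea: `parse_snp_ids` is a hypothetical helper, not the `check_no_rs_snp` function from the package, and it assumes non-rsID entries follow the CHR:BP:A2:A1 layout exactly.

```r
# SNP column from the example: a mix of rsIDs and chr:bp:a2:a1 identifiers.
snp <- c("chr1:701203:G:T", "rs185127847", "chr1:722408:G:C")

# Hypothetical helper: flag rsIDs and split the remaining IDs into
# CHR, BP, A2 and A1. IDs that do not have four fields are left as NA.
parse_snp_ids <- function(snp) {
  is_rsid <- grepl("^rs[0-9]+$", snp)
  out <- data.frame(
    SNP = snp, is_rsid = is_rsid,
    CHR = NA_character_, BP = NA_integer_,
    A2 = NA_character_, A1 = NA_character_,
    stringsAsFactors = FALSE
  )
  idx <- which(!is_rsid)
  parts <- strsplit(snp[idx], ":", fixed = TRUE)
  for (i in seq_along(idx)) {
    p <- parts[[i]]
    if (length(p) != 4) next
    out$CHR[idx[i]] <- sub("^chr", "", p[1], ignore.case = TRUE)
    out$BP[idx[i]]  <- as.integer(p[2])
    out$A2[idx[i]]  <- p[3]
    out$A1[idx[i]]  <- p[4]
  }
  out
}

parsed <- parse_snp_ids(snp)
print(parsed)
```

The parsed CHR, BP, A1 and A2 values should agree with the dedicated columns in the table, which is a handy sanity check before handing the file to `format_sumstats`.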
I get an error message when running format_sumstats:
Code
Console output
Session info