Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
75 stars 16 forks source link

ukb-b-14043: 0 variants remaining #75

Closed bschilder closed 2 years ago

bschilder commented 2 years ago

This is a weird one, but when using import_sumstats to import "ukb-b-14043", the intiial report in the logs indicates there's 0 SNPs (possibly not recognizing the SNP col yet? or it has to be imputed?). Then further along, it detects 3M SNPs. Then by the end it goes back down to 0. MungeSumstats_log_msg.txt

bschilder commented 2 years ago

I think this might be related to our issue with MAF being interpreted as INFO. I'll rerun and confirm.

Al-Murphy commented 2 years ago

Cool, let me know!

bschilder commented 2 years ago

MungeSumstats log file

Using local VCF.
File already tabix-indexed.
Finding empty VCF columns based on first 10,000 rows.
Dropping 1 duplicate columns.
1 sample detected: UKB-b-14043
Constructing ScanVcfParam object.
VCF contains: 9,851,866 variant(s) x 1 sample(s)
Reading VCF file: multi-threaded (50 threads)
Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Dropping 1 duplicate columns.

Renaming ID as SNP.
VCF file has -log10 P-values; these will be converted to unadjusted p-values in the 'P' column.
No INFO (SI) column detected.
Standardising column headers.
First line of summary statistics file: 
SNP chr BP  end REF ALT FILTER  AF  ES  SE  LP  P   
Summary statistics report:
   - 3,291,295 rows
   - 3,290,823 unique variants
   - 19 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Inferring genome build.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 10,000 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 152 seconds.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 10,000 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 168 seconds.
Inferred genome build: GRCH37
Checking SNP RSIDs.
263 SNP IDs are not correctly formatted. These will be corrected from the reference genome.
Loading SNPlocs data.
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/ukb-b-14043/logs/snp_not_found_from_chr_bp.tsv.gz
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 3,290,820 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 174 seconds.
1,566 SNPs are not on the reference genome. These will be corrected from the reference genome.
Loading SNPlocs data.
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/ukb-b-14043/logs/snp_not_found_from_chr_bp_2.tsv.gz
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 3,290,454 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 172 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 10 SNPs where neither A1 nor A2 match the reference genome. These will be removed.
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/ukb-b-14043/logs/alleles_dont_match_ref_gen.tsv.gz
There are 10 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
474 RSIDs are duplicated in the sumstats file. These duplicates will be removed
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/ukb-b-14043/logs/dup_snp_id.tsv.gz
Checking for SNPs with duplicated base-pair positions.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y CHR uppercase.
Checking for bi-allelic SNPs.
1,912,290 SNPs are non-biallelic. These will be removed.
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/ukb-b-14043/logs/snp_bi_allelic.tsv.gz
Warning: When method is an integer, must be >0.
310,817 SNPs (22.6%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted  summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Could not recognize genome build of:
 - target_genome
These will be inferred from the data.
Sorting coordinates.
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/ukb-b-14043/ukb-b-14043.tsv.gz
Summary statistics report:
   - 1,378,154 rows (41.9% of original 3,291,295 rows)
   - 1,378,154 unique variants
   - 11 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Successfully finished preparing sumstats file, preview:
Reading header.
          SNP CHR     BP A1 A2    END FILTER      FRQ         BETA          SE         LP          P
1:  rs2462492   1  54676  C  T  54676   PASS 0.400339 -6.53664e-04 0.000314002 1.43180000 0.03699985
2:  rs6680723   1 534192  C  T 534192   PASS 0.240921 -2.54691e-04 0.000353517 0.32790200 0.47000015
3: rs12029736   1 706368  A  G 706368   PASS 0.515732  6.11875e-06 0.000219453 0.00877392 0.98000001
4:  rs3115847   1 763394  G  A 763394   PASS 0.706822 -2.42389e-04 0.000257204 0.45593200 0.34999996

Seems to be doing much better with this iteration of MSS!

Explanation

I think this was originally due to how the INFO column used to be incorrectly parsed, which has since been fixed: 3,274,981 SNPs are below the INFO threshold of 0.9 and will be removed.

This has since been fixed: #100