Closed bschilder closed 2 years ago
Quick update:
Tried replacing the SNP col with a dummy col of row numbers.
dat <- data.table::fread("https://ctg.cncr.nl/documents/p1651/PGCALZ2sumstatsExcluding23andMe.txt.gz")
dat[,SNP:=seq_len(nrow(dat))]
tmp <- file.path(tempdir(),basename(path))
data.table::fwrite(dat, tmp, sep="\t", nThread = 50)
but this produces a different error when running it through format_sumstats
:
data/GWAS_sumstats/PGC-ALZ2/logs/MungeSumstats_log_msg.txt
Reading header.
Tabular format detected.
Importing tabular file: /tmp/Rtmp061dc1/PGCALZ2sumstatsExcluding23andMe.txt.gz
|--------------------------------------------------|
|==================================================|
Standardising column headers.
First line of summary statistics file:
chr PosGRCh37 testedAllele otherAllele z p N SNP
Summary statistics report:
- 12,688,339 rows
- 12,688,339 unique variants
- 3,570 genome-wide significant variants (P<5e-8)
- 22 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Ensuring all SNPs are on the reference genome.
Loading reference genome data.
Error in seq_len(nrow(x))[-irows] :
only 0's may be mixed with negative subscripts
However, when I make all dummy RSIDs characters by adding the prefix "snp", the pipeline runs to completion:
dat[,SNP:=paste0("snp",seq_len(nrow(dat)))]
data/GWAS_sumstats/PGC-ALZ2/logs/MungeSumstats_log_msg.txt
Reading header.
Tabular format detected.
Importing tabular file: /tmp/Rtmp061dc1/PGCALZ2sumstatsExcluding23andMe.txt.gz
|--------------------------------------------------|
|==================================================|
|--------------------------------------------------|
|==================================================|
Standardising column headers.
First line of summary statistics file:
chr PosGRCh37 testedAllele otherAllele z p N SNP
Summary statistics report:
- 12,688,339 rows
- 12,688,339 unique variants
- 3,570 genome-wide significant variants (P<5e-8)
- 22 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Ensuring all SNPs are on the reference genome.
Loading reference genome data.
12,688,339 SNPs are not on the reference genome. These will be corrected from the reference genome.
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/PGC-ALZ2/logs/snp_not_found_from_chr_bp.tsv.gz
Loading reference genome data.
There is no A1 or A2 allele information column found within the data. It must be inferred from other column information.
One of A1/A2 are missing, allele flipping will be tested
Deriving A2 from reference genome
WARNING: Inferring the alternative allele (A2) from the reference genome. In some instances, there are more than one
alternative allele. Arbitrarily, only the first will be kept. See column `alt_alleles` in your returned sumstats file
for all alternative alleles.
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/PGC-ALZ2/logs/alleles_not_found_from_snp.tsv.gz
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 877,359 SNPs where neither A1 nor A2 match the reference genome. These will be removed.
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/PGC-ALZ2/logs/alleles_dont_match_ref_gen.tsv.gz
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
20,228 RSIDs are duplicated in the sumstats file. These duplicates will be removed
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/PGC-ALZ2/logs/dup_snp_id.tsv.gz
Checking for SNPs with duplicated base-pair positions.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y CHR uppercase.
Checking for bi-allelic SNPs.
343,205 SNPs are non-biallelic. These will be removed.
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/PGC-ALZ2/logs/snp_bi_allelic.tsv.gz
N already exists within sumstats_dt.
Could not recognize genome build of:
- target_genome
These will be inferred from the data.
Sorting coordinates.
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/PGC-ALZ2.tsv.gz
Summary statistics report:
- 11,158,539 rows (87.9% of original 12,688,339 rows)
- 11,151,777 unique variants
- 2,432 genome-wide significant variants (P<5e-8)
- 22 chromosomes
Successfully finished preparing sumstats file, preview:
Reading header.
SNP CHR BP A1 A2 TESTEDALLELE Z P N alt_alleles
1: rs533499096 1 14860 G A T -1.27587418 0.2020 74004 A
2: rs373918278 1 17765 G A A 0.09388564 0.9252 74004 A
3: rs763779009 1 17767 G A A -0.42614801 0.6700 74004 A
4: rs199900651 1 48186 T G G 0.91308089 0.3612 74004 G
Hello!!
We'll need to add to the mapping files accordingly: "PosGRCh37" --> "CHR" "testedAllele" --> "A1"
Do you mean BP not CHR? And the column matching already makes the columns upper case before checking. I'll add these though (and a few others like PosGRCh38 etc)!
On the main issue of not recognising remote sumstats - MungeSumstats wasn't really made to do this (just yet). I think the best solution would be to download the sumstats as an extra step before running any format checks if the file inputted is a https other than an IEU GWAS. It can then be treated like any local sumstats. I think this approach is necessary as the remote file could be a vcf or any other type so it will need to go through the same checks as a local sumstats anyway. Also it is hard to check if a remote file actually exists (see here) so just trying to download it before a lot of other code runs and giving a meaningful error through this seems best.
Let me know if I'm missing anything with this solution? Should get a fix pushed today otherwise.
Cheers
Do you mean BP not CHR? And the column matching already makes the columns upper case before checking. I'll add these though (and a few others like PosGRCh38 etc)!
Yup! My bad, just corrected this in the original post. Thanks!
On the main issue of not recognising remote sumstats - MungeSumstats wasn't really made to do this (just yet). I think the best solution would be to download the sumstats as an extra step before running any format checks if the file inputted is a https other than an IEU GWAS. It can then be treated like any local sumstats. I think this approach is necessary as the remote file could be a vcf or any other type so it will need to go through the same checks as a local sumstats anyway. Also it is hard to check if a remote file actually exists (see here) so just trying to download it before a lot of other code runs and giving a meaningful error through this seems best.
I think I actually added some code to recognize remote URLs, but I only accounted for OpenGWS VCFs. Should be easy enough to adapt to other remote files though, since we could still infer whether it's vcf or tabular format from the URL, and both our read_vcf
function and fread
are able to read in remote files:
https://github.com/neurogenomics/MungeSumstats/blob/c3c30269cbd7ba0f18b32306484988c5e215c732/R/read_vcf.R#L45
You're right though in that checking whether that remote file actually exists is another issue. But I think enabling URLs to be passed to path would be worth it bc it means users don't have to do the extra step of downloading the file beforehand.
Okay it should be able to handle remote sumstats now, I'm using fread to read in the header and then letting the current code figure out what to do with it from there. Note even though you set the n with fread, it still downloads the full file so this is a little bit of a repeated step but doesn't add much time at all if your internet access is okay. See below for the output will push once it passes checks locally:
> gwas_paths <- MungeSumstats::format_sumstats(
+ path = "https://ctg.cncr.nl/documents/p1651/PGCALZ2sumstatsExcluding23andMe.txt.gz",
+ save_path = "~/Downloads/test.tsv.gz",
+ ref_genome = "GRCh37",
+ nThread = 40,
+ force_new = TRUE,
+ #### Record logs
+ #log_folder = "~/Downloads/",
+ #log_folder_ind = TRUE,
+ #log_mungesumstats_msgs = TRUE
+ )
Formatted summary statistics will be saved to ==> /Users/alanmurphy/Downloads/test.tsv.gz
Reading header.
[100%] Downloaded 314825956 bytes...
Tabular format detected.
Importing tabular file: https://ctg.cncr.nl/documents/p1651/PGCALZ2sumstatsExcluding23andMe.txt.gz
[100%] Downloaded 314825956 bytes...
|--------------------------------------------------|
|==================================================|
Standardising column headers.
First line of summary statistics file:
chr PosGRCh37 testedAllele otherAllele z p N
Summary statistics report:
- 12,688,339 rows
- 3,570 genome-wide significant variants (P<5e-8)
- 22 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
There is no SNP column found within the data. It must be inferred from other column information.
Ensuring all SNPs are on the reference genome.
Loading reference genome data.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 6,185 SNPs where neither A1 nor A2 match the reference genome. These will be removed.
There are 871,174 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
22,977 RSIDs are duplicated in the sumstats file. These duplicates will be removed
Checking for SNPs with duplicated base-pair positions.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y CHR uppercase.
Checking for bi-allelic SNPs.
372,165 SNPs are non-biallelic. These will be removed.
N already exists within sumstats_dt.
Could not recognize genome build of:
- target_genome
These will be inferred from the data.
Sorting coordinates.
Writing in tabular format ==> /Users/alanmurphy/Downloads/test.tsv.gz
Summary statistics report:
- 11,998,590 rows (94.6% of original 12,688,339 rows)
- 11,991,381 unique variants
- 2,620 genome-wide significant variants (P<5e-8)
- 22 chromosomes
Successfully finished preparing sumstats file, preview:
Reading header.
SNP CHR BP A1 A2 Z P N
1: rs533499096 1 14860 G T -1.27587418 0.2020 74004
2: rs373918278 1 17765 G A 0.09388564 0.9252 74004
3: rs763779009 1 17767 G A -0.42614801 0.6700 74004
4: rs199900651 1 48186 T G 0.91308089 0.3612 74004
Returning path to saved data.
Awesome, thanks!
Did you have a chance to look at this as well?: https://github.com/neurogenomics/MungeSumstats/issues/99#issuecomment-1098516280
Ah nope forgot that!
Okay so that error was caused because you passed values to SNP that weren't RS IDs (don't start with rs), and didn't set snp_ids_are_rs_ids=TRUE
. So it threw an error when looking on the reference genome for them. I've added a step that if all SNP values don't start with rs id and snp_ids_are_rs_ids=FALSE
, then append rs at the start of each (i.e. "1" => "rs1" etc). This covers use if some sumstats use rs ids but don't start with rs. I've also added a warning for the user that if this happens they are told about the snp_ids_are_rs_ids input variable.
I'll push it in the next few minutes
Ah, i gotcha. But the only reason i did that was because the sumstats were missing a SNP col entirely (see the "Missing SNP col" section in my original post):
Missing SNP col
Lastly, the sumstats are missing a SNP column for some reason.... This causes the following error:
I don't think setting the snp_ids_are_rs_ids
arg would work in this case bc the error occurs at the step where required cols are checked.
Okay so that error was caused because you passed values to SNP that weren't RS IDs (don't start with rs), and didn't set
snp_ids_are_rs_ids=TRUE
. So it threw an error when looking on the reference genome for them. I've added a step that if all SNP values don't start with rs id andsnp_ids_are_rs_ids=FALSE
, then append rs at the start of each (i.e. "1" => "rs1" etc). This covers use if some sumstats use rs ids but don't start with rs. I've also added a warning for the user that if this happens they are told about the snp_ids_are_rs_ids input variable.
Sounds good to me, except I would actually use the prefix "snp" or something other than "rs" simply to make it impossible to accidentally assign a real RSID at the wrong position (which might occur in certain situations where certain checks are turned off.)
Yeah the SNP column wasn't being imputed as normal for you because the BP column also didn't exist so it couldn't use that. It runs fine now (if you want to test the latest version there just to make sure)?
Sounds good to me, except I would actually use the prefix "snp" or something other than "rs" simply to make it impossible to accidentally assign a real RSID at the wrong position (which might occur in certain situations where certain checks are turned off.)
Can't use snp or anything else as this is used to match against the reference genome so needs to be rs.
Yeah the SNP column wasn't being imputed as normal for you because the BP column also didn't exist so it couldn't use that. It runs fine now (if you want to test the latest version there just to make sure)?
OK cool, I'll confirm once my current run finishes.
Can't use snp or anything else as this is used to match against the reference genome so needs to be rs.
What about SNPs where it's simply a chr:bp string? Those aren't simply removed are they? I thought MungeSumstats imputed those. Why not do the same with dummy RSIDS with some non-"rs" prefix? Using the "snp" prefix in the dummy column seemed to work well for me.
Ah actually good point, I remember thinking this through before now.
This change won't be able to pick up SNPs where it's simply a chr:bp string (or any other columns combine which is quite common). I'm going to revert it now.
Adding a prefix like snp won't help since mungesumstats uses the SNP column to impute/check on reference genome so it will throw an error. Only way I can see around it is with the snp_ids_are_rs_ids variable so the SNP column can be renamed in these instances.
Let me know if you see a way around this I'm missing?
1. Bug description
I noticed a couple things when trying to munge the sumstats from the Posthuma 2021 AD GWAS provided here: https://ctg.cncr.nl/software/summary_statistics/
Expected behaviour
2. Reproducible example
Code
Console output
Reading remote sumstats
Looks like the remote file isn't being recognized. I confirmed it's a valid tsv by importing with
fread
manually.New column mappings
Also the header contains:
We'll need to add to the mapping files accordingly:
We already have "tested.allele" and "tested_allele", but not a version with no separator. Given the variety of separators that can be used, I think it would be worth doing some standardization in addition to making cols uppercase. That way we don't have to include every iteration of each uncorrected col in the colmap file. Something like:
Missing SNP col
Lastly, the sumstats are missing a SNP column for some reason.... This causes the following error:
Since we do some imputation of RSIDs at some point, and the genome build is known for this GWAS ("GRCh37") could we just impute the RSIDS of all SNPs and make a new column?
No need to worry about this until after Easter break, just putting here so I don't forget.
3. Session info