Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats

Error in `[.data.table`(sumstats_dt, rsids, `:=`(SNP, i.RefSNP_id)) : When deleting columns, i should not be provided #163

Closed x-level closed 1 year ago

x-level commented 1 year ago

When I use the package to translate chr:pos into rsIDs, it shows the error below. It does not happen every time, though. Could the raw GWAS data be the cause?

Error in `[.data.table`(sumstats_dt, rsids, `:=`(SNP, i.RefSNP_id)) : 
  When deleting columns, i should not be provided
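For readers unfamiliar with the call in the error, here is a minimal sketch of what a data.table update join of this shape does, and of one known way the same message can be triggered. The toy tables and their columns are assumptions for illustration (the coordinates and rsID are taken from the output preview later in this thread), and nothing in the thread establishes that this is what happened inside MungeSumstats. The message text is localized, so it appears in Chinese in the reporter's session.

```r
library(data.table)

# Toy stand-ins for the objects named in the error (assumed shapes, not the real data)
sumstats_dt <- data.table(CHR = c("1", "1"), BP = c(762320L, 861349L))
rsids <- data.table(CHR = "1", BP = 762320L, RefSNP_id = "rs75333668")

# Update join: rows of sumstats_dt that match rsids on CHR and BP
# get SNP filled from the RefSNP_id column of rsids (the `i.` prefix)
sumstats_dt[rsids, SNP := i.RefSNP_id, on = .(CHR, BP)]
print(sumstats_dt)

# Assigning NULL (that is, deleting a column) while i is supplied
# is rejected by data.table and raises an error
delete_err <- tryCatch(
  {
    copy(sumstats_dt)[1L, SNP := NULL]
    NULL
  },
  error = function(e) e
)
print(inherits(delete_err, "error"))
```

Unmatched rows keep `NA` in the new `SNP` column, which is why the join itself is normally harmless even when only some positions have an rsID.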
Al-Murphy commented 1 year ago

Hi,

Can you please provide code and a dataset that replicates the issue you are experiencing so I can try to debug it? Also, please let me know what version of MSS you are using.

Thanks, Alan.
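A small sketch of how the requested version information could be gathered in one go. The helper name `report_versions` is hypothetical and the package list is an assumption; `sessionInfo()` gives a fuller picture if that is preferred.

```r
# Hypothetical helper: report installed versions, NA for packages that are missing
report_versions <- function(pkgs = c("MungeSumstats", "data.table")) {
  vapply(pkgs, function(p) {
    if (requireNamespace(p, quietly = TRUE)) {
      as.character(utils::packageVersion(p))
    } else {
      NA_character_
    }
  }, character(1))
}

print(R.version.string)   # R version of the current session
print(report_versions())  # named character vector of package versions
```

Pasting this output alongside the failing code makes it easier to compare two machines that behave differently.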

x-level commented 1 year ago

library(data.table)
# Read the GWAS Catalog summary statistics and keep genome-wide significant rows
T_CH <- fread(file = "leptin——GCST90007307_buildGRCh37.tsv", header = TRUE)
T_CH <- subset(T_CH, P < 5E-08)
# Note: write.csv() also writes row names by default; read.csv() reads them back as a column named X
write.csv(T_CH, file = "T_CH.csv")

T_CH <- read.csv("T_CH.csv", header = TRUE)
library(BSgenome.Hsapiens.1000genomes.hs37d5)
library(SNPlocs.Hsapiens.dbSNP144.GRCh37)
library(MungeSumstats)
T_CH <- format_sumstats(T_CH, dbSNP = "144", ref_genome = "GRCh37", nThread = 4, return_data = TRUE)

The MSS version is 1.8.0, and the data come from the GWAS Catalog. Some information was missing, but even after I deleted the rows with missing information, the same problem still occurs.
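One detail of the code above worth illustrating: `write.csv()` writes row names by default, and `read.csv()` then reads them back as an extra leading column named `X`. The sketch below uses a made-up two-row data frame (an assumption, not the real data) to show the effect and how `row.names = FALSE` avoids it. Whether the extra column has anything to do with the error is not established in this thread.

```r
# Made-up data frame standing in for the filtered summary statistics
df <- data.frame(chromosome = c(1, 2), p_value = c(1e-9, 3e-10))

tmp_default <- tempfile(fileext = ".csv")
tmp_clean <- tempfile(fileext = ".csv")

write.csv(df, tmp_default)                  # row names are written as a first column
write.csv(df, tmp_clean, row.names = FALSE) # row names are omitted

cols_default <- names(read.csv(tmp_default))
cols_clean <- names(read.csv(tmp_clean))

print(cols_default)
print(cols_clean)
```

Passing the data.table from `fread()` straight to `format_sumstats()` would skip the CSV round trip entirely.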

Al-Murphy commented 1 year ago

You marked this as completed; is your issue resolved?

If not, you will need to attach leptin——GCST90007307_buildGRCh37.tsv or T_CH.csv, or link to where it can be downloaded from, so I can test.

x-level commented 1 year ago

Sorry, I just clicked the wrong button. You can enter the accession number on the GWAS Catalog website to obtain the data: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90007001-GCST90008000/GCST90007307/GCST90007307_buildGRCh37.tsv

Al-Murphy commented 1 year ago

I ran the code with the downloaded data but didn't get an error (see output message below). Could you try installing the dev version of MSS (1.9.15) directly from github and see if the error goes away (devtools::install_github("https://github.com/neurogenomics/MungeSumstats"))? This is the version I'm using.

Output:

> T_CH <- format_sumstats(T_CH, dbSNP = "144",
+                         ref_genome = "GRCh37", 
+                         nThread = 4,
+                         return_data = TRUE)

******::NOTE::******
 - Formatted results will be saved to `tempdir()` by default.
 - This means all formatted summary stats will be deleted upon ending the R session.
 - To keep formatted summary stats, change `save_path`  ( e.g. `save_path=file.path('./formatted',basename(path))` ),   or make sure to copy files elsewhere after processing  ( e.g. `file.copy(save_path, './formatted/' )`.
 ******************** 

Formatted summary statistics will be saved to ==>  /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpYzKmoQ/file1227c432b1b8c.tsv.gz
Standardising column headers.
First line of summary statistics file: 
chromosome  base_pair_location  other_allele    effect_allele   effect_allele_frequency beta    standard_error  p_value sample_size 
Summary statistics report:
   - 232,255 rows
   - 19 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Checking for incorrect base-pair positions
Loading SNPlocs data.
There is no SNP column found within the data. It must be inferred from other column information.
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 228,685 SNPs using BSgenome::snpsById...
Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following object is masked from ‘package:MungeSumstats’:

    sd

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, aperm, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl,
    intersect, is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
    setdiff, sort, table, tapply, union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:data.table’:

    first, second

The following object is masked from ‘package:utils’:

    findMatches

The following objects are masked from ‘package:base’:

    expand.grid, I, unname

Attaching package: ‘IRanges’

The following object is masked from ‘package:data.table’:

    shift

The following object is masked from ‘package:MungeSumstats’:

    desc

BSgenome::snpsById done in 23 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
Checking for missing data.
WARNING: 76 rows in sumstats file are missing data and will be removed.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
Checking for SNPs with duplicated base-pair positions.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Checking for bi-allelic SNPs.
29,308 SNPs are non-biallelic. These will be removed.
8,126 SNPs (4.1%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted  summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, AF_EFF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Sorting coordinates with 'data.table'.
Writing in tabular format ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpYzKmoQ/file1227c432b1b8c.tsv.gz
Summary statistics report:
   - 199,301 rows (85.8% of original 232,255 rows)
   - 199,301 unique variants
   - 18 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Done munging in 1.283 minutes.
Successfully finished preparing sumstats file, preview:
Reading header.
           SNP CHR     BP A1 A2          FRQ        BETA         SE          P     N
1:  rs75333668   1 762320  C  T 0.0105762140 -0.05264141 0.03411245 0.12278880 47330
2: rs200686669   1 861349  C  T 0.0003469262 -0.31162905 0.16523437 0.05929739 50386
3: rs201186828   1 865545  G  A 0.0015638357  0.10247036 0.09582466 0.28491079 35561
4: rs148711625   1 865584  G  A 0.0023884213 -0.10452447 0.06684452 0.11788909 49172
Returning data directly.
Warning messages:
1: replacing previous import ‘utils::findMatches’ by ‘S4Vectors::findMatches’ when loading ‘SNPlocs.Hsapiens.dbSNP144.GRCh37’ 
2: In eval(jsub, SDenv, parent.frame()) : NAs introduced by coercion
3: package ‘IRanges’ was built under R version 4.3.1 
4: package ‘GenomeInfoDb’ was built under R version 4.3.1 
x-level commented 1 year ago

I updated MSS to version 1.9.15, but it still gives the same error as before. I also tried re-installing the "data.table" package, but that did not help.

Formatted summary statistics will be saved to ==>  C:\Users\king\AppData\Local\Temp\Rtmp4k3dwN\file26e82d263781.tsv.gz
Standardising column headers.
First line of summary statistics file: 
X   chromosome  base_pair_location  other_allele    effect_allele   effect_allele_frequency beta    standard_error  p_value sample_size 
Summary statistics report:
   - 22 rows
   - 19 genome-wide significant variants (P<5e-8)
   - 4 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Checking for incorrect base-pair positions
Loading SNPlocs data.
There is no SNP column found within the data. It must be inferred from other column information.
Error in `[.data.table`(sumstats_dt, rsids, `:=`(SNP, i.RefSNP_id)) : 
  When deleting columns, i should not be provided
In addition: Warning message:
replacing previous import ‘utils::findMatches’ by ‘S4Vectors::findMatches’ when loading ‘SNPlocs.Hsapiens.dbSNP155.GRCh37’
x-level commented 1 year ago

Thanks for your kindness and patience. After switching to another computer, I ran the code successfully. It was probably a version problem with R or RStudio.

Al-Murphy commented 1 year ago

Great, glad you got sorted!