Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
75 stars 16 forks source link

Create tabix index failed #117

Closed PhoebeGuo97 closed 1 year ago

PhoebeGuo97 commented 2 years ago

I tried munging my summary statistics using fullSS_path <- MungeSumstats::format_sumstats(path = fullSS_path, ref_genome = "GRCH37",tabix_index=TRUE). It took several days to finish. Here is what I saw in R Console: Tabix-indexing file. [ti_index_core] the file out of order at line 1253647 Create tabix index failed for [ /var/folders/g9/f5z6vsbx7q3_wmk06jz0j9zm0000gn/T//RtmpUeD8qB/filec8a3e5725dc.tsv.bgz ]! Summary statistics report:

I'm wondering how to successfully creat tabix index. Thanks!

Al-Murphy commented 2 years ago

Hard to say what has happened, have you tried to inspect the saved file at the returned path?

I believe the sumstats isn't ordered correctly given below:

Tabix-indexing file.
[ti_index_core] the file out of order at line 1253647
Create tabix index failed for [ /var/folders/g9/f5z6vsbx7q3_wmk06jz0j9zm0000gn/T//RtmpUeD8qB/filec8a3e5725dc.tsv.bgz ]!

If you could provide a small sample dataset that gives the same error I will be happy to debug.

Cheers, Alan.

PhoebeGuo97 commented 2 years ago

Hi Alan,

I was not able to open the file and I don't know if that's because it's damaged. Here is the link of the GWAS dataset I used: https://ctg.cncr.nl/documents/p1651/AD_sumstats_Jansenetal_2019sept.txt.gz

Many thanks!

Best, Phoebe

Al-Murphy commented 2 years ago

Hey!

So firstly, processing this sumstats took 23.17 minutes to run on my machine. Given it took days for you to run it I think you should look to get access to a more powerful machine for future runs.

Secondly, the issue with tabix is frustrating since it is due to an issue with seqminer::tabix.createIndex which we call to make the tabix file. The error you see is:

[ti_index_core] the file out of order at line 769472
Create tabix index failed for [ /tmp/RtmpWtdLC5/file12b21303b55d.tsv.bgz ]!

However I manually check the rows around 769472 and they are sorted correctly based on BP and CHR and they are correctly sorted:

> sumstats_dt[769471:769473,]
          SNP CHR       BP A1 A2    UNIQID.A1A2        Z          P   NSUM      N DIRECTION       FRQ        BETA          SE
1: rs61573637   2 72000000  G  A 2:72000000_A_G 1.098398 0.27203086 434286 427769      ?--+ 0.8914840 0.003817998 0.003475970
2:  rs6725984   2 72000148  A  G 2:72000148_G_A 1.717379 0.08591000 363988 363988      ??-? 0.9848168 0.016460628 0.009584740
3:  rs6546719   2 72000635  G  A 2:72000635_A_G 1.870286 0.06144409 432532 426031      ?+-- 0.5723900 0.004095443 0.002189741

We noted this issue before however it only seems to happen on some sumstats and what appears to be randomly. We have created an issue about it here: https://github.com/zhanxw/seqminer/issues/25. @bschilder have we heard anything back on this?

My advice is to run MSS with tabix_index=FALSE and then if you need to convert it to a tabix file separately after.

Thanks, Alan.

PhoebeGuo97 commented 2 years ago

Hi Alan,

I tried running tabix_index=FALSE and got a new error: Loading SNPlocs data. Error: vector memory exhausted (limit reached?) In addition: Warning message: In [.data.table(sumstats_dt, , :=((names(empty_cols)), NULL)) : Column 'NA' does not exist to remove

Could you please help me fix it? Thanks!

Best, Phoebe

Al-Murphy commented 2 years ago

Hi Phoebe,

So the vector memory exhausted is telling you that you don't have enough RAM to run the sumstats through MSS. You will need to get access to a machine with more RAM (perhaps if you are part of a university, on their HPC).

Cheers, Alan.

bschilder commented 2 years ago

seqminer issues

Regarding seqminer, I've tried reaching out to the authors multiple times through GitHub and email, but unfortunately I haven't heard anything back from them (it's been over a year now). @zhanxw @yang-lina

Given this, I have to assume thatseqminer is no longer being actively maintained, and therefore I think we should remove it from any of our packages and scripts.

I've recently implemented several alternative options for sorting/indexing tabix files within echotabix: https://github.com/RajLabMSSM/echotabix

I will create a new issue about replacing seqminer.

RAM issues

I think the main conclusion from this specific Issue is that @PhoebeGuo97 's machine doesn't have sufficient RAM. So I agree, the best solution is to get access to a machine more suitable for analysis of medium- to large-scale data.

PhoebeGuo97 commented 2 years ago

Hello, I'm wondering how to convert formatted summary statistics to a taxi file if I have to run MSS with tabix_index=FALSE. Thank you!

Al-Murphy commented 2 years ago

Hey, have a look at seqminer::tabix.createIndex() which is what MSS uses but there are other options out there for R which should be easily findable with a quick google

Al-Murphy commented 2 years ago

Hey @PhoebeGuo97, this should now be resolved and the dev version of MSS should allow tabix indexing https://github.com/neurogenomics/MungeSumstats/issues/122

PhoebeGuo97 commented 2 years ago

Hey @PhoebeGuo97, this should now be resolved and the dev version of MSS should allow tabix indexing #122

Hey @Al-Murphy, Thanks for the update. I tried re-running the MSS and got different errors :(

Tabix-indexing file. [E::hts_idx_push] Unsorted positions on sequence #2: 71999998 followed by 7 Error in value[3L] : index build failed file: /mnt/belinda_local/fangfei/data/Jansen2019/formatted_Jansen2019.tsv.bgz In addition: Warning message: In withCallingHandlers(expr, message = function(c) if (inherits(c, : NAs introduced by coercion

Al-Murphy commented 2 years ago

Hey @PhoebeGuo97,

Is this with the same sumstats you linked above? - https://ctg.cncr.nl/documents/p1651/AD_sumstats_Jansenetal_2019sept.txt.gz

Can you confirm the version of MSS you are using to get this error?

Alan.

PhoebeGuo97 commented 2 years ago

Hi @Al-Murphy,

Yes. The version of MSS is 1.5.14. Thanks!

Phoebe

Al-Murphy commented 2 years ago

@bschilder any clue why this error is occurring with the changed tabix code?

Thanks!

bschilder commented 1 year ago

Hey @Al-Murphy and @PhoebeGuo97 , sorry it took me so long to get to this. Managed to replicate the issue, just trying to figure out the source of the issue.

[E::hts_idx_push] Unsorted positions on sequence

This error usually occurs when:

  1. the variants are not in order of increasing genomic position: this should be taken care of by the sorting step in write_sumstats, handled by sort_coords. But perhaps there's some use cases where this sorting strategy is incomplete?
  2. there's duplicate positions across rows: this shouldn't be an issue since we remove non-biallelic SNPs by default, but I'll double check that there aren't any leftover duplicate positions.

Here's the reprex:

        res <- MungeSumstats::format_sumstats(path = "https://ctg.cncr.nl/documents/p1651/AD_sumstats_Jansenetal_2019sept.txt.gz",
                                       ref_genome = "GRCH37",
                                       tabix_index=FALSE)
### Fails ###

        MungeSumstats::write_sumstats(sumstats_dt = sumstats_dt, 
                                      save_path="~/Downloads/sumstats_dt.tsv.gz",
                                      tabix_index = TRUE)
# Error: [E::hts_idx_push] Unsorted positions on sequence #2: 71999998 followed by 7

### Find offending rows ###
 sumstats_dt <- data.table::fread(res)
i <- which(sumstats_dt$CHR==2 & sumstats_dt$BP==71999998)
sumstats_dt[seq(i-2,i+2)]
           SNP CHR       BP A1 A2    UNIQID.A1A2          Z         P   NSUM      N
1:   rs9808550   2 71998680  C  T 2:71998680_T_C  1.6951867 0.0900400 364123 364123
2:  rs72908622   2 71998853  C  T 2:71998853_T_C -0.5079335 0.6115000 364804 364804
3: rs181535630   2 71999998  T  C 2:71999998_C_T  0.2520531 0.8010000 364718 364718
4:  rs61573637   2 72000000  G  A 2:72000000_A_G  1.0983977 0.2720309 434286 427769
5:   rs6725984   2 72000148  A  G 2:72000148_G_A  1.7173787 0.0859100 363988 363988
   DIRECTION       FRQ         BETA          SE
1:      ??-? 0.9843086  0.015983809 0.009428937
2:      ??+? 0.9963967 -0.009924159 0.019538306
3:      ??-? 0.9986449  0.008022454 0.031828425
4:      ?--+ 0.8914840  0.003817998 0.003475970
5:      ??-? 0.9848168  0.016460628 0.009584740

I believe "sequence" refers to CHR, and numbers refer to the BP. But these positions don't seem to be out of order.

bschilder commented 1 year ago

I also checked the cdict (used within the index_tabular function) to see if perhaps that using the wrong columns during indexing. But that also seems to be fine:

cdict <- MungeSumstats:::column_dictionary(file_path = res)
cdict
     SNP         CHR          BP          A1          A2 UNIQID.A1A2           Z 
          1           2           3           4           5           6           7 
          P        NSUM           N   DIRECTION         FRQ        BETA          SE 
          8           9          10          11          12          13          14 
bschilder commented 1 year ago

Ok, so sorting using a bash command line wrapper in R (instead of data.table) seems to fix this issue:

 res_sort <- echotabix::sort_coordinates(target_path = res, 
                                                chrom_col = "CHR", 
                                                start_col = "BP",
                                                save_path = "tmp.tsv.gz")
Inferred format: 'table'
Inferring comment_char from tabular header: 'SNP'
Determining chrom type from file header.
Chromosome format: 1
Detecting column delimiter.
Identified column separator: \t
Sorting rows by coordinates via bash.
Searching for header row with zgrep.
( zgrep ^'SNP' .../filed2b063890391.tsv.gz; zgrep
    -v ^'SNP' .../filed2b063890391.tsv.gz | sort
    -k2,2n
    -k3,3n ) > tmp.tsv
|--------------------------------------------------|
|==================================================|
|--------------------------------------------------|
|==================================================|
Constructing outputs
out <- MungeSumstats::index_tabular(path = res_sort$path)
Converting full summary stats file to tabix format for fast querying...
Reading header.
Ensuring file is bgzipped.
Tabix-indexing file.
Removing temproary .tsv file.

But I still don't really know why sorting using data.table doesn't work in all scenarios. I'd prefer to avoid using system() to call command line (as echotabix::sort_coordinates does) as it introduces potential issues with cross-platform system dependencies and syntax differences.

bschilder commented 1 year ago

Just found a third option via GenomicRanges:

gr <- to_granges(sumstats_dt)
gr <- GenomicRanges::sort.GenomicRanges(gr)
sumstats_dt <- granges_to_dt(gr)

This isn't nearly as efficient as the data.table-native solution, but it's more reliable when it comes to indexing, without requiring system() calls to command line. I've added this as an option selectable with sort_coords(sort_method=) This option is more so for us if we need to change it back at some point, rather than for users (tho we can consider passing it up to exported functions as a new arg @Al-Murphy ).

bschilder commented 1 year ago

Something else to consider in for the future is dissecting the GenomicRanges::sort.GenomicRanges source code in more detail and developing a data.table-native version of whatever they're doing (to increase efficiency while retaining robustness).

bschilder commented 1 year ago

I also wonder if maybe data.table::setorderv is ignoring our factor levels?

Al-Murphy commented 1 year ago

Could be ignoring factors since factors aren't supported in data.table. What column has factors? There shouldn't be any I don't think? Might be a simple fix to just change these from factors before sorting?

bschilder commented 1 year ago

Could be ignoring factors since factors aren't supported in data.table. What column has factors? There shouldn't be any I don't think? Might be a simple fix to just change these from factors before sorting?

https://github.com/neurogenomics/MungeSumstats/blob/8915df741aba538d778a1a3e27a9695f8c35680d/R/sort_coordinates.R#L24

Didn't know factors weren't supported at all!

Al-Murphy commented 1 year ago

I think it's more that they advise against them is all. Maybe try moving that check to after sorting the order? Should cover this? Not sure of any downstream issues so worth checking?

Also just as a side note for when you are working on this, the push you made to use Rworkflows now makes checks run on Windows again which I had removed. Could you change it so it won't run on Windows again (comment out the - { os: windows-latest, r: 'latest', bioc: 'release'} line)? Also it is now failing since this change removed the increased download time

trying URL 'https://bioconductor.org/packages/3.16/data/annotation/src/contrib/SNPlocs.Hsapiens.dbSNP155.GRCh37_0.99.22.tar.gz'
Content type 'application/x-gzip' length 6049008307 bytes (5768.8 MB)
====================================
downloaded 4172.2 MB

Error in download.file(url, destfile, method, mode = "wb", ...) : 
  download from 'https://bioconductor.org/packages/3.16/data/annotation/src/contrib/SNPlocs.Hsapiens.dbSNP155.GRCh37_0.99.22.tar.gz' failed

Could you also add back in options(timeout=2000) to the checks? I put it in the install dependencies pass 1,2,3 and that seemed to do the trick

bschilder commented 1 year ago

I think it's more that they advise against them is all. Maybe try moving that check to after sorting the order? Should cover this? Not sure of any downstream issues so worth checking?

Which check do you mean?

Writing/indexing always occurs after sorting. https://github.com/neurogenomics/MungeSumstats/blob/8915df741aba538d778a1a3e27a9695f8c35680d/R/write_sumstats.R#L54

Also just as a side note for when you are working on this, the push you made to use Rworkflows now makes checks run on Windows again which I had removed. Could you change it so it won't run on Windows again (comment out the - { os: windows-latest, r: 'latest', bioc: 'release'} line)? Also it is now failing since this change removed the increased download time

Sure, will do.

Could you also add back in options(timeout=2000) to the checks? I put it in the install dependencies pass 1,2,3 and that seemed to do the trick

Where did you add this exactly? I've found four instances where this was specified: https://github.com/neurogenomics/MungeSumstats/blob/573d267132f440b02887149df4ea0c85661cf74c/.github/workflows/check-bioc-docker.yml

This would now be done at the level of the rworkflows action itself. Might be worth me adding this as an extra arg users can control tho.

Al-Murphy commented 1 year ago

Which check do you mean?

Apologies I misunderstood what the factor call was for - I implemented it so X,Y,MT chromosomes would come after the others. The column is changed back to a factor before returning. Could you share a version of the sumstats that is passed into the sort_coordinates function in the version with the error you can replicate? Obviously, it would be best to debug what's causing the issue to keep the commands within data.table without changing to genomic ranges to help with speed. If it is actually a bug with data.table I think we should raise it with them to fix rather than using the genomic ranges approach.

I also wonder if maybe data.table::setorderv is ignoring our factor levels? Are there factor columns in the dataset other than the CHR column which we create? There shouldn't be.

Thanks for sorting those two side things!!

bschilder commented 1 year ago

Cool, thanks for clarifying @Al-Murphy. I'll work on this a bit more and see if I can figure out a data.table-native solution.

bschilder commented 1 year ago

Ok, so I think I've fixed this in a way that keep the speed benefits of data.table.

There were two main issues I identified:

  1. Genomic position columns (i.e. "BP") must be integers (not numeric) or else this somehow screws up tabix during indexing. Added some code to code to ensure this before writing to disk.
  2. To more systematically ensure chromosome names are standardized, I now used the GenomeInfoDb::rankSeqlevels to identify the correct order of the "CHR" levels.

I've added a new subfunction sort_coords_datatable so we can easily swap it out in case we identify another method in the future. Keeping the alternative data.table --> granges -- sort --> data.table approach as backup alternative for now in case it's useful in the future, and have added some quick unit tests so it wont affect code coverage.

Will push these changes today.

bschilder commented 1 year ago

GHA is failing due to BSgenome currently failing on Bioc release and devel: https://github.com/neurogenomics/MungeSumstats/actions/runs/3847722518/jobs/6554542961#step:2:1226

https://bioconductor.org/packages/release/bioc/html/BSgenome.html https://bioconductor.org/packages/devel/bioc/html/BSgenome.html

Will keep an eye on BSgenome and rerun when it's fixed (hopefully soon).

Al-Murphy commented 1 year ago

Great thanks Brian! Let me know how that goes with BSgenome

bschilder commented 1 year ago

Found out the full story about these deps not being available from Herve, who kindly took the time to explain it! https://github.com/Bioconductor/GenomicRanges/pull/73#issuecomment-1373070227

They should all be back working v soon.

Al-Murphy commented 1 year ago

@bschilder tabix indexing still seems to be failing in 1.7.13 (dev bioconductor):

> fullSS_path <- "~/Downloads/AD_sumstats_Jansenetal_2019sept.txt.gz"
> fullSS_path <- "~/Downloads/AD_sumstats_Jansenetal_2019sept.txt.gz"
> fullSS_path <- MungeSumstats::format_sumstats(path = fullSS_path, 
+                                               ref_genome = "GRCH37",
+                                               dbSNP = 144,
+                                               tabix_index=TRUE)
******::NOTE::******
 - Formatted results will be saved to `tempdir()` by default.
 - This means all formatted summary stats will be deleted upon ending the R session.
 - To keep formatted summary stats, change `save_path`  ( e.g. `save_path=file.path('./formatted',basename(path))` ),   or make sure to copy files elsewhere after processing  ( e.g. `file.copy(save_path, './formatted/' )`.
 ******************** 

save_path suggests .gz output but tabix_index=TRUE Switching output to tabix-indexed format (.bgz).
Formatted summary statistics will be saved to ==>  /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpeSxzIN/filea17e476cac2a.tsv.bgz
Importing tabular file: ~/Downloads/AD_sumstats_Jansenetal_2019sept.txt.gz
|--------------------------------------------------|
|==================================================|
Checking for empty columns.
Standardising column headers.
First line of summary statistics file: 
uniqID.a1a2 CHR BP  A1  A2  SNP Z   P   Nsum    Neff    dir EAF BETA    SE  
Summary statistics report:
   - 13,367,299 rows
   - 13,354,030 unique variants
   - 2,394 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
11,599 SNP IDs are not correctly formatted. These will be corrected from the reference genome.
Loading SNPlocs data.
165,861 SNP IDs appear to be made up of chr:bp, these will be replaced by their SNP ID from the reference genome
Loading SNPlocs data.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 13,307,214 SNPs using BSgenome::snpsById...
Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, aperm, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval,
    evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order,
    paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

    expand.grid, I, unname

BSgenome::snpsById done in 118 seconds.
1,161,417 SNPs are not on the reference genome. These will be corrected from the reference genome.
Loading SNPlocs data.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 12,315,061 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 146 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 167,501 SNPs where neither A1 nor A2 match the reference genome. These will be removed.
There are 9,761,158 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
WARNING: 49,856 rows in sumstats file are missing data and will be removed.
Checking for duplicate columns.
Ensuring that the N column is all integers.
The sumstats N column is not all integers, this could effect downstream analysis. These will be converted to integers.
Checking for duplicate SNPs from SNP ID.
7,019 RSIDs are duplicated in the sumstats file. These duplicates will be removed
Checking for SNPs with duplicated base-pair positions.
93 base-pair positions are duplicated in the sumstats file. These duplicates will be removed.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y/MT CHR uppercase.
Checking for bi-allelic SNPs.
332,624 SNPs are non-biallelic. These will be removed.
N already exists within sumstats_dt.
9,454,492 SNPs (80.3%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted  summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, AF_EFF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Sorting coordinates.
Sorting coordinates.
Writing in tabular format ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpeSxzIN/filea17e476cac2a.tsv
Writing uncomressed instead of gzipped to enable index
Converting full summary stats file to tabix format for fast querying...                                                              
Reading header.
Ensuring file is bgzipped.
Tabix-indexing file.
[E::hts_idx_push] Unsorted positions on sequence #2: 71999998 followed by 7
Error in value[[3L]](cond) : index build failed
  file: /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpeSxzIN/filea17e476cac2a.tsv.bgz
In addition: Warning message:
In withCallingHandlers(expr, message = function(c) if (inherits(c,  :
  NAs introduced by coercion
bschilder commented 1 year ago

@Al-Murphy this now seems to be working consistently for me. Some of the dep source code may have been fixed since you last checked.

Al-Murphy commented 1 year ago

Great closing for now, we can reopen if it happens again

anbai106 commented 1 year ago

@Al-Murphy

I have the same issue with version (1.2.4) using the example data that you provide here: https://bioconductor.org/packages/release/bioc/vignettes/MungeSumstats/inst/doc/MungeSumstats.html

library(MungeSumstats)

eduAttainOkbayPth <- system.file("extdata", "eduAttainOkbay.txt", package = "MungeSumstats") formatted_path <- "/home/hao/Downloads/eduAttainOkbay_standardised.tsv.gz"

1. Read in the data and standardise header names

dat <- MungeSumstats::read_sumstats(path = eduAttainOkbayPth, standardise_headers = TRUE)

2. Write to disk as a compressed, tab-delimited, tabix-indexed file

formatted_path <- MungeSumstats::write_sumstats(sumstats_dt = dat, save_path = formatted_path, tabix_index = TRUE, write_vcf = FALSE,

                                            return_path = TRUE) 

Output is here:

Reading header. Tabular format detected. Importing tabular file: /home/hao/R/x86_64-pc-linux-gnu-library/4.1/MungeSumstats/extdata/eduAttainOkbay.txt Standardising column headers. First line of summary statistics file: MarkerName CHR POS A1 A2 EAF Beta SE Pval
Writing in tabular format ==> /home/hao/Downloads/eduAttainOkbay_standardised.tsv.gz Converting full summary stats file to tabix format for fast querying... Reading header. Ensuring file is bgzipped. Tabix-indexing file. [ti_index_core] the file out of order at line 5 Create tabix index failed for [ /home/hao/Downloads/eduAttainOkbay_standardised.tsv.bgz ]!

Do we have a solution for this? Because I need to convert my tab-delimiter GWAS summary into the tabix-index file

My Bioconductor version (Bioconductor version 3.14 (BiocManager 1.30.21), R 4.1.2 (2021-11-01))

Al-Murphy commented 1 year ago

Hey @anbai106,

MSS version 1.2 is now very old and there has been a lot of changes between that and the current release version 1.6.0.

Could you please try updating to the current release and see if this resolves your issue?

Cheers, Alan.

anbai106 commented 1 year ago

@Al-Murphy

Thanks for your reply. Somehow, I followed your tutorial to install the package, but the version that I got is 1.2.4:

Screenshot from 2023-06-23 09-06-48

I don't know if this is because of my R and Bioconductor versions, but it seems ok

Al-Murphy commented 1 year ago

Yep it's because of your R version, you need to install at least R v4.3.0

anbai106 commented 1 year ago

@Al-Murphy

Thanks! I will try this wonderful package later because I don't want to upgrade my R version for now.

A side point: R and Rstudio should allow users to easily switch different versions of R on Linux machines, like Python/Pycharms, so that we do not need to worry much about incompatibility across different R packages due to version issues.

Best regards

bschilder commented 1 year ago

Hi @anbai106, while it does take a couple extra steps, you can have multiple versions of R/Rstudio installed on your computer at once by making containers: https://neurogenomics.github.io/MungeSumstats/articles/docker.html