Closed PhoebeGuo97 closed 1 year ago
Hard to say what has happened, have you tried to inspect the saved file at the returned path?
I believe the sumstats isn't ordered correctly given below:
Tabix-indexing file.
[ti_index_core] the file out of order at line 1253647
Create tabix index failed for [ /var/folders/g9/f5z6vsbx7q3_wmk06jz0j9zm0000gn/T//RtmpUeD8qB/filec8a3e5725dc.tsv.bgz ]!
If you could provide a small sample dataset that gives the same error I will be happy to debug.
Cheers, Alan.
Hi Alan,
I was not able to open the file and I don't know if that's because it's damaged. Here is the link of the GWAS dataset I used: https://ctg.cncr.nl/documents/p1651/AD_sumstats_Jansenetal_2019sept.txt.gz
Many thanks!
Best, Phoebe
Hey!
So firstly, processing this sumstats took 23.17 minutes to run on my machine. Given it took days for you to run it I think you should look to get access to a more powerful machine for future runs.
Secondly, the issue with tabix is frustrating since it is due to an issue with seqminer::tabix.createIndex which we call to make the tabix file. The error you see is:
[ti_index_core] the file out of order at line 769472
Create tabix index failed for [ /tmp/RtmpWtdLC5/file12b21303b55d.tsv.bgz ]!
However I manually check the rows around 769472 and they are sorted correctly based on BP and CHR and they are correctly sorted:
> sumstats_dt[769471:769473,]
SNP CHR BP A1 A2 UNIQID.A1A2 Z P NSUM N DIRECTION FRQ BETA SE
1: rs61573637 2 72000000 G A 2:72000000_A_G 1.098398 0.27203086 434286 427769 ?--+ 0.8914840 0.003817998 0.003475970
2: rs6725984 2 72000148 A G 2:72000148_G_A 1.717379 0.08591000 363988 363988 ??-? 0.9848168 0.016460628 0.009584740
3: rs6546719 2 72000635 G A 2:72000635_A_G 1.870286 0.06144409 432532 426031 ?+-- 0.5723900 0.004095443 0.002189741
We noted this issue before however it only seems to happen on some sumstats and what appears to be randomly. We have created an issue about it here: https://github.com/zhanxw/seqminer/issues/25. @bschilder have we heard anything back on this?
My advice is to run MSS with tabix_index=FALSE
and then if you need to convert it to a tabix file separately after.
Thanks, Alan.
Hi Alan,
I tried running tabix_index=FALSE and got a new error:
Loading SNPlocs data.
Error: vector memory exhausted (limit reached?)
In addition: Warning message:
In [.data.table
(sumstats_dt, , :=
((names(empty_cols)), NULL)) :
Column 'NA' does not exist to remove
Could you please help me fix it? Thanks!
Best, Phoebe
Hi Phoebe,
So the vector memory exhausted is telling you that you don't have enough RAM to run the sumstats through MSS. You will need to get access to a machine with more RAM (perhaps if you are part of a university, on their HPC).
Cheers, Alan.
seqminer
issuesRegarding seqminer
, I've tried reaching out to the authors multiple times through GitHub and email, but unfortunately I haven't heard anything back from them (it's been over a year now). @zhanxw @yang-lina
Given this, I have to assume thatseqminer
is no longer being actively maintained, and therefore I think we should remove it from any of our packages and scripts.
I've recently implemented several alternative options for sorting/indexing tabix files within echotabix
:
https://github.com/RajLabMSSM/echotabix
I will create a new issue about replacing seqminer
.
I think the main conclusion from this specific Issue is that @PhoebeGuo97 's machine doesn't have sufficient RAM. So I agree, the best solution is to get access to a machine more suitable for analysis of medium- to large-scale data.
Hello, I'm wondering how to convert formatted summary statistics to a taxi file if I have to run MSS with tabix_index=FALSE. Thank you!
Hey, have a look at seqminer::tabix.createIndex() which is what MSS uses but there are other options out there for R which should be easily findable with a quick google
Hey @PhoebeGuo97, this should now be resolved and the dev version of MSS should allow tabix indexing https://github.com/neurogenomics/MungeSumstats/issues/122
Hey @PhoebeGuo97, this should now be resolved and the dev version of MSS should allow tabix indexing #122
Hey @Al-Murphy, Thanks for the update. I tried re-running the MSS and got different errors :(
Tabix-indexing file. [E::hts_idx_push] Unsorted positions on sequence #2: 71999998 followed by 7 Error in value[3L] : index build failed file: /mnt/belinda_local/fangfei/data/Jansen2019/formatted_Jansen2019.tsv.bgz In addition: Warning message: In withCallingHandlers(expr, message = function(c) if (inherits(c, : NAs introduced by coercion
Hey @PhoebeGuo97,
Is this with the same sumstats you linked above? - https://ctg.cncr.nl/documents/p1651/AD_sumstats_Jansenetal_2019sept.txt.gz
Can you confirm the version of MSS you are using to get this error?
Alan.
Hi @Al-Murphy,
Yes. The version of MSS is 1.5.14. Thanks!
Phoebe
@bschilder any clue why this error is occurring with the changed tabix code?
Thanks!
Hey @Al-Murphy and @PhoebeGuo97 , sorry it took me so long to get to this. Managed to replicate the issue, just trying to figure out the source of the issue.
[E::hts_idx_push] Unsorted positions on sequence
This error usually occurs when:
write_sumstats
, handled by sort_coords
. But perhaps there's some use cases where this sorting strategy is incomplete?Here's the reprex:
res <- MungeSumstats::format_sumstats(path = "https://ctg.cncr.nl/documents/p1651/AD_sumstats_Jansenetal_2019sept.txt.gz",
ref_genome = "GRCH37",
tabix_index=FALSE)
### Fails ###
MungeSumstats::write_sumstats(sumstats_dt = sumstats_dt,
save_path="~/Downloads/sumstats_dt.tsv.gz",
tabix_index = TRUE)
# Error: [E::hts_idx_push] Unsorted positions on sequence #2: 71999998 followed by 7
### Find offending rows ###
sumstats_dt <- data.table::fread(res)
i <- which(sumstats_dt$CHR==2 & sumstats_dt$BP==71999998)
sumstats_dt[seq(i-2,i+2)]
SNP CHR BP A1 A2 UNIQID.A1A2 Z P NSUM N
1: rs9808550 2 71998680 C T 2:71998680_T_C 1.6951867 0.0900400 364123 364123
2: rs72908622 2 71998853 C T 2:71998853_T_C -0.5079335 0.6115000 364804 364804
3: rs181535630 2 71999998 T C 2:71999998_C_T 0.2520531 0.8010000 364718 364718
4: rs61573637 2 72000000 G A 2:72000000_A_G 1.0983977 0.2720309 434286 427769
5: rs6725984 2 72000148 A G 2:72000148_G_A 1.7173787 0.0859100 363988 363988
DIRECTION FRQ BETA SE
1: ??-? 0.9843086 0.015983809 0.009428937
2: ??+? 0.9963967 -0.009924159 0.019538306
3: ??-? 0.9986449 0.008022454 0.031828425
4: ?--+ 0.8914840 0.003817998 0.003475970
5: ??-? 0.9848168 0.016460628 0.009584740
I believe "sequence" refers to CHR, and numbers refer to the BP. But these positions don't seem to be out of order.
I also checked the cdict
(used within the index_tabular
function) to see if perhaps that using the wrong columns during indexing. But that also seems to be fine:
cdict <- MungeSumstats:::column_dictionary(file_path = res)
cdict
SNP CHR BP A1 A2 UNIQID.A1A2 Z
1 2 3 4 5 6 7
P NSUM N DIRECTION FRQ BETA SE
8 9 10 11 12 13 14
Ok, so sorting using a bash command line wrapper in R (instead of data.table
) seems to fix this issue:
res_sort <- echotabix::sort_coordinates(target_path = res,
chrom_col = "CHR",
start_col = "BP",
save_path = "tmp.tsv.gz")
Inferred format: 'table'
Inferring comment_char from tabular header: 'SNP'
Determining chrom type from file header.
Chromosome format: 1
Detecting column delimiter.
Identified column separator: \t
Sorting rows by coordinates via bash.
Searching for header row with zgrep.
( zgrep ^'SNP' .../filed2b063890391.tsv.gz; zgrep
-v ^'SNP' .../filed2b063890391.tsv.gz | sort
-k2,2n
-k3,3n ) > tmp.tsv
|--------------------------------------------------|
|==================================================|
|--------------------------------------------------|
|==================================================|
Constructing outputs
out <- MungeSumstats::index_tabular(path = res_sort$path)
Converting full summary stats file to tabix format for fast querying...
Reading header.
Ensuring file is bgzipped.
Tabix-indexing file.
Removing temproary .tsv file.
But I still don't really know why sorting using data.table
doesn't work in all scenarios. I'd prefer to avoid using system()
to call command line (as echotabix::sort_coordinates
does) as it introduces potential issues with cross-platform system dependencies and syntax differences.
Just found a third option via GenomicRanges
:
gr <- to_granges(sumstats_dt)
gr <- GenomicRanges::sort.GenomicRanges(gr)
sumstats_dt <- granges_to_dt(gr)
This isn't nearly as efficient as the data.table
-native solution, but it's more reliable when it comes to indexing, without requiring system()
calls to command line. I've added this as an option selectable with sort_coords(sort_method=)
This option is more so for us if we need to change it back at some point, rather than for users (tho we can consider passing it up to exported functions as a new arg @Al-Murphy ).
Something else to consider in for the future is dissecting the GenomicRanges::sort.GenomicRanges
source code in more detail and developing a data.table
-native version of whatever they're doing (to increase efficiency while retaining robustness).
I also wonder if maybe data.table::setorderv
is ignoring our factor levels?
Could be ignoring factors since factors aren't supported in data.table. What column has factors? There shouldn't be any I don't think? Might be a simple fix to just change these from factors before sorting?
Could be ignoring factors since factors aren't supported in data.table. What column has factors? There shouldn't be any I don't think? Might be a simple fix to just change these from factors before sorting?
Didn't know factors weren't supported at all!
I think it's more that they advise against them is all. Maybe try moving that check to after sorting the order? Should cover this? Not sure of any downstream issues so worth checking?
Also just as a side note for when you are working on this, the push you made to use Rworkflows now makes checks run on Windows again which I had removed. Could you change it so it won't run on Windows again (comment out the - { os: windows-latest, r: 'latest', bioc: 'release'}
line)? Also it is now failing since this change removed the increased download time
trying URL 'https://bioconductor.org/packages/3.16/data/annotation/src/contrib/SNPlocs.Hsapiens.dbSNP155.GRCh37_0.99.22.tar.gz'
Content type 'application/x-gzip' length 6049008307 bytes (5768.8 MB)
====================================
downloaded 4172.2 MB
Error in download.file(url, destfile, method, mode = "wb", ...) :
download from 'https://bioconductor.org/packages/3.16/data/annotation/src/contrib/SNPlocs.Hsapiens.dbSNP155.GRCh37_0.99.22.tar.gz' failed
Could you also add back in options(timeout=2000)
to the checks? I put it in the install dependencies pass 1,2,3 and that seemed to do the trick
I think it's more that they advise against them is all. Maybe try moving that check to after sorting the order? Should cover this? Not sure of any downstream issues so worth checking?
Which check do you mean?
Writing/indexing always occurs after sorting. https://github.com/neurogenomics/MungeSumstats/blob/8915df741aba538d778a1a3e27a9695f8c35680d/R/write_sumstats.R#L54
Also just as a side note for when you are working on this, the push you made to use Rworkflows now makes checks run on Windows again which I had removed. Could you change it so it won't run on Windows again (comment out the
- { os: windows-latest, r: 'latest', bioc: 'release'}
line)? Also it is now failing since this change removed the increased download time
Sure, will do.
Could you also add back in
options(timeout=2000)
to the checks? I put it in the install dependencies pass 1,2,3 and that seemed to do the trick
Where did you add this exactly?
I've found four instances where this was specified:
https://github.com/neurogenomics/MungeSumstats/blob/573d267132f440b02887149df4ea0c85661cf74c/.github/workflows/check-bioc-docker.yml
This would now be done at the level of the rworkflows action itself. Might be worth me adding this as an extra arg users can control tho.
Which check do you mean?
Apologies I misunderstood what the factor call was for - I implemented it so X,Y,MT chromosomes would come after the others. The column is changed back to a factor before returning. Could you share a version of the sumstats that is passed into the sort_coordinates function in the version with the error you can replicate? Obviously, it would be best to debug what's causing the issue to keep the commands within data.table without changing to genomic ranges to help with speed. If it is actually a bug with data.table I think we should raise it with them to fix rather than using the genomic ranges approach.
I also wonder if maybe data.table::setorderv is ignoring our factor levels? Are there factor columns in the dataset other than the CHR column which we create? There shouldn't be.
Thanks for sorting those two side things!!
Cool, thanks for clarifying @Al-Murphy. I'll work on this a bit more and see if I can figure out a data.table
-native solution.
Ok, so I think I've fixed this in a way that keep the speed benefits of data.table
.
There were two main issues I identified:
GenomeInfoDb::rankSeqlevels
to identify the correct order of the "CHR" levels.I've added a new subfunction sort_coords_datatable
so we can easily swap it out in case we identify another method in the future. Keeping the alternative data.table --> granges -- sort --> data.table approach as backup alternative for now in case it's useful in the future, and have added some quick unit tests so it wont affect code coverage.
Will push these changes today.
GHA is failing due to BSgenome
currently failing on Bioc release and devel:
https://github.com/neurogenomics/MungeSumstats/actions/runs/3847722518/jobs/6554542961#step:2:1226
https://bioconductor.org/packages/release/bioc/html/BSgenome.html https://bioconductor.org/packages/devel/bioc/html/BSgenome.html
Will keep an eye on BSgenome
and rerun when it's fixed (hopefully soon).
Great thanks Brian! Let me know how that goes with BSgenome
Found out the full story about these deps not being available from Herve, who kindly took the time to explain it! https://github.com/Bioconductor/GenomicRanges/pull/73#issuecomment-1373070227
They should all be back working v soon.
@bschilder tabix indexing still seems to be failing in 1.7.13 (dev bioconductor):
> fullSS_path <- "~/Downloads/AD_sumstats_Jansenetal_2019sept.txt.gz"
> fullSS_path <- "~/Downloads/AD_sumstats_Jansenetal_2019sept.txt.gz"
> fullSS_path <- MungeSumstats::format_sumstats(path = fullSS_path,
+ ref_genome = "GRCH37",
+ dbSNP = 144,
+ tabix_index=TRUE)
******::NOTE::******
- Formatted results will be saved to `tempdir()` by default.
- This means all formatted summary stats will be deleted upon ending the R session.
- To keep formatted summary stats, change `save_path` ( e.g. `save_path=file.path('./formatted',basename(path))` ), or make sure to copy files elsewhere after processing ( e.g. `file.copy(save_path, './formatted/' )`.
********************
save_path suggests .gz output but tabix_index=TRUE Switching output to tabix-indexed format (.bgz).
Formatted summary statistics will be saved to ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpeSxzIN/filea17e476cac2a.tsv.bgz
Importing tabular file: ~/Downloads/AD_sumstats_Jansenetal_2019sept.txt.gz
|--------------------------------------------------|
|==================================================|
Checking for empty columns.
Standardising column headers.
First line of summary statistics file:
uniqID.a1a2 CHR BP A1 A2 SNP Z P Nsum Neff dir EAF BETA SE
Summary statistics report:
- 13,367,299 rows
- 13,354,030 unique variants
- 2,394 genome-wide significant variants (P<5e-8)
- 22 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
11,599 SNP IDs are not correctly formatted. These will be corrected from the reference genome.
Loading SNPlocs data.
165,861 SNP IDs appear to be made up of chr:bp, these will be replaced by their SNP ID from the reference genome
Loading SNPlocs data.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 13,307,214 SNPs using BSgenome::snpsById...
Loading required package: BiocGenerics
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, aperm, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval,
evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order,
paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: ‘S4Vectors’
The following objects are masked from ‘package:base’:
expand.grid, I, unname
BSgenome::snpsById done in 118 seconds.
1,161,417 SNPs are not on the reference genome. These will be corrected from the reference genome.
Loading SNPlocs data.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 12,315,061 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 146 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 167,501 SNPs where neither A1 nor A2 match the reference genome. These will be removed.
There are 9,761,158 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
WARNING: 49,856 rows in sumstats file are missing data and will be removed.
Checking for duplicate columns.
Ensuring that the N column is all integers.
The sumstats N column is not all integers, this could effect downstream analysis. These will be converted to integers.
Checking for duplicate SNPs from SNP ID.
7,019 RSIDs are duplicated in the sumstats file. These duplicates will be removed
Checking for SNPs with duplicated base-pair positions.
93 base-pair positions are duplicated in the sumstats file. These duplicates will be removed.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y/MT CHR uppercase.
Checking for bi-allelic SNPs.
332,624 SNPs are non-biallelic. These will be removed.
N already exists within sumstats_dt.
9,454,492 SNPs (80.3%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, AF_EFF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Sorting coordinates.
Sorting coordinates.
Writing in tabular format ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpeSxzIN/filea17e476cac2a.tsv
Writing uncomressed instead of gzipped to enable index
Converting full summary stats file to tabix format for fast querying...
Reading header.
Ensuring file is bgzipped.
Tabix-indexing file.
[E::hts_idx_push] Unsorted positions on sequence #2: 71999998 followed by 7
Error in value[[3L]](cond) : index build failed
file: /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpeSxzIN/filea17e476cac2a.tsv.bgz
In addition: Warning message:
In withCallingHandlers(expr, message = function(c) if (inherits(c, :
NAs introduced by coercion
@Al-Murphy this now seems to be working consistently for me. Some of the dep source code may have been fixed since you last checked.
Great closing for now, we can reopen if it happens again
@Al-Murphy
I have the same issue with version (1.2.4) using the example data that you provide here: https://bioconductor.org/packages/release/bioc/vignettes/MungeSumstats/inst/doc/MungeSumstats.html
library(MungeSumstats)
eduAttainOkbayPth <- system.file("extdata", "eduAttainOkbay.txt", package = "MungeSumstats") formatted_path <- "/home/hao/Downloads/eduAttainOkbay_standardised.tsv.gz"
1. Read in the data and standardise header names
dat <- MungeSumstats::read_sumstats(path = eduAttainOkbayPth, standardise_headers = TRUE)
2. Write to disk as a compressed, tab-delimited, tabix-indexed file
formatted_path <- MungeSumstats::write_sumstats(sumstats_dt = dat, save_path = formatted_path, tabix_index = TRUE, write_vcf = FALSE,
return_path = TRUE)
Output is here:
Reading header. Tabular format detected. Importing tabular file: /home/hao/R/x86_64-pc-linux-gnu-library/4.1/MungeSumstats/extdata/eduAttainOkbay.txt Standardising column headers. First line of summary statistics file: MarkerName CHR POS A1 A2 EAF Beta SE Pval
Writing in tabular format ==> /home/hao/Downloads/eduAttainOkbay_standardised.tsv.gz Converting full summary stats file to tabix format for fast querying... Reading header. Ensuring file is bgzipped. Tabix-indexing file. [ti_index_core] the file out of order at line 5 Create tabix index failed for [ /home/hao/Downloads/eduAttainOkbay_standardised.tsv.bgz ]!
Do we have a solution for this? Because I need to convert my tab-delimiter GWAS summary into the tabix-index file
My Bioconductor version (Bioconductor version 3.14 (BiocManager 1.30.21), R 4.1.2 (2021-11-01))
Hey @anbai106,
MSS version 1.2 is now very old and there has been a lot of changes between that and the current release version 1.6.0.
Could you please try updating to the current release and see if this resolves your issue?
Cheers, Alan.
@Al-Murphy
Thanks for your reply. Somehow, I followed your tutorial to install the package, but the version that I got is 1.2.4:
I don't know if this is because of my R and Bioconductor versions, but it seems ok
Yep it's because of your R version, you need to install at least R v4.3.0
@Al-Murphy
Thanks! I will try this wonderful package later because I don't want to upgrade my R version for now.
A side point: R and Rstudio should allow users to easily switch different versions of R on Linux machines, like Python/Pycharms, so that we do not need to worry much about incompatibility across different R packages due to version issues.
Best regards
Hi @anbai106, while it does take a couple extra steps, you can have multiple versions of R/Rstudio installed on your computer at once by making containers: https://neurogenomics.github.io/MungeSumstats/articles/docker.html
I tried munging my summary statistics using fullSS_path <- MungeSumstats::format_sumstats(path = fullSS_path, ref_genome = "GRCH37",tabix_index=TRUE). It took several days to finish. Here is what I saw in R Console: Tabix-indexing file. [ti_index_core] the file out of order at line 1253647 Create tabix index failed for [ /var/folders/g9/f5z6vsbx7q3_wmk06jz0j9zm0000gn/T//RtmpUeD8qB/filec8a3e5725dc.tsv.bgz ]! Summary statistics report:
[.data.table
(sumstats_dt, ,:=
((names(empty_cols)), NULL)) : Column 'NA' does not exist to removeI'm wondering how to successfully creat tabix index. Thanks!