Closed: zhana-optima closed this issue 6 months ago
This seems to run okay for me on v1.11.10:
library(dplyr)  # provides the %>% pipe

tmp.frq <- read.table('~/Downloads/tmp.txt', header = TRUE)
# Perform liftover from hg19 to hg38
coords <- tmp.frq %>%
  dplyr::select(CHR, BP, SNP) %>%
  MungeSumstats::liftover(
    ref_genome = "hg19",
    convert_ref_genome = "hg38"
  )
print(unique(tmp.frq$CHR))
[1] 17
print(nrow(tmp.frq))
[1] 5107
print(unique(coords$CHR))
[1] 17
Levels: 17
print(nrow(coords[complete.cases(SNP,CHR,BP),]))
[1] 4693
You lose about 8% of SNPs (414 of 5107), which I'm guessing has more to do with the liftover mappings than with MSS. Did you get more removed? What version of MSS are you using?
Thanks, Alan.
Dear Al-Murphy, thank you very much for your response. We are currently using v1.11.5, but we noted that its default chain file source is 'ensembl'. We do not currently hold a commercial license to use 'ucsc'. We downloaded the chain files separately to test with an alternative liftover tool, and the ensembl file does not lift over that region with that tool either, unlike the ucsc chain file. Is there anything that can be done about that? It seems to be a bigger issue that is not related to the software.
The problem here is that the ensembl chain file is simply missing mappings for 414 SNPs in this region, which is why it returns only 4693 rows rather than the full 5107 you get when using 'ucsc' (see below for my ucsc run).
library(dplyr)  # provides the %>% pipe

tmp.frq <- read.table('~/Downloads/tmp.txt', header = TRUE)
# Perform liftover from hg19 to hg38 using the ucsc chain file
coords <- tmp.frq %>%
  dplyr::select(CHR, BP, SNP) %>%
  MungeSumstats::liftover(
    ref_genome = "hg19",
    convert_ref_genome = "hg38",
    chain_source = "ucsc"
  )
print(unique(tmp.frq$CHR))
[1] 17
print(nrow(tmp.frq))
[1] 5107
print(unique(coords$CHR))
[1] 17
Levels: 17
print(nrow(coords[complete.cases(SNP,CHR,BP),]))
[1] 5107
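To see which SNPs a given chain source drops, one option is to compare the SNP IDs before and after liftover. The sketch below uses small made-up data frames (`input_snps` and `lifted_snps` are hypothetical stand-ins for tmp.frq and coords) rather than the attached data:

```r
# Hypothetical stand-ins for tmp.frq (input) and coords (liftover output);
# the IDs and positions are invented for illustration only.
input_snps <- data.frame(
  SNP = c("rs1", "rs2", "rs3", "rs4", "rs5"),
  CHR = 17,
  BP  = c(100, 200, 300, 400, 500)
)
lifted_snps <- data.frame(
  SNP = c("rs3", "rs4", "rs5"),
  CHR = 17,
  BP  = c(1300, 1400, 1500)
)

# SNPs present in the input but absent from the lifted output
dropped <- setdiff(input_snps$SNP, lifted_snps$SNP)

# Fraction of input SNPs lost during liftover
frac_lost <- length(dropped) / nrow(input_snps)

print(dropped)
print(frac_lost)
```

With the numbers reported in this thread, the same calculation gives 414 / 5107, roughly 8% of SNPs lost with the ensembl chain file.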
This is not an issue with MungeSumstats; I confirmed this by running the liftover function line by line. The offending code is:
a <- rtracklayer::liftOver(x = gr,chain = chain)
which maps the data in a genomic ranges object gr to the chain file chain. This function works perfectly with the ucsc chain file but returns a zero-length element for the first 414 SNPs with the ensembl chain file:
> a[0]
GRangesList object of length 0:
<0 elements>
> a[1000]
GRangesList object of length 1:
[[1]]
GRanges object with 1 range and 1 metadata column:
seqnames ranges strand | SNP
<Rle> <IRanges> <Rle> | <character>
[1] 17 39055759 * | 17_37212012_C_T
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
> len_a <- unlist(lapply(a,length))
> len_a
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[81] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[161] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[241] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[321] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[401] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[481] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[561] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[641] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[721] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[801] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[881] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[961] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[ reached getOption("max.print") -- omitted 4107 entries ]
> sum(len_a)
[1] 4693
> length(len_a)
[1] 5107
> length(len_a)-sum(len_a)
[1] 414
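The per-SNP counts above can also be summarised without printing the whole vector. A minimal sketch, using a plain list as a hypothetical stand-in for the GRangesList a (base R's lengths() works on ordinary lists; whether the same call behaves identically on the real object is an assumption here):

```r
# Hypothetical stand-in for the liftOver() result: each element holds the
# mapped position(s) for one SNP, and an empty element means "no mapping".
a_toy <- list(integer(0), integer(0), integer(0), 39055759L, 39055800L)

# Number of mapped ranges per SNP (same idea as unlist(lapply(a_toy, length)))
len_toy <- lengths(a_toy)

# Indices of SNPs with no mapping, and how many there are
unmapped_idx <- which(len_toy == 0)
n_unmapped <- length(unmapped_idx)

# Run-length encoding shows whether the unmapped SNPs form one contiguous block
runs <- rle(len_toy == 0)

print(n_unmapped)
print(range(unmapped_idx))
print(runs)
```

A single leading run of unmapped elements, as in the output above, points to one contiguous region missing from the chain file rather than scattered failures.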
The best approach is to raise this with someone on the ensembl team, rather than with MSS/rtracklayer, to find out why these regions aren't mapped. I'm closing this issue for now, but if you find that there are mappings in the raw chain file that are not appearing with rtracklayer::liftOver, reopen and/or tag me in the issue with rtracklayer and we can take it from there.
Alan.
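For anyone wanting to check the raw chain file directly, here is a rough sketch of one way to do it in base R. It assumes the standard UCSC chain text format, in which (as I understand it) the "t" fields of a liftover chain describe the source assembly. The function names parse_chain_blocks and is_covered are hypothetical helpers, and the toy chain below is invented; a real chain file would first be read in with readLines().

```r
# Toy chain in UCSC chain format; all numbers are invented for illustration.
# Header: chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id
# Data lines: "size dt dq" (aligned block, then gaps); final line: "size"
chain_lines <- c(
  "chain 1000 17 5000 + 100 400 17 6000 + 1100 1450 1",
  "100 50 100",
  "150",
  ""
)

# Collect the aligned blocks on the source (t) assembly as 0-based,
# half-open intervals [start, end).
parse_chain_blocks <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]
  chr <- character(0)
  start <- numeric(0)
  end <- numeric(0)
  t_name <- NA_character_
  t_pos <- NA_real_
  for (ln in lines) {
    f <- strsplit(ln, "[ \t]+")[[1]]
    if (f[1] == "chain") {
      # New chain header: remember the source chromosome and start offset
      t_name <- f[3]
      t_pos <- as.numeric(f[6])
    } else {
      # Alignment line: record the aligned block, then skip the source gap
      size <- as.numeric(f[1])
      chr <- c(chr, t_name)
      start <- c(start, t_pos)
      end <- c(end, t_pos + size)
      dt <- if (length(f) >= 2) as.numeric(f[2]) else 0
      t_pos <- t_pos + size + dt
    }
  }
  data.frame(chr = chr, start = start, end = end, stringsAsFactors = FALSE)
}

# TRUE if a 1-based position falls inside any aligned block
is_covered <- function(blocks, chr, bp) {
  any(blocks$chr == chr & blocks$start < bp & bp <= blocks$end)
}

blocks <- parse_chain_blocks(chain_lines)
print(blocks)
print(is_covered(blocks, "17", 150))
print(is_covered(blocks, "17", 225))
```

If positions that rtracklayer::liftOver fails to map turn out to be covered by an aligned block in the raw file, that would suggest a problem on the rtracklayer side; if they are not covered, the gap is in the chain file itself.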
1. Bug description
I am running liftover over a series of LD blocks. It works fine for all chromosomes until chromosome 17: once I reach one LD block on chromosome 17, it no longer outputs any coordinates. It was working fine on the same block previously. The coords are just empty.
2. Reproducible example
Code
Data
Stored in tmp.txt - attached.