Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
75 stars 16 forks source link

liftover not working for certain region #185

Closed zhana-optima closed 5 months ago

zhana-optima commented 6 months ago

1. Bug description

I am running liftover over a series of LD blocks. It works fine for all chromosomes until chromosome 17, and then after I reach one LD block in chromosome 17, it does not work anymore. It just does not output any coordinates anymore. It was working fine on the same block previously. The coords are just empty

2. Reproducible example

Code

  tmp.frq = read.table('tmp.txt
[tmp.txt](https://github.com/neurogenomics/MungeSumstats/files/15144054/tmp.txt)
')      
  coords = tmp.frq %>%
              dplyr::select(CHR, BP, SNP) %>%
              MungeSumstats::liftover(
                ref_genome = "hg19",
                convert_ref_genome = "hg38"
              ) %>%
              shhh() # Perform liftover from hg19 to hg38

Data

Stored in tmp.txt - attached.

Al-Murphy commented 6 months ago

This seems to run okay for me on v1.11.10:

tmp.frq = read.table('~/Downloads/tmp.txt',header = 1)      
coords = tmp.frq %>%
  dplyr::select(CHR, BP, SNP) %>%
  MungeSumstats::liftover(
    ref_genome = "hg19",
    convert_ref_genome = "hg38"
  )# Perform liftover from hg19 to hg38

print(unique(tmp.frq$CHR))
[1] 17
print(nrow(tmp.frq))
[1] 5107
print(unique(coords$CHR))
[1] 17
Levels: 17
print(nrow(coords[complete.cases(SNP,CHR,BP),]))
[1] 4693

You lose 10% of SNPs which I'm guessing is more to do with liftover mappings than MSS, did you get more removed? What version of MSS are you using?

Thanks, Alan.

zhana-optima commented 6 months ago

Dear Al-Murphy, Thank you very much for your response. We are currently using v1.11.5 but we noted the default chain file in it is 'ensembl'. We currently do not own a commercial license to use 'ucsc'. We downloaded the chain files separately to test using an alternative liftover tool, and it appears that the ensembl file does not liftover that region with that other tool either, unlike the ucsc chain file. Is there something that can be done with regards to that? Seems it is a bigger issue that is not software related.

Al-Murphy commented 5 months ago

The problem here is that the ensembl chain file is just missing mappings for 414 SNPs in this region, hence why it only returns 4693 rows versus the full 5107 which you get when using 'ucsc' (see below for my ucsc run).

tmp.frq = read.table('~/Downloads/tmp.txt',header = 1)      
coords = tmp.frq %>%
  dplyr::select(CHR, BP, SNP) %>%
  MungeSumstats::liftover(
    ref_genome = "hg19",
    convert_ref_genome = "hg38",
    chain_source = "ucsc"
  )# Perform liftover from hg19 to hg38

print(unique(tmp.frq$CHR))
[1] 17
print(nrow(tmp.frq))
[1] 5107
print(unique(coords$CHR))
[1] 17
Levels: 17
print(nrow(coords[complete.cases(SNP,CHR,BP),]))
[1] 5107

This is not an issue with MungeSumstats, I confirmed this by running this line-by-line in the liftover function. The offending code is:

a <- rtracklayer::liftOver(x = gr,chain = chain)

which maps the data in a genomic ranges object gr to the chain file chain. This function works perfectly with the ucsc chain file but returns a zero element for the first 414 SNPs of the ensembl:

> a[0]
GRangesList object of length 0:
<0 elements>
> a[1000]
GRangesList object of length 1:
[[1]]
GRanges object with 1 range and 1 metadata column:
      seqnames    ranges strand |             SNP
         <Rle> <IRanges>  <Rle> |     <character>
  [1]       17  39055759      * | 17_37212012_C_T
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

> len_a <- unlist(lapply(a,length))
> len_a
   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  [81] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [161] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [241] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [321] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [401] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [481] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [561] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [641] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [721] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [801] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [881] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [961] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [ reached getOption("max.print") -- omitted 4107 entries ]
> sum(len_a)
[1] 4693
> length(len_a)
[1] 5107
> length(len_a)-sum(len_a)
[1] 414

Best approach for you is to raise this with someone on the ensembl team as to why these regions aren't mapped rather than MSS/rtracklayer. Closing this issue for now but if you find out that there are mappings in the raw chain file that are not appearing with rtracklayer::liftOver, reopen and/or tag me in the issue with rtracklayer and we can take it from there.

Alan.