Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
Error in ids2rowids(ids) : 'ids' cannot contain NAs #28

Closed bschilder closed 3 years ago

bschilder commented 3 years ago

Encountered when stepping through format_sumstats on a VCF file, specifically at the get_genome_build step.

Data source: https://gwas.mrcieu.ac.uk/files/ieu-a-1124/ieu-a-1124.vcf.gz

[Screenshot: 2021-07-13 at 16:23:43]

bschilder commented 3 years ago

Potential solution: swap the order of these functions?

    #### Infer reference genome if necessary ####
    if(is.null(ref_genome))
      ref_genome <- get_genome_build(sumstats = sumstats_return$sumstats_dt)

    #### Check 5: Check for uniformity in SNP col - no mix of rs/missing rs/chr:bp ####
    sumstats_return <-
      check_no_rs_snp(sumstats_dt = sumstats_return$sumstats_dt,
                      path = path,
                      ref_genome = ref_genome)

Update:

I guess that doesn't really make sense, because the latter (check_no_rs_snp) requires ref_genome. So perhaps do some filtering within get_genome_build instead.

bschilder commented 3 years ago

Fixed

Adding a filtering step to get_genome_build seems to work. I also added downsampling to speed up the function substantially.

sampled_snps <- 10000
...
  #### Do some filtering first to avoid errors ####
  # Drop rows with a missing SNP ID (data.table syntax)
  sumstats <- sumstats[complete.cases(SNP)]

  #### Downsample SNPs to save time ####
  # Test is.null() first so that a NULL sampled_snps short-circuits the comparison
  if(!is.null(sampled_snps) && nrow(sumstats) > sampled_snps){
    snps <- sample(sumstats$SNP, sampled_snps)
  } else {
    snps <- sumstats$SNP
  }

  sumstats <- sumstats[SNP %in% snps,]
...
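
For anyone reading later, here is a minimal base-R sketch of the same idea (filter out missing SNP IDs, then optionally downsample). It works on a plain character vector rather than the data.table used in the package, and `downsample_snps` is a hypothetical helper name for illustration, not a MungeSumstats function.

```r
# Hypothetical helper (not part of MungeSumstats): drop missing SNP IDs,
# then optionally downsample before any reference-genome lookup.
downsample_snps <- function(snps, sampled_snps = 10000) {
  # Remove NA IDs first; NA IDs are what triggered the reported error.
  snps <- snps[!is.na(snps)]
  # Check for NULL before comparing, so sampled_snps = NULL means "keep all".
  if (!is.null(sampled_snps) && length(snps) > sampled_snps) {
    snps <- sample(snps, sampled_snps)
  }
  snps
}

# Toy input with missing IDs (made-up values for illustration only)
ids <- c("rs1", NA, "rs2", "rs3", NA, "rs4")
downsample_snps(ids, sampled_snps = 2)     # two randomly chosen non-NA IDs
downsample_snps(ids, sampled_snps = NULL)  # all non-NA IDs, no sampling
```

Filtering before sampling matters here: sampling first could still draw NA IDs, and the sample size would be counted against rows that are later dropped.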