JEFworks-Lab / HoneyBADGER

HMM-integrated Bayesian approach for detecting CNV and LOH events from single-cell RNA-seq data
http://jef.works/HoneyBADGER/
GNU General Public License v3.0
95 stars 31 forks source link

vcf files for heterozygous SNPs #21

Closed photoonh closed 4 years ago

photoonh commented 5 years ago

Hi Jean,

I want to try honeyBadger with SNPs from GRch38 reference genome. So I downloaded vcf from ExAC and followed steps;

vcfFile <- "legacy-exacv1_downloads-release1-ExAC.r1.sites.vep.vcf.gz" require(GenomicRanges) testRanges <- GRanges(seqnames = "1", IRanges(1, 248956422)) require(VariantAnnotation) param <- ScanVcfParam(which=testRanges) vcf <- readVcf(vcfFile, "hg38", param = param) snps <- rowData(vcf)

The result looks like

DataFrame with 939046 rows and 5 columns paramRangeID REF

1:13372_G/C NA G 1:13380_C/G NA C 1:13382_C/G NA C 1:13402_G/C NA G 1:13404_G/A NA G ... ... ... 1:248903130_C/T NA C 1:248903149_G/GAGTCTAGGCAATCTTCCCAGAATGGAAACCCAATCCACTCTTACTA NA G 1:248903165_G/A NA G rs73148524 NA C 1:248903190_C/T NA C ALT QUAL 1:13372_G/C C 608.91 1:13380_C/G G 7829.15 1:13382_C/G G 320.4 1:13402_G/C C 89.66 1:13404_G/A A,T 136.58 ... ... ... 1:248903130_C/T T 3402.57 1:248903149_G/GAGTCTAGGCAATCTTCCCAGAATGGAAACCCAATCCACTCTTACTA GAGTCTAGGCAATCTTCCCAGAATGGAAACCCAATCCACTCTTACTA 744.88 1:248903165_G/A A 11336.29 rs73148524 T 1602562.81 1:248903190_C/T T 592.42 FILTER 1:13372_G/C PASS 1:13380_C/G VQSRTrancheSNP99.60to99.80 1:13382_C/G VQSRTrancheSNP99.60to99.80 1:13402_G/C VQSRTrancheSNP99.60to99.80 1:13404_G/A VQSRTrancheSNP99.60to99.80 ... ... 1:248903130_C/T PASS 1:248903149_G/GAGTCTAGGCAATCTTCCCAGAATGGAAACCCAATCCACTCTTACTA VQSRTrancheINDEL99.00to99.50 1:248903165_G/A PASS rs73148524 PASS 1:248903190_C/T PASS This format is different from what you provid in honeyBadger package. I also tried other vcf files from https://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/#table-1 Those results are even more different from yours. Am I on right track? Could you tell me which vcf files I need to use for most updated SNPs from hg38? Best, Doogie
JEFworks commented 5 years ago

Hi Doogie,

I see. It does look like either the vcf format has been changed, or the readVcf from the VariantAnnotation package has been updated somehow.

Let me explain what the vcf files are used for and hopefully that will help you get on the right track.

If you look at a set of snps that come with the HoneyBADGER package, you'll see that it is a GRanges object with a number of columns:

load(system.file("ExAC", "ExAC_chr1.RData", package = "HoneyBADGER"))
print(head(snps))
GRanges object with 6 ranges and 5 metadata columns:
              seqnames         ranges strand | paramRangeID            REF                ALT
                 <Rle>      <IRanges>  <Rle> |     <factor> <DNAStringSet> <DNAStringSetList>
  1:17365_C/G        1 [17365, 17365]      * |         <NA>              C                  G
  1:17385_G/A        1 [17385, 17385]      * |         <NA>              G                  A
  1:69270_A/G        1 [69270, 69270]      * |         <NA>              A                  G
   rs75062661        1 [69511, 69511]      * |         <NA>              A                  G
  1:69761_A/T        1 [69761, 69761]      * |         <NA>              A                  T
  1:69897_T/C        1 [69897, 69897]      * |         <NA>              T                  C
                     QUAL                 FILTER
                <numeric>            <character>
  1:17365_C/G    826621.1 InbreedingCoeff_Filter
  1:17385_G/A    592354.5 InbreedingCoeff_Filter
  1:69270_A/G   1758695.3                   PASS
   rs75062661 120729371.2                   PASS
  1:69761_A/T   2041395.6                   PASS
  1:69897_T/C   1171733.0                   PASS

This snps object is used by the getCoverage and getAlleleCount functions to create the allele count matrices for the total coverage and the reference allele count at each SNP site included in the snps object respectively.

Specifically, both the getCoverage and getAlleleCount functions looks for the seqnames column, which specified the chromosome, and ranges column, which specifies the SNP genomic location, within your snps object. The getAlleleCount function additionally looks for the REF and ALT columns to specifically count the reference and alternative alleles of interest.

So to summarize, in order to create the reference allele counts and total coverage count matrices used by HoneyBADGER, you will want to parse the vcf file to include information regarding the seqname, range, REF allele, and ALT allele information for each SNP. All the other quality, filtering, etc information is for your reference.

Note, that you will want to restrict to "single nucleotide polymorphisms" e.g. alternations affecting single positions in the genome. So something like "1:248903149_G/GAGTCTAGGCAATCTTCCCAGAATGGAAACCCAATCCACTCTTACTA" which is an alternation from G at chromosome 1 position 248903149 to GAGTCTAGGCAATCTTCCCAGAATGGAAACCCAATCCACTCTTACTA should be removed.

Also, to avoid deleterious mutation, you may want to filter to alternations that are common (>5% minor allele frequency) in your population of interest. I'm actually not sure where this information is in the new vcf format. You will want to inspect your vcf file to see if this information is still present and just missed by the readVcf function somehow.

Hope that helps, Jean

photoonh commented 5 years ago

Hi Jean,

Thank you for detailed directions. Sorry for the messy result. I went through more steps;

info <- info(vcf) maf <- info[, 'AF'] maft <- 0.1 vi <- sapply(maf, function(x) any(x > maft)) snps <- snps[vi,] vi <- width(snps@listData$REF) == 1 snps <- snps[vi,]

This gives: DataFrame with 7711 rows and 5 columns paramRangeID REF ALT QUAL FILTER

1:17365_C/G NA C G 826621.11 InbreedingCoeff_Filter 1:17385_G/A NA G A 592354.47 InbreedingCoeff_Filter 1:69270_A/G NA A G 1758695.33 PASS rs75062661 NA A G 120729371.2 PASS 1:69761_A/T NA A T 2041395.62 PASS ... ... ... ... ... ... rs41311583 NA G T 22502718.8 PASS rs11485825 NA G A 195525472.02 PASS rs4509608 NA C T 216242907.57 PASS rs4575113 NA T C 252054961.28 PASS rs4462184 NA A G 231590811.93 PASS I will dig into this and see what is going to happen. Best, Doogie
photoonh commented 5 years ago

Hi Jean,

I went through further steps to generate the number of reads to each SNP site using .bam files.

results <- getSnpMats(snps, bamFiles, indexFiles, n.cores = 4)

But, this generates the following error;

Error in rowSums(cov) : 'x' must be numeric In addition: Warning message: In mclapply(seq_along(bamFiles), function(i) { : all scheduled cores encountered errors in user code

I think they changed the readVcf function. The result still has the same 5 metadata columns, but without range information. Do you know how to fix this?

Best, Doogie

JEFworks commented 5 years ago

Hi Doogie,

Based on the output you shared previously, it looks like the range information can be parsed out from the rownames.

For example, 1:13372_G/C should have a seqname of 1 and a range of IRanges(13372, 13372) with a REF of G and ALT of C.

One minor caveat that is worth checking is that I'm not sure if the current vcf positional information is zero-based or one-based. image

Rsamtools, which HoneyBADGER uses to get pileups of your allele counts is one-based: https://bioconductor.org/packages/devel/bioc/manuals/Rsamtools/man/Rsamtools.pdf (there is a zeroBased parameter but it is FALSE by default)

So you will actually want provide range information that is one-based.

If you provide zero-based ranges when you should provide one-based ranges, you will end up counting two bases, and your results will look very strange!

Hope that helps!

Best, Jean

joeymays commented 5 years ago

I ran into a similar error where GetSnpMats() was expecting a GRanges object, but the snps object generated by following the tutorial was a DataFrame.

It seems that changing the snps <- rowData(vcf) step to snps <- rowRanges(vcf) solved the issue, in line with Jean's theory that the vcf format changed or the readVcf function had changed.

JEFworks commented 4 years ago

I have confirmed @joeymays is correct. The tutorial has been updated to use rowRanges instead of rowData (https://jef.works/HoneyBADGER/Preparing_Data.html) in commit https://github.com/JEFworks/HoneyBADGER/commit/6515034d60ff952be3fa6933867bd0803b89d9f5