Closed photoonh closed 4 years ago
Hi Doogie,
I see. It does look like either the vcf
format has been changed, or the readVcf
from the VariantAnnotation
package has been updated somehow.
Let me explain what the vcf
files are used for and hopefully that will help you get on the right track.
If you look at a set of snps that come with the HoneyBADGER package, you'll see that it is a GRanges object with a number of columns:
load(system.file("ExAC", "ExAC_chr1.RData", package = "HoneyBADGER"))
print(head(snps))
GRanges object with 6 ranges and 5 metadata columns:
seqnames ranges strand | paramRangeID REF ALT
<Rle> <IRanges> <Rle> | <factor> <DNAStringSet> <DNAStringSetList>
1:17365_C/G 1 [17365, 17365] * | <NA> C G
1:17385_G/A 1 [17385, 17385] * | <NA> G A
1:69270_A/G 1 [69270, 69270] * | <NA> A G
rs75062661 1 [69511, 69511] * | <NA> A G
1:69761_A/T 1 [69761, 69761] * | <NA> A T
1:69897_T/C 1 [69897, 69897] * | <NA> T C
QUAL FILTER
<numeric> <character>
1:17365_C/G 826621.1 InbreedingCoeff_Filter
1:17385_G/A 592354.5 InbreedingCoeff_Filter
1:69270_A/G 1758695.3 PASS
rs75062661 120729371.2 PASS
1:69761_A/T 2041395.6 PASS
1:69897_T/C 1171733.0 PASS
This snps
object is used by the getCoverage
and getAlleleCount
functions to create the allele count matrices for the total coverage and the reference allele count at each SNP site included in the snps
object respectively.
Specifically, both the getCoverage
and getAlleleCount
functions looks for the seqnames
column, which specified the chromosome, and ranges
column, which specifies the SNP genomic location, within your snps
object. The getAlleleCount
function additionally looks for the REF
and ALT
columns to specifically count the reference and alternative alleles of interest.
So to summarize, in order to create the reference allele counts and total coverage count matrices used by HoneyBADGER, you will want to parse the vcf
file to include information regarding the seqname
, range
, REF
allele, and ALT
allele information for each SNP. All the other quality, filtering, etc information is for your reference.
Note, that you will want to restrict to "single nucleotide polymorphisms" e.g. alternations affecting single positions in the genome. So something like "1:248903149_G/GAGTCTAGGCAATCTTCCCAGAATGGAAACCCAATCCACTCTTACTA" which is an alternation from G at chromosome 1 position 248903149 to GAGTCTAGGCAATCTTCCCAGAATGGAAACCCAATCCACTCTTACTA should be removed.
Also, to avoid deleterious mutation, you may want to filter to alternations that are common (>5% minor allele frequency) in your population of interest. I'm actually not sure where this information is in the new vcf
format. You will want to inspect your vcf
file to see if this information is still present and just missed by the readVcf
function somehow.
Hope that helps, Jean
Hi Jean,
Thank you for detailed directions. Sorry for the messy result. I went through more steps;
info <- info(vcf) maf <- info[, 'AF'] maft <- 0.1 vi <- sapply(maf, function(x) any(x > maft)) snps <- snps[vi,] vi <- width(snps@listData$REF) == 1 snps <- snps[vi,]
This gives: DataFrame with 7711 rows and 5 columns paramRangeID REF ALT QUAL FILTER
Hi Jean,
I went through further steps to generate the number of reads to each SNP site using .bam files.
results <- getSnpMats(snps, bamFiles, indexFiles, n.cores = 4)
But, this generates the following error;
Error in rowSums(cov) : 'x' must be numeric In addition: Warning message: In mclapply(seq_along(bamFiles), function(i) { : all scheduled cores encountered errors in user code
I think they changed the readVcf function. The result still has the same 5 metadata columns, but without range information. Do you know how to fix this?
Best, Doogie
Hi Doogie,
Based on the output you shared previously, it looks like the range information can be parsed out from the rownames.
For example, 1:13372_G/C
should have a seqname
of 1
and a range of IRanges(13372, 13372)
with a REF
of G
and ALT
of C
.
One minor caveat that is worth checking is that I'm not sure if the current vcf
positional information is zero-based or one-based.
Rsamtools
, which HoneyBADGER
uses to get pileups of your allele counts is one-based: https://bioconductor.org/packages/devel/bioc/manuals/Rsamtools/man/Rsamtools.pdf
(there is a zeroBased parameter but it is FALSE by default)
So you will actually want provide range information that is one-based.
If you provide zero-based ranges when you should provide one-based ranges, you will end up counting two bases, and your results will look very strange!
Hope that helps!
Best, Jean
I ran into a similar error where GetSnpMats()
was expecting a GRanges
object, but the snps
object generated by following the tutorial was a DataFrame
.
It seems that changing the snps <- rowData(vcf)
step to snps <- rowRanges(vcf)
solved the issue, in line with Jean's theory that the vcf
format changed or the readVcf
function had changed.
I have confirmed @joeymays is correct. The tutorial has been updated to use rowRanges instead of rowData (https://jef.works/HoneyBADGER/Preparing_Data.html) in commit https://github.com/JEFworks/HoneyBADGER/commit/6515034d60ff952be3fa6933867bd0803b89d9f5
Hi Jean,
I want to try honeyBadger with SNPs from GRch38 reference genome. So I downloaded vcf from ExAC and followed steps;
vcfFile <- "legacy-exacv1_downloads-release1-ExAC.r1.sites.vep.vcf.gz" require(GenomicRanges) testRanges <- GRanges(seqnames = "1", IRanges(1, 248956422)) require(VariantAnnotation) param <- ScanVcfParam(which=testRanges) vcf <- readVcf(vcfFile, "hg38", param = param) snps <- rowData(vcf)
The result looks like
DataFrame with 939046 rows and 5 columns paramRangeID REF