Ennazhou opened 1 week ago
Hi @Ennazhou,
I am actually working on adding this feature right now, which will become available in the next 1-2 days via a new release of the software. You will be able to tune a few parameters regarding how to define the groups. However, please give us 1-2 weeks so that we can test the new pipeline more extensively. I will ping you here when I feel it is well tested. Thank you for your patience!
Hi @Ennazhou
I just released a new version of the software, which will let you generate the corresponding LD files from individual-level data; see the docs for details. We tested it on ~1200 admixed American samples genome-wide, and the app seems to work okay. However, note that you will be one of the first beta testers. I'll be happy to provide assistance if you run into any problems or have questions. Good luck!
Hi @biona001
Thank you for releasing the new version!
I am having trouble determining the start_bp and end_bp parameters. For the SNP correlation, the genotype data needs to be in the "FBM.code256" class. The "bigstatsr" package has a function called FBM.code256() that can create a Filebacked Big Matrix, but I am unsure how to specify its arguments. Could you please provide some guidance on how to use FBM.code256() correctly? Any related materials or resources on this topic would also be appreciated.
Hi @Ennazhou,
Although I suggested using bigsnpr to compute approximately independent regions, I have actually not used it myself. What is the ancestry background of your samples? For simplicity, maybe you can first try directly using one of ldetect's precomputed regions. For example, I used the EUR regions for testing my admixed American samples; EUR regions probably do not adapt well to AMR samples, but it was a quick-and-dirty test to get things going. If performance is good, you can then try to improve by computing better start_bps and end_bps.
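An untested sketch of what happens downstream, in case it helps: once you have a region's start_bp and end_bp (from ldetect or elsewhere), the LD matrix for that region is just the correlation of the genotype columns whose base-pair positions fall inside it. All positions, genotypes, and region bounds below are made up for illustration.

```julia
# hypothetical sketch: subset SNPs to one region and compute its LD matrix
using LinearAlgebra, Random, Statistics
Random.seed!(2024)
pos = [150_000, 420_000, 980_000, 1_350_000, 2_600_000]  # made-up bp positions
X = Float64.(rand(0:2, 100, 5))                          # made-up 0/1/2 genotypes
start_bp, end_bp = 400_000, 1_500_000                    # one made-up region
idx = findall(p -> start_bp ≤ p ≤ end_bp, pos)           # SNPs inside the region
R = cor(X[:, idx])                                       # LD (correlation) matrix
@show idx size(R)
```

ldetect's precomputed files give you one (start_bp, end_bp) pair per approximately independent block, so in practice you would loop this over all blocks on a chromosome.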
Thank you for your help! I saw that the solveblock function requires a VCF file as input, but we simulated a genotype matrix in R containing only 0, 1, 2. Can we still use the solveblock command to generate the LD files? Thank you!
Currently solveblock only works on VCF files, since we need the chr/pos/ref/alt for each variant. You can still use solveblock with fully synthetic genotypes, but then for each variant you will have to make up arbitrary values for chr/pos/ref/alt and save the result into a VCF file.
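For instance, a minimal sketch of that workaround, where the chromosome, positions, IDs, and alleles are all arbitrary made-up values (this writes the VCF by hand rather than through any package):

```julia
# hypothetical sketch: dump a simulated 0/1/2 genotype matrix to a minimal VCF,
# inventing chr/pos/ref/alt for each variant
X = [0 1; 2 0; 1 1]                     # made-up n-by-p genotype matrix
n, p = size(X)
open("synthetic.vcf", "w") do io
    println(io, "##fileformat=VCFv4.2")
    println(io, "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t",
            join(("sample$i" for i in 1:n), '\t'))
    for j in 1:p
        # arbitrary chr/pos/ref/alt; unphased genotypes from the 0/1/2 dosage
        gts = (("0/0", "0/1", "1/1")[X[i, j] + 1] for i in 1:n)
        println(io, "1\t", 1000j, "\tsnp$j\tA\tG\t.\t.\t.\tGT\t", join(gts, '\t'))
    end
end
```

The resulting file can then be given to solveblock like any other VCF.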
Thus, I think it is easier (and more realistic) to start with real genotypes and simulate phenotypes; this ensures you have realistic LD structure. If you do not have any real data, you can use what is publicly available, e.g. the 1000 Genomes data. For example, the following simulation was used to test solveblock:
# in Julia, load packages
using CSV, DataFrames, Distributions, Random, StatsBase, VCFTools
# ghost knockoff executable
ghostknockoffgwas = "/home/groups/sabatti/.julia/dev/GhostKnockoffGWAS/app_linux_x86/bin/GhostKnockoffGWAS"
# some helper functions
function pval2zscore(pvals::AbstractVector{T}, beta::AbstractVector{T}) where T
    length(pvals) == length(beta) ||
        error("pval2zscore: pvals and beta should have the same length")
    return p2z.(pvals, beta)
end
# two-sided p-value and effect direction -> Z score; named p2z to avoid
# shadowing StatsBase.zscore (note the minus sign: quantile(Normal(), p/2)
# is negative for p < 1)
p2z(p::T, beta::T) where T = -sign(beta) * quantile(Normal(), p/2)
pval(z::T) where T = 2ccdf(Normal(), abs(z))
# some unrealistic simulation parameters I made up just to check things work
k = 50
mu = 100
sigma = 10.0 # beta ~ N(mu, sigma)
# import VCF and keep SNPs with ALT allele frequency between 0.1 and 0.9 (i.e. MAF >= 0.1)
vcffile = "/oak/stanford/groups/zihuai/paisa/VCF/chr1.vcf.gz"
X, sampleID, chr, pos, rsid, ref, alt =
convert_gt(Float64, vcffile, impute=true, center=false, scale=false,
save_snp_info=true)
mafs = mean.(skipmissing.(eachcol(X))) ./ 2
idx = findall(x -> 0.1 < x < 0.9, mafs)
X = X[:, idx]
chr, pos, rsid, ref, alt = chr[idx], pos[idx], vcat(rsid...)[idx],
ref[idx], vcat(alt...)[idx]
n, p = size(X)
@info "Detected $n samples and $p SNPs"
# simulate phenotype and normalize it
beta = zeros(p)
beta[1:k] .= rand(Normal(mu, sigma), k)
shuffle!(beta)
y = X * beta + randn(n)
zscore!(y, mean(y), std(y))
# marginal association test: Z scores and associated p-values
z = X'*y ./ sqrt(n)
pvals = pval.(z)
# run GhostKnockoffGWAS on output of `solveblock`
zfile = "/oak/stanford/groups/zihuai/paisa/VCF/zfile.txt"
LD_files = "/oak/stanford/groups/zihuai/paisa/LD_files"
outfile = "/oak/stanford/groups/zihuai/paisa/VCF/GK_out"
CSV.write(zfile, DataFrame("CHR"=>chr,"POS"=>pos,"REF"=>ref,"ALT"=>alt,"Z"=>z))
run(`$ghostknockoffgwas --zfile $zfile --LD-files $LD_files --N $n --genome-build 19 --out $outfile`)
# compare power
causal_snps = rsid[findall(!iszero, beta)]
marginal_discover = rsid[findall(x -> x < 0.05/length(beta), pvals)]
marginal_power = length(marginal_discover ∩ causal_snps) / length(causal_snps)
marginal_FP = length(setdiff(marginal_discover, causal_snps))
GK_df = CSV.read(outfile * ".txt", DataFrame)
GK_discover = GK_df[findall(isone, GK_df[!, "selected_fdr0.1"]), "rsid"]
GK_power = length(GK_discover ∩ causal_snps) / length(causal_snps)
GK_FP = length(setdiff(GK_discover, causal_snps))
println("\n\n marginal_power = $marginal_power, marginal false positives = $marginal_FP")
println("GK_power = $GK_power, GK false positives = $GK_FP \n\n");
This gave the following output:
[ Info: Detected 1197 samples and 19479 SNPs
marginal_power = 0.06, marginal false positives = 2
GK_power = 0.22, GK false positives = 0
Although this simulation was done in Julia, I'm sure you can do something very similar in R.
Hello Benjamin!
We are interested in using your method in our study, but we have found that only 20% of the SNPs overlap between our data and the reference information you provided. We have the raw genotype data available.
We want to ensure that we can properly apply your method to our dataset. Any guidance you can provide would be greatly appreciated.
Thank you for your time!