Ennazhou opened 1 week ago
Hi @Ennazhou,
I am actually working on adding this feature right now, which will become available in the next 1-2 days via a new release of the software. You will be able to tune a few parameters regarding how to define the groups. However, please give us 1-2 weeks so that we can test the new pipeline more extensively. I will ping you here when I feel it is well tested. Thank you for your patience!
Hi @Ennazhou
I just released a new version of the software, which will let you generate the corresponding LD files from individual-level data; see the docs for details. We tested it on ~1200 admixed American samples genome-wide, and the app seems to work okay. However, note that you will be one of the first beta testers. I'll be happy to provide assistance if you run into any problems or have questions. Good luck!
Hi @biona001
Thank you for releasing the new version!
I am having trouble determining the start_bp and end_bp parameters. For the SNP correlation, the genotype data needs to be in the "FBM.code256" class. The "bigstatsr" package has a function called FBM.code256() that can create a Filebacked Big Matrix, but I am unsure how to specify its arguments. Could you please provide some guidance on how to use FBM.code256() correctly? Any related materials or resources on this topic would also be appreciated.
Hi @Ennazhou,
Although I suggested using bigsnpr to compute approximately independent regions, I have actually not used it myself. What is the ancestry background of your samples? For simplicity, maybe you can first try directly using one of ldetect's precomputed regions. For example, I used the EUR regions for testing my admixed American samples; EUR regions probably do not adapt well to AMR samples, but it was a quick-and-dirty test to get things going. If performance is good, you can then try to improve by computing better start_bps and end_bps.
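An untested sketch of what happens downstream, in case it helps: once you have a region's start_bp and end_bp (from ldetect or elsewhere), the LD matrix for that region is just the correlation of the genotype columns whose base-pair positions fall inside it. All positions, genotypes, and region bounds below are made up for illustration.

```julia
# hypothetical sketch: subset SNPs to one region and compute its LD matrix
using LinearAlgebra, Random, Statistics
Random.seed!(2024)
pos = [150_000, 420_000, 980_000, 1_350_000, 2_600_000]  # made-up bp positions
X = Float64.(rand(0:2, 100, 5))                          # made-up 0/1/2 genotypes
start_bp, end_bp = 400_000, 1_500_000                    # one made-up region
idx = findall(p -> start_bp ≤ p ≤ end_bp, pos)           # SNPs inside the region
R = cor(X[:, idx])                                       # LD (correlation) matrix
@show idx size(R)
```

ldetect's precomputed files give you one (start_bp, end_bp) pair per approximately independent block, so in practice you would loop this over all blocks on a chromosome.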
Thank you for your help! I saw that the solveblock function requires a VCF file as input, but we simulated a genotype matrix in R containing only 0, 1, 2. Can we still use the solveblock command to generate the LD files? Thank you!
Currently solveblock only works on VCF files, since we need the chr/pos/ref/alt for each variant. You can still use solveblock with fully synthetic genotypes, but then for each variant you will have to make up arbitrary values for chr/pos/ref/alt and save the result into a VCF file.
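For instance, a minimal sketch of that workaround, where the chromosome, positions, IDs, and alleles are all arbitrary made-up values (this writes the VCF by hand rather than through any package):

```julia
# hypothetical sketch: dump a simulated 0/1/2 genotype matrix to a minimal VCF,
# inventing chr/pos/ref/alt for each variant
X = [0 1; 2 0; 1 1]                     # made-up n-by-p genotype matrix
n, p = size(X)
open("synthetic.vcf", "w") do io
    println(io, "##fileformat=VCFv4.2")
    println(io, "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t",
            join(("sample$i" for i in 1:n), '\t'))
    for j in 1:p
        # arbitrary chr/pos/ref/alt; unphased genotypes from the 0/1/2 dosage
        gts = (("0/0", "0/1", "1/1")[X[i, j] + 1] for i in 1:n)
        println(io, "1\t", 1000j, "\tsnp$j\tA\tG\t.\t.\t.\tGT\t", join(gts, '\t'))
    end
end
```

The resulting file can then be given to solveblock like any other VCF.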
Thus, I think it is easier (and more realistic) to start with real genotypes and simulate phenotypes; this ensures you have realistic LD structure. If you do not have any real data, you can use what is publicly available, e.g. the 1000 Genomes data. For example, the following simulation was used to test solveblock:
# in Julia, load packages
using CSV, DataFrames, Distributions, Random, StatsBase, VCFTools
# ghost knockoff executable
ghostknockoffgwas = "/home/groups/sabatti/.julia/dev/GhostKnockoffGWAS/app_linux_x86/bin/GhostKnockoffGWAS"
# some helper functions
function pval2zscore(pvals::AbstractVector{T}, beta::AbstractVector{T}) where T
    length(pvals) == length(beta) ||
        error("pval2zscore: pvals and beta should have the same length")
    return p2z.(pvals, beta)
end
# two-sided p-value and effect direction -> Z score; named p2z to avoid
# shadowing StatsBase.zscore (note the minus sign: quantile(Normal(), p/2)
# is negative for p < 1)
p2z(p::T, beta::T) where T = -sign(beta) * quantile(Normal(), p/2)
pval(z::T) where T = 2ccdf(Normal(), abs(z))
# some unrealistic simulation parameters I made up just to check things work
k = 50
mu = 100
sigma = 10.0 # beta ~ N(mu, sigma)
# import VCF and keep SNPs with ALT allele frequency between 0.1 and 0.9 (i.e. MAF >= 0.1)
vcffile = "/oak/stanford/groups/zihuai/paisa/VCF/chr1.vcf.gz"
X, sampleID, chr, pos, rsid, ref, alt =
convert_gt(Float64, vcffile, impute=true, center=false, scale=false,
save_snp_info=true)
mafs = mean.(skipmissing.(eachcol(X))) ./ 2
idx = findall(x -> 0.1 < x < 0.9, mafs)
X = X[:, idx]
chr, pos, rsid, ref, alt = chr[idx], pos[idx], vcat(rsid...)[idx],
ref[idx], vcat(alt...)[idx]
n, p = size(X)
@info "Detected $n samples and $p SNPs"
# simulate phenotype and normalize it
beta = zeros(p)
beta[1:k] .= rand(Normal(mu, sigma), k)
shuffle!(beta)
y = X * beta + randn(n)
zscore!(y, mean(y), std(y))
# marginal association test: Z scores and associated p-values
z = X'*y ./ sqrt(n)
pvals = pval.(z)
# run GhostKnockoffGWAS on output of `solveblock`
zfile = "/oak/stanford/groups/zihuai/paisa/VCF/zfile.txt"
LD_files = "/oak/stanford/groups/zihuai/paisa/LD_files"
outfile = "/oak/stanford/groups/zihuai/paisa/VCF/GK_out"
CSV.write(zfile, DataFrame("CHR"=>chr,"POS"=>pos,"REF"=>ref,"ALT"=>alt,"Z"=>z))
run(`$ghostknockoffgwas --zfile $zfile --LD-files $LD_files --N $n --genome-build 19 --out $outfile`)
# compare power
causal_snps = rsid[findall(!iszero, beta)]
marginal_discover = rsid[findall(x -> x < 0.05/length(beta), pvals)]
marginal_power = length(marginal_discover ∩ causal_snps) / length(causal_snps)
marginal_FP = length(setdiff(marginal_discover, causal_snps))
GK_df = CSV.read(outfile * ".txt", DataFrame)
GK_discover = GK_df[findall(isone, GK_df[!, "selected_fdr0.1"]), "rsid"]
GK_power = length(GK_discover ∩ causal_snps) / length(causal_snps)
GK_FP = length(setdiff(GK_discover, causal_snps))
println("\n\n marginal_power = $marginal_power, marginal false positives = $marginal_FP")
println("GK_power = $GK_power, GK false positives = $GK_FP \n\n");
This gave the following output:
[ Info: Detected 1197 samples and 19479 SNPs
marginal_power = 0.06, marginal false positives = 2
GK_power = 0.22, GK false positives = 0
Although this simulation was done in Julia, I'm sure you can do something very similar in R.
Hello Benjamin!
We are interested in using your method in our study, but we have found that only 20% of the SNPs overlap between our data and the reference information you provided. We have the raw genotype data available.
We want to ensure that we can properly apply your method to our dataset. Any guidance you can provide would be greatly appreciated.
Thank you for your time!