freeseek / score

Tools to work with GWAS-VCF summary statistics files
MIT License
94 stars 6 forks source link

colheaders.tsv format #12

Open jjfarrell opened 1 month ago

jjfarrell commented 1 month ago

For the munge -C colheaders.tsv option, what is the format of the colheaders.tsv file? The -c options does not include ourput from GENESIS (https://github.com/UW-GAC/GENESIS).

freeseek commented 1 month ago

Did you check the Column Headers Mappings section in the documentation? It shows how to create that file. If it does not have the column headers from GENESIS let me know

jjfarrell commented 1 month ago

Thanks! I found that section. The column headers for GENESIS were not there. Their columns had a lot of small caps. So I made a genesis.colheaders.tsv. It looks like this:

rsid SNP P P n.obs N chr CHR pos BP Est BETA alt A1 ref A2 Est.SE SE eff.frq FRQ Z Z mac AC

I assigned the `mac column (minor allele count) to AC which is probably not ideal. A lot of our users and TOPMed researchers use GENESIS so it would be useful to have a -c genesis option.

jjfarrell commented 1 month ago

@freeseek I ran metal on the 4 summary files (3 with SAIGE output and 1 with this GENESIS output). With just the 3 SAIGE results, I got this output for the APOE variant: chr19 44908684 rs429358 T C . . . NS:NC:ES:SE:LP:AF:AC:NE:I2:CQ:ED 725785:81165:1.04249:0.0155521:977.642:0.176641:223431:268674:98.6529:32.2391:+++

When I added the GENESIS summary data, I ended up with 3 lines for some reason.

chr19 44908684 rs429358;chr19:44908684:T:C T C . . . NS:ES:SE:LP:AF:AC:I2:CQ:ED 729318:1.03617:0.0150746:1027.89:0.178736:224924:98.0159:31.8383:++++ chr19 44908684 chr19:44908684:T:C T C . . . NS:ES:SE:LP:AF:AC:I2:CQ:ED 3533:0.937911:0.0613045:52.1114:0.211294:1493:.:.:???+ chr19 44908684 chr19:44908684:T:C T C . . . NS:ES:SE:LP:AF:AC:I2:CQ:ED 3533:0.937911:0.0613045:52.1114:0.211294:1493:.:.:???+

Any ideas or suggestions on this? The first line looks correct, not sure why the other 2 lines are in there.

Here is another line.

chr19   44909080        chr19:44909080:G:A      G       A       .       .       .       NS:ES:SE:LP:AF:AC:I2:CQ:ED      3540:-0.459873:0.373812:0.660325:0.00423729:30:.:.:???-
chr19   44909080        chr19:44909080:G:A      G       A       .       .       .       NS:ES:SE:LP:AF:AC:I2:CQ:ED      3540:-0.459873:0.373812:0.660325:0.00423729:30:.:.:???-
chr19   44909080        chr19:44909080:G:A      G       A       .       .       .       NS:ES:SE:LP:AF:AC:I2:CQ:ED      3540:-0.459873:0.373812:0.660325:0.00423729:30:.:.:???-

I can clean them out with grep -v '???-'.

Further checking shows that this is also happening with munging the 3 datasets.

freeseek commented 3 weeks ago

Can you give me a reference for the GENESIS bindings and can you give me a way to reproduce the METAL issue?

jjfarrell commented 3 weeks ago

@freeseek I will try go create some smaller gwas_vcfs to try to reproduce the issue with the APOE variant and get back to you. It may have to do with the number of threads being used which I will test out.

For the GENESIS bindings:

https://github.com/UW-GAC/GENESIS/blob/devel/man/assocTestSingle.Rd

\item{variant.id}{The variant ID}
\item{chr}{The chromosome value}
\item{pos}{The base pair position}
\item{allele.index}{The index of the alternate allele. For biallelic variants, this will always be 1.}
\item{n.obs}{The number of samples with non-missing genotypes}
\item{freq}{The estimated effect allele frequency}
\item{MAC}{The minor allele count. For multiallelic variants, "minor" is determined by comparing the count of the allele specified by \code{allele.index} with the sum of all other alleles.}
If \code{geno.coding} is \code{"recessive"}:
\item{n.hom.eff}{The number of samples homozygous for the effect allele.}
If \code{geno.coding} is \code{"dominant"}:
\item{n.any.eff}{The number of samples with any copies of the effect allele.}
If \code{test} is \code{"Score"}:
\item{Score}{The value of the score function}
\item{Score.SE}{The estimated standard error of the Score}
\item{Score.Stat}{The score Z test statistic}
\item{Score.pval}{The score p-value}
\item{Est}{An approximation of the effect size estimate for each additional copy of the effect allele}
\item{Est.SE}{An approximation of the standard error of the effect size estimate}
\item{PVE}{An approximation of the proportion of phenotype variance explained}
% If \code{test} is \code{"Wald"} and \code{GxE} is \code{NULL}:
% \item{Est}{The effect size estimate for each additional copy of the effect allele}
% \item{Est.SE}{The estimated standard error of the effect size estimate}
% \item{Wald.Stat}{The Wald Z test statistic}
% \item{Wald.pval}{The Wald p-value}
If \code{test} is \code{"Score.SPA"}:
\item{SPA.pval}{The score p-value after applying the saddle point approximation (SPA)}
\item{SPA.converged}{logical indiactor of whether the SPA converged; \code{NA} indicates that the SPA was not applied and the original Score.pval was returned}
If \code{GxE} is not \code{NULL}:
\item{Est.G}{The effect size estimate for the genotype term}
\item{Est.G:env}{The effect size estimate for the genotype*env interaction term. There will be as many of these terms as there are interaction variables, and "env" will be replaced with the variable name.}
\item{SE.G}{The estimated standard error of the genotype term effect size estimate}
\item{SE.G:env}{The estimated standard error of the genotype*env effect size estimate. There will be as many of these terms as there are interaction variables, and "env" will be replaced with the variable name.}
\item{GxE.Stat}{The Wald Z test statistic for the test of all genotype interaction terms.  When there is only one genotype interaction term, this is the test statistic for that term.}
\item{GxE.pval}{The Wald p-value for the test of all genotype interaction terms; i.e. the test of any genotype interaction effect}
\item{Joint.Stat}{The Wald Z test statistic for the joint test of the genotype term and all of the genotype interaction terms}
\item{Joint.pval}{The Wald p-value for the joint test of the genotype term and all of the genotype interaction terms; i.e. the test of any genotype effect}
If \code{test} is \code{"BinomiRare" or "CMP"}:
\item{n.carrier}{Number of individuals with at least one copy of the effect allele}
\item{n.D.carrier}{Number of cases with at least one copy of the effect allele}
\item{pval}{p-value}
\item{mid.pval}{mid-p-value}
freeseek commented 3 weeks ago

I don't see rsid, alt, ref, eff.frq, or Z among the GENESIS bindings. Maybe it would be simply better to add some of the bindings you are working with in colheaders.tsv rather than add a -c option for a non-conventional summary statistics format