genetics-statistics / GEMMA

Genome-wide Efficient Mixed Model Association
https://github.com/genetics-statistics/GEMMA
GNU General Public License v3.0
334 stars 126 forks source link

Number of total SNPs and number of analyzed SNPs #107

Closed angelaparodymerino closed 7 years ago

angelaparodymerino commented 7 years ago

Hi!

I run a BSLMM in GEMMA and this is what appears in the terminal:

$ gemma -g bimbampcas2.gen -p pheno_catlinsPlus10.txt -k k_thin_100000.txt -bslmm 2 -notsnp -o out2
Reading Files ... 
## number of total individuals = 265
## number of analyzed individuals = 265
## number of covariates = 1
## number of phenotypes = 1
## number of total SNPs = 3414
## number of analyzed SNPs = 3414
Start Eigen-Decomposition...
pve estimate =0.1963
se(pve) =0.0584786
Calculating UtX...

I included the covariates (PC1 and PC2) in the bimbam file, so this bimbampcas2.gen contains two rows with these covariates, therefore it contains 3412 SNPs (rows) + 2 PCs (rows)= 3414 rows. I was wondering why I get that the number of total SNPs = 3414 and not 3412. I wonder the same thing about number of analyzed SNPs, which it says it is 3414 instead of 3412. Should I worry? There is something I am doing wrong? I used -notsnp so I suppose it knows that there are covariates in the bimbampcas2.gen file.

Thanks in advance,

Regards,

Ángela

pcarbo commented 7 years ago

@angelaparodymerino The covariate data should be in a separate file, and added to your analysis with the -c option. Please see Sec. 3.4 of the manual a well as gemma -h 2.

angelaparodymerino commented 7 years ago

Hi pcarbo, thanks for your answer.

GEMMA's manual says (page 24): "In addition, GEMMA does not take covariates le when fitting BSLMM. However, one can use the BIMBAM mean genotype le to store these covariates and use -notsnp" option to use them."_ so that is what I did. Specifically, to do what it is said in the manual I did the following:

1) I used bcftools: bcftools convert -gensample filename.vcf

I got a .gen/.sample file.

2) I used the bash command given in GEMMA's manual to convert this .gen file into bimbam mean genotype file: cat filename.gen | awk -v s=265 ’{ printf $2 "," $4 "," $5; for(i=1; i<=s; i++) printf "," $(i3+3)2+$(i*3+4); printf "\n" }’ > bimbamfilename.gen

2) I store pc1 and pc2 in a .txt file, in two rows like this (note that the alleles are just artificial alleles):

pc1,A,T,-0.0259,-1.6238, etc...
pc2,G,C,-0.0757,-0.0848, etc...

3) I added/appended the two rows of the pcs at the end of the bimbam mean genotype file with: cat bimbamfilename.gen pcas.txt > bimbampcas.gen

4) Then I used -notsnp, which I guess indicates the program that in the bimbam mean genotype file there are some columns that are not snps but covariates (population structure information).

Is doing this equivalent to provide covariates in a separate file (in the format explained in section 3.4. of the manual)?

Thanks in advance,

'Angela Parody-Merino

angelaparodymerino commented 7 years ago

Well, my previous comment has a different question than the title of this entry. But, well, I have just explained what I have done and I still have the doubt about why

## number of total SNPs = 3414
## number of analyzed SNPs = 3414

are not 3412, which are my actual number of SNPs. It must be counting the two last rows of the pcs, but it is confusing because I used -notsnp so it shouldn't take those two last rows as real snps (according to the manual). Thanks,

Cheers,

'Angela

pcarbo commented 7 years ago

@angelaparodymerino My apologies, yes, you are right.

I think you are doing this correctly. I would view all the rows of your BIMBAM file as general covariates, that may or may not be SNPs. So I suppose we should replace the message number of total SNPs = 3414 with number of total variables = 3414, which would be more correct.

In any case, just ignore the count of the number of SNPs.

angelaparodymerino commented 7 years ago

Ok, thanks. I just needed to reassure that what I was doing made sense.

Maybe number of total variables would be less confusing and more accurate than number of total SNPs.

Thanks again!

'Angela Parody-Merino