frankvogt / vcf2gwas

Python API for comprehensive GWAS analysis using GEMMA
GNU General Public License v3.0

Error: missing 1 required positional argument: 'dir_temp' #21

Open adrianodemarino opened 1 year ago

adrianodemarino commented 1 year ago

I get this error when my VCF file contains a large number of samples.

command executed: vcf2gwas -v chr20.Haplotypes.vcf.gz -pf height_data_ukbb_participant.csv -p height -cf covariants_nokinship_data_ukbb_participant.csv -c sex -lmm

$ head height_data_ukbb_participant.csv
,height
1000028,169
1000045,166
1000104,162
1000171,184
$ head covariants_nokinship_data_ukbb_participant.csv
,age,sex,pop
1000028,46,Female,British
1000045,53,Female,British
1000104,56,Female,British
1000171,42,Male,British

run:

vcf2gwas v0.8.8

Initialising..

Start time: Mon, 20 Feb 2023 22:31

Parsing arguments..
Genotype file: chr20.Haplotypes.vcf.gz
Phenotype file(s): height_data_ukbb_participant.csv
Covariate file: covariants_nokinship_data_ukbb_participant.csv

Arguments parsed successfully

Preparing files

Checking height_data_ukbb_participant.csv..
Phenotype distribution(s) successfully plotted

Indexing VCF file..
VCF file successfully indexed (Duration: 1 minute, 2.6 seconds)

Starting genotype Quality Control..
QC for Chromosome: 20
Quality control successful (Duration: 13 minutes, 14.2 seconds)

Filtering SNPs..
SNPs successfully filtered (Duration: 12 minutes, 11.7 seconds)

File preparations completed

Starting analysis..

Beginning with analysis of height_data_ukbb_participant.csv

Preparing files

Checking and adjusting files..
Chromosomes: 20
Checking individuals in VCF file..
Checking individuals in phenotype file..
Not all individuals in phenotype and genotype file match
Removed 2164 out of 487409 genotype individuals, 485245 remaining
Removed 987 out of 486232 phenotype individuals, 485245 remaining
Checking individuals in covariate file..
Not all individuals in covariate and genotype file match
Removed 0 out of 485245 genotype individuals, 485245 remaining
Removed 987 out of 486232 covariate individuals, 485245 remaining
In total, removed 2164 out of 487409 genotype individuals, 485245 remaining
Files successfully adjusted

Filtering and converting files

Converting to PLINK BED..
Successfully converted to PLINK BED (Duration: 2 minutes, 34.0 seconds)

Adding phenotypes/covariates to .fam file

Editing .fam file..
Phenotype(s) added to .fam file
Editing .fam file successful

Initialising GEMMA

Running GEMMA

Phenotypes to analyze: height

Creating relatedness matrix..
GEMMA 0.98.3 (2020-11-28) by Xiang Zhou and team (C) 2012-2020
Reading Files ...
## number of total individuals = 485245
## number of analyzed individuals = 485245
## number of covariates = 1
## number of phenotypes = 1
## number of total SNPs/var        =    15751
## number of analyzed SNPs         =    15724
Calculating Relatedness Matrix ...
Traceback (most recent call last):
  File "/Users/adriano/miniconda3/envs/gwas/lib/python3.9/site-packages/vcf2gwas/analysis.py", line 387, in <module>
    Gemma.write_returncodes(code, pc_prefix)
TypeError: write_returncodes() missing 1 required positional argument: 'dir_temp'

If I run exactly the same command using only a subset of samples, it works perfectly. I get the same error if I use the option -ac instead of -c sex.

frankvogt commented 1 year ago

The error message is caused by a small bug, which I will fix soon. The actual reason for the crash, however, is that GEMMA seems unable to finish creating the relatedness matrix when using your full dataset. Did you try different subsets of your VCF file to see whether only a certain subgroup of SNPs is causing the issue?
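
For reference, the traceback shows `Gemma.write_returncodes(code, pc_prefix)` being called with two arguments while the function also expects `dir_temp`. Below is a minimal sketch of that failure mode. The class is a hypothetical stand-in: only the function name and the three argument names come from the traceback, while the method body, the `@staticmethod` decorator, and the example values are assumptions.

```python
class Gemma:
    """Hypothetical stand-in, not the actual vcf2gwas implementation."""

    @staticmethod
    def write_returncodes(code, pc_prefix, dir_temp):
        # Assumed behaviour: describe where a return code would be recorded.
        return f"{dir_temp}/{pc_prefix}_returncode: {code}"


# Calling with only two arguments, as in the traceback, raises a TypeError.
try:
    Gemma.write_returncodes(1, "height")
except TypeError as err:
    message = str(err)
    print(message)

# Supplying the third argument avoids the TypeError (values are placeholders).
fixed = Gemma.write_returncodes(1, "height", "tmp_dir")
print(fixed)
```

Fixing the call only makes the error reporting work as intended. It would not by itself make the relatedness matrix step succeed.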

adrianodemarino commented 1 year ago

I didn't try a different subgroup of SNPs. I only subset the samples (10 samples in one case and 5000 in another), and in both cases it worked perfectly. Could it be that, with so many samples (~500k), the relatedness matrix needs more than just 15724 SNPs?
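
One thing that can be checked with plain arithmetic is how large a relatedness matrix gets at these sample counts. The sketch below assumes the matrix is stored densely as n x n double-precision values (8 bytes each). That storage layout is an assumption, not something confirmed in this thread, and `kinship_bytes` is a hypothetical helper name.

```python
def kinship_bytes(n_samples, bytes_per_value=8):
    """Bytes needed for a dense n x n matrix (assumes 8-byte doubles)."""
    return n_samples * n_samples * bytes_per_value


# Sample counts taken from this thread: two working subsets and the full set.
for n in (10, 5000, 485245):
    size = kinship_bytes(n)
    print(f"{n:>7} samples: {size:,} bytes ({size / 1e9:.3f} GB)")
```

Under that assumption, 5000 samples need about 0.2 GB, while 485245 samples need roughly 1.88 TB, so memory may be a more likely limit than the number of SNPs.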

frankvogt commented 1 year ago

I have never read about that being an issue. Maybe you could subset your samples further to find an upper limit on how many samples GEMMA is able to use?
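
A rough sketch of how such an upper limit could be located with a bisection over the sample count. It assumes that success is monotonic (if n samples work, any smaller n works too). `run_succeeds` is a hypothetical callable that would subset the data to n samples, run the analysis, and report whether it finished. The placeholder predicate below uses an arbitrary threshold purely to exercise the search and is not a measured value.

```python
def largest_working_size(known_good, known_bad, run_succeeds):
    """Return the largest sample count for which run_succeeds(n) is True.

    known_good must be a size that works and known_bad one that fails.
    """
    lo, hi = known_good, known_bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if run_succeeds(mid):
            lo = mid  # mid works, so the limit is at mid or above
        else:
            hi = mid  # mid fails, so the limit is below mid
    return lo


# Placeholder predicate with an arbitrary, made-up threshold for illustration.
def fake_run(n):
    return n <= 46000


# 5000 samples worked and 485245 failed, according to this thread.
limit = largest_working_size(5000, 485245, fake_run)
print(limit)
```

Each probe is a full run, so in practice it may be enough to stop after a handful of steps once the range is narrow enough to be informative.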

adrianodemarino commented 1 year ago

I will try to do that and let you know, thanks!