Closed ThiagoPL closed 2 years ago
Thanks for your information. Seems it is working fine at the start. And just for your information, the warning is normal is small sample size. The errors seems to be at some particular region so i suspect it may be some problem of genotype at that region.
Could you help try the following when you convert vcf to plink if you haven't done so:
plink2 --vcf ${vcf} \
--max-alleles 2 \
--rm-dup exclude-all \
--snps-only \
--make-pgen \
--maf 0.005 \
--out ${out}
You may also need to update the .lanc file, but for debugging purpose, you may start with only "ATT" option as that does not require local ancestry. I also update the latest code so it may be easier to identify the problems. After I identify the problem, i can modify the software so that it could also work on your original file.
To summarize,
pip install -U "dask[array]"
git clone https://github.com/KangchengHou/admix-kit
cd admix-kit
pip install -r requirements.txt; pip install -e .
Thanks for your patience
Hello,
Now I have a new problem. I am sending the logs from PLINK and the running.
Thank you very much, Thiago Peixoto Leal
This is very helpful.
Do you mind checking the 14875th and nearby variant in .pvar file to see if there's anything in particular about that?
Things to check may be SNP id, and see if all information looks okay
I got the SNP # 14870 until 14890. Looking I did not found anything weird (in bold it is the #14875).
22 20008717 chr22:20008717:A:G A G PASS AF=0.71234;MAF=0.28766;R2=0.97657;IMPUTED 22 20008732 chr22:20008732:A:G A G PASS AF=0.71233;MAF=0.28767;R2=0.97661;IMPUTED 22 20009046 chr22:20009046:G:T G T PASS AF=0.00968;MAF=0.00968;R2=0.99954;ER2=0.99500;TYPED 22 20009047 chr22:20009047:C:T C T PASS AF=0.47295;MAF=0.47295;R2=0.99991;ER2=0.99686;TYPED 22 20009306 chr22:20009306:G:A G A PASS AF=0.47133;MAF=0.47133;R2=0.98721;IMPUTED 22 20009355 chr22:20009355:G:C G C PASS AF=0.22394;MAF=0.22394;R2=0.99057;IMPUTED 22 20010565 chr22:20010565:G:A G A PASS AF=0.06261;MAF=0.06261;R2=0.92972;IMPUTED 22 20010681 chr22:20010681:C:T C T PASS AF=0.07420;MAF=0.07420;R2=0.99187;IMPUTED 22 20010863 chr22:20010863:T:C T C PASS AF=0.33933;MAF=0.33933;R2=0.99945;ER2=0.99432;TYPED 22 20010963 chr22:20010963:C:T C T PASS AF=0.01213;MAF=0.01213;R2=0.98889;IMPUTED 22 20011095 chr22:20011095:C:G C G PASS AF=0.25489;MAF=0.25489;R2=0.98695;IMPUTED 22 20011132 chr22:20011132:G:C G C PASS AF=0.55770;MAF=0.44230;R2=0.98305;IMPUTED 22 20011440 chr22:20011440:T:G T G PASS AF=0.05853;MAF=0.05853;R2=0.99200;IMPUTED 22 20011550 chr22:20011550:C:T C T PASS AF=0.05800;MAF=0.05800;R2=0.99140;IMPUTED 22 20011641 chr22:20011641:C:T C T PASS AF=0.65124;MAF=0.34876;R2=0.97894;IMPUTED 22 20011748 chr22:20011748:C:T C T PASS AF=0.05853;MAF=0.05853;R2=0.99202;IMPUTED 22 20011963 chr22:20011963:C:T C T PASS AF=0.25533;MAF=0.25533;R2=0.98323;IMPUTED 22 20011969 chr22:20011969:T:C T C PASS AF=0.71222;MAF=0.28778;R2=0.96746;IMPUTED 22 20012090 chr22:20012090:C:A C A PASS AF=0.05882;MAF=0.05882;R2=0.97860;IMPUTED 22 20012237 chr22:20012237:A:G A G PASS AF=0.24266;MAF=0.24266;R2=0.98184;IMPUTED 22 20012905 chr22:20012905:G:T G T PASS AF=0.09413;MAF=0.09413;R2=0.99981;ER2=0.99209;TYPED
It seems to be a issue from the PLINK2 pgenlib where i posted a question at https://groups.google.com/g/plink2-users/c/5Xxa-DDcXRY
Is it possible to extract the subset of SNPs that's causing this issue and share the data set for debugging purpose? We could think alternative routes if this is hard to operate.
alternatively, could you try the same code on chr21?
Hello, The info for the chromosome 21 in Log3.txt. I am checking the SNP extraction process. Log3.txt
If possible it would be great to have a minimal file (hopefully without sensitive privacy data) that can replicate this issue. Thanks a lot in advance.
This code can help replicate the issue.
import pgenlib
path = "/path/to/plink2.pgen"
n_indiv = 1480
snp_start = # start of snp, set this to be the error variable - 10
snp_stop = # stop of snp, set this to be the error variable + 10
with pgenlib.PgenReader(bytes(path, "utf8"), raw_sample_ct=n_indiv) as pgen:
print(pgen.get_raw_sample_ct())
geno = np.empty(
[snp_stop - snp_start, pgen.get_raw_sample_ct() * 2], dtype=np.int32
)
pgen.read_alleles_range(snp_start, snp_stop, geno)
# it should be expected this will also have error
Hello,
I ran the code and this just print the number of sample (1480). I used the range 14865 to 14885 and 14800 to 14900 and the result was the same.
I got the SNP and the SNPs closer. I select just imputed and I changed the ID on PSAM. Also, my PSAM original has AGE, SEX and 10 PCs.
Thank you
Indeed similar with your observation, I can not replicate the error with this sub-sampled data set. I wonder
(1) if you take the length of header into consideration when sub-sampling the data set.
(2) if you can have a sub-sampled data set (e.g. divide the whole data set into multiple region) and run admix assoc
in each of them to see at which region fails and share the corresponding anonymized data set.
My email is kangchenghou@gmail.com just in case you want to send some large file. Thanks a lot for your patience.
Hello,
I am sorry for the delay. I tried windows of 10,000 SNPs, and on the second window the problem happens (LogImputed.txt). I removed ALL genotyped SNPs, changed the IDs and the pheno file is odd/even assigned (BugImputed.zip). The problem happens even after these changes. I also got a problem on SNP1 on a file that worked to ATT and TRACTOR (LogSNP1.txt)
Hi,
I appreciate your help on this.
First, I am able to replicate the error on SNP1, which is indeed a bug which is fixed.
Second, I am still unable to replicate the error about the reading part. Attached is the script and log I got.
wget https://github.com/KangchengHou/admix-kit/files/7865739/BugImputed.zip
unzip BugImputed.zip
admix assoc --pfile ./BugImputed --pheno ./BugImputed.pheon --pheno-col DISEASE --method ATT,TRACTOR,SNP1 --out AssocTest21 --family binary
This makes me think the following:
pip uninstall pgenlib
pip install -U cython numpy
pip install -U git+https://github.com/KangchengHou/dask-pgen.git#egg=dask-pgen
and retry running the script.
2. This may be related to computing environement, any chance you can also replicate this in another computer, say, your PC?
Hello,
I tried to re-install and the problem still happen. I open a ticket with the HPC administrator to install. I also trying in a CentOS to see if it is some kind of libraries version.
Thank you very much
Hello,
Good news: the desktop version is running well, without problems. Sorry for all the problems.
PS: is there any way to get beta, OR, etc?
Thank you, Thiago Peixoto Leal
Great! I am in the middle to adding these effect sizes to the output. I should be able to follow up within two weeks.
Hello,
Sorry for bother you. I saw that the pgen lib was fixed and I tried to update to run on the HPC and when I tried to run I got:
admix assoc --pfile /home/peixott/beegfs/ADMIX-KIT/PFILES/LARGE_chr22_Biallelic_COVAR --pheno /home/peixott/beegfs/ADMIX-KIT/PFILES/LARGE_chr22_Biallelic_COVAR_pheno.txt --pheno-col DISEASE --method ATT,TRACTOR,SNP1 --out AssocTest22 --covar /home/peixott/beegfs/ADMIX-KIT/PFILES/LARGE_chr22_Biallelic_COVAR_covarNonNA.txt
2022-01-19 12:42.46 [info ] Received parameters:
assoc
--pfile=/home/peixott/beegfs/ADMIX-KIT/PFILES/LARGE_chr22_Biallelic_COVAR
--pheno=/home/peixott/beegfs/ADMIX-KIT/PFILES/LARGE_chr22_Biallelic_COVAR_pheno.txt
--pheno_col=DISEASE
--out=AssocTest22
--covar=/home/peixott/beegfs/ADMIX-KIT/PFILES/LARGE_chr22_Biallelic_COVAR_covarNonNA.txt
--method=('ATT', 'TRACTOR', 'SNP1')
--family=quant
--fast=True
2022-01-19 12:42.46 [info ] admix.Dataset: read local ancestry from /home/peixott/beegfs/ADMIX-KIT/PFILES/LARGE_chr22_Biallelic_COVAR.lanc
2022-01-19 12:42.46 [info ] admix.Dataset: n_anc
is not provided, infered n_anc from the first 1,000 SNPs is 3. If this is not correct, provide n_anc
when constructing admix.Dataset
2022-01-19 12:42.46 [info ] 140393 SNPs and 1480 individuals are loaded
2022-01-19 12:42.46 [warning ] admix.dataset.append_indiv_info: 1/1481 individuals in the new dataframe not in the dataset; These individuals will be ignored.
2022-01-19 12:42.46 [warning ] admix.dataset.append_indiv_info: 1/1481 individuals in the new dataframe not in the dataset; These individuals will be ignored.
2022-01-19 12:42.46 [info ] 12 covariates are loaded: AGE,SEX,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
2022-01-19 12:42.46 [info ] 140393 SNPs and 1480 individuals left after filtering for missing phenotype, or completely missing covariate
2022-01-19 12:42.46 [info ] Detected categorical columns:
2022-01-19 12:42.46 [info ] Added dummy variables:
2022-01-19 12:42.46 [info ] Performing association analysis with method ATT
admix.assoc.marginal: 0%| | 0/138 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/peixott/.local/bin/admix", line 33, in
Invoked with: array([[0., 0., 0., ..., 0., 0., 1.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]], dtype=float32), array([[ 1.0000000e+00, 3.8000000e+01, 2.0000000e+00, ..., 2.8862273e-02, -1.5665960e-03, 2.3455237e-02], [ 1.0000000e+00, 3.3000000e+01, 1.0000000e+00, ..., 3.3046694e-02, -1.5794820e-03, 2.4544395e-02], [ 1.0000000e+00, 4.5000000e+01, 1.0000000e+00, ..., 2.2041136e-02, 4.7022500e-04, 1.7670383e-02], ..., [ 1.0000000e+00, 3.3000000e+01, 1.0000000e+00, ..., -1.4849876e-02, 2.0219720e-03, -1.5322012e-02], [ 1.0000000e+00, 3.6000000e+01, 1.0000000e+00, ..., -8.6413450e-03, 1.8475460e-03, -5.1883060e-03], [ 1.0000000e+00, 5.2000000e+01, 2.0000000e+00, ..., -1.3342459e-02, -2.0870400e-04, -1.3670151e-02]]), array([0, 0, 0, ..., 0, 0, 0]), 1, array([0]), array([[0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.], ..., [0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.]])
Hi,
Please update to the latest version following https://github.com/KangchengHou/admix-kit#update-to-latest
This is because i am also constantly updating other dependencies so you need to type a few lines.
On pgenlib, I am still not sure whether that will work in your case, it's worth trying though.
Also, I have added the effect sizes and SE for admix assoc, I will validate their accuracy in the next few days.
Hello,
I followed the procedure to update and I still getting that error message. I am testing on the computer that previously worked .
admix assoc --pfile ./PFILES/LARGE_chr22_Biallelic_COVAR --pheno ./PFILES/LARGE_chr22_Biallelic_COVAR_pheno.txt --pheno-col DISEASE --method ATT --out AssocTest22_ATT --family binary --covar ./PFILES/LARGE_chr22_Biallelic_COVAR_covarNonNA.txt
2022-01-21 11:53.54 [info ] Received parameters:
assoc
--pfile=./PFILES/LARGE_chr22_Biallelic_COVAR
--pheno=./PFILES/LARGE_chr22_Biallelic_COVAR_pheno.txt
--pheno_col=DISEASE
--out=AssocTest22_ATT
--covar=./PFILES/LARGE_chr22_Biallelic_COVAR_covarNonNA.txt
--method=ATT
--family=binary
--fast=True
2022-01-21 11:53.54 [info ] admix.Dataset: read local ancestry from ./PFILES/LARGE_chr22_Biallelic_COVAR.lanc
2022-01-21 11:53.54 [info ] admix.Dataset: n_anc
is not provided, infered n_anc from the first 1,000 SNPs is 3. If this is not correct, provide n_anc
when constructing admix.Dataset
2022-01-21 11:53.54 [info ] 140393 SNPs and 1480 individuals are loaded
2022-01-21 11:53.54 [warning ] admix.dataset.append_indiv_info: 1/1481 individuals in the new dataframe not in the dataset; These individuals will be ignored.
2022-01-21 11:53.54 [warning ] admix.dataset.append_indiv_info: 1/1481 individuals in the new dataframe not in the dataset; These individuals will be ignored.
2022-01-21 11:53.54 [info ] 12 covariates are loaded: AGE,SEX,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
2022-01-21 11:53.54 [info ] 140393 SNPs and 1480 individuals left after filtering for missing phenotype, or completely missing covariate
2022-01-21 11:53.54 [info ] Detected categorical columns:
2022-01-21 11:53.54 [info ] Added dummy variables:
2022-01-21 11:53.54 [info ] Performing association analysis with method ATT
admix.assoc.marginal: 0%| | 0/138 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/peixott/.local/bin/admix", line 33, in
Invoked with: array([[0., 0., 0., ..., 1., 0., 1.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]], dtype=float32), array([[ 1.0000000e+00, 3.8000000e+01, 2.0000000e+00, ..., 2.8862273e-02, -1.5665960e-03, 2.3455237e-02], [ 1.0000000e+00, 3.3000000e+01, 1.0000000e+00, ..., 3.3046694e-02, -1.5794820e-03, 2.4544395e-02], [ 1.0000000e+00, 4.5000000e+01, 1.0000000e+00, ..., 2.2041136e-02, 4.7022500e-04, 1.7670383e-02], ..., [ 1.0000000e+00, 3.3000000e+01, 1.0000000e+00, ..., -1.4849876e-02, 2.0219720e-03, -1.5322012e-02], [ 1.0000000e+00, 3.6000000e+01, 1.0000000e+00, ..., -8.6413450e-03, 1.8475460e-03, -5.1883060e-03], [ 1.0000000e+00, 5.2000000e+01, 2.0000000e+00, ..., -1.3342459e-02, -2.0870400e-04, -1.3670151e-02]]), array([0, 0, 0, ..., 0, 0, 0]), 1, array([0]), array([[0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.], ..., [0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.]]), 100, 1e-06
Seems the tinygwas module is not updated to the latest, could you redo
pip install -U git+https://github.com/bogdanlab/tinygwas.git#egg=tinygwas
and confirm tinygwas-0.2 is installed?
Hello,
After run this line the methodology worked again. Thanks
Hello,
I am sending the information about my trying to run the ATT, TRACTOR and SNP1 using our data (case and control for Parkinson Disease).
Our covar files has 13 columns: indiv AGE SEX PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
Our pheno file has 2 columns: indiv DISEASE
Our pfile is from imputed data (VCF to PFILE) with a modified PSAM:
$ head ./PFILES/LARGE_chr22_Biallelic_COVAR.psam -n 1
IID AGE SEX PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
Thank you very much, Thiago Peixoto Leal Log.txt