KangchengHou / admix-kit

Toolkit for analyzing genetics data from admixed populations
https://kangchenghou.github.io/admix-kit
22 stars 4 forks source link

Error in Admix assoc marginal (binary) #14

Closed ThiagoPL closed 2 years ago

ThiagoPL commented 2 years ago

Hello,

I am sending the information about my trying to run the ATT, TRACTOR and SNP1 using our data (case and control for Parkinson Disease).

Our covar files has 13 columns: indiv AGE SEX PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10

Our pheno file has 2 columns: indiv DISEASE

Our pfile is from imputed data (VCF to PFILE) with a modified PSAM:

$ head ./PFILES/LARGE_chr22_Biallelic_COVAR.psam -n 1

IID AGE SEX PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10

Thank you very much, Thiago Peixoto Leal Log.txt

KangchengHou commented 2 years ago

Thanks for your information. Seems it is working fine at the start. And just for your information, the warning is normal is small sample size. The errors seems to be at some particular region so i suspect it may be some problem of genotype at that region.

Could you help try the following when you convert vcf to plink if you haven't done so:

plink2 --vcf ${vcf} \
    --max-alleles 2 \
    --rm-dup exclude-all \
    --snps-only \
    --make-pgen \
    --maf 0.005 \
    --out ${out}

You may also need to update the .lanc file, but for debugging purpose, you may start with only "ATT" option as that does not require local ancestry. I also update the latest code so it may be easier to identify the problems. After I identify the problem, i can modify the software so that it could also work on your original file.

To summarize,

  1. Clone the latest code and reinstall
pip install -U "dask[array]"
git clone https://github.com/KangchengHou/admix-kit
cd admix-kit
pip install -r requirements.txt; pip install -e .
  1. Double check PLINK2 file.

Thanks for your patience

ThiagoPL commented 2 years ago

Hello,

Now I have a new problem. I am sending the logs from PLINK and the running.

Thank you very much, Thiago Peixoto Leal

Log2.txt

KangchengHou commented 2 years ago

This is very helpful.

Do you mind checking the 14875th and nearby variant in .pvar file to see if there's anything in particular about that?

Things to check may be SNP id, and see if all information looks okay

ThiagoPL commented 2 years ago

I got the SNP # 14870 until 14890. Looking I did not found anything weird (in bold it is the #14875).

pipeline=michigan-imputationserver-1.5.7

imputation=minimac4-1.0.2

phasing=eagle-2.4

r2Filter=0.2

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

22 20008717 chr22:20008717:A:G A G PASS AF=0.71234;MAF=0.28766;R2=0.97657;IMPUTED 22 20008732 chr22:20008732:A:G A G PASS AF=0.71233;MAF=0.28767;R2=0.97661;IMPUTED 22 20009046 chr22:20009046:G:T G T PASS AF=0.00968;MAF=0.00968;R2=0.99954;ER2=0.99500;TYPED 22 20009047 chr22:20009047:C:T C T PASS AF=0.47295;MAF=0.47295;R2=0.99991;ER2=0.99686;TYPED 22 20009306 chr22:20009306:G:A G A PASS AF=0.47133;MAF=0.47133;R2=0.98721;IMPUTED 22 20009355 chr22:20009355:G:C G C PASS AF=0.22394;MAF=0.22394;R2=0.99057;IMPUTED 22 20010565 chr22:20010565:G:A G A PASS AF=0.06261;MAF=0.06261;R2=0.92972;IMPUTED 22 20010681 chr22:20010681:C:T C T PASS AF=0.07420;MAF=0.07420;R2=0.99187;IMPUTED 22 20010863 chr22:20010863:T:C T C PASS AF=0.33933;MAF=0.33933;R2=0.99945;ER2=0.99432;TYPED 22 20010963 chr22:20010963:C:T C T PASS AF=0.01213;MAF=0.01213;R2=0.98889;IMPUTED 22 20011095 chr22:20011095:C:G C G PASS AF=0.25489;MAF=0.25489;R2=0.98695;IMPUTED 22 20011132 chr22:20011132:G:C G C PASS AF=0.55770;MAF=0.44230;R2=0.98305;IMPUTED 22 20011440 chr22:20011440:T:G T G PASS AF=0.05853;MAF=0.05853;R2=0.99200;IMPUTED 22 20011550 chr22:20011550:C:T C T PASS AF=0.05800;MAF=0.05800;R2=0.99140;IMPUTED 22 20011641 chr22:20011641:C:T C T PASS AF=0.65124;MAF=0.34876;R2=0.97894;IMPUTED 22 20011748 chr22:20011748:C:T C T PASS AF=0.05853;MAF=0.05853;R2=0.99202;IMPUTED 22 20011963 chr22:20011963:C:T C T PASS AF=0.25533;MAF=0.25533;R2=0.98323;IMPUTED 22 20011969 chr22:20011969:T:C T C PASS AF=0.71222;MAF=0.28778;R2=0.96746;IMPUTED 22 20012090 chr22:20012090:C:A C A PASS AF=0.05882;MAF=0.05882;R2=0.97860;IMPUTED 22 20012237 chr22:20012237:A:G A G PASS AF=0.24266;MAF=0.24266;R2=0.98184;IMPUTED 22 20012905 chr22:20012905:G:T G T PASS AF=0.09413;MAF=0.09413;R2=0.99981;ER2=0.99209;TYPED

KangchengHou commented 2 years ago

It seems to be a issue from the PLINK2 pgenlib where i posted a question at https://groups.google.com/g/plink2-users/c/5Xxa-DDcXRY

Is it possible to extract the subset of SNPs that's causing this issue and share the data set for debugging purpose? We could think alternative routes if this is hard to operate.

KangchengHou commented 2 years ago

alternatively, could you try the same code on chr21?

ThiagoPL commented 2 years ago

Hello, The info for the chromosome 21 in Log3.txt. I am checking the SNP extraction process. Log3.txt

KangchengHou commented 2 years ago

If possible it would be great to have a minimal file (hopefully without sensitive privacy data) that can replicate this issue. Thanks a lot in advance.

This code can help replicate the issue.

import pgenlib
path = "/path/to/plink2.pgen"
n_indiv = 1480
snp_start = # start of snp, set this to be the error variable - 10
snp_stop = # stop of snp, set this to be the error variable + 10
with pgenlib.PgenReader(bytes(path, "utf8"), raw_sample_ct=n_indiv) as pgen:
    print(pgen.get_raw_sample_ct())
    geno = np.empty(
        [snp_stop - snp_start, pgen.get_raw_sample_ct() * 2], dtype=np.int32
    )
    pgen.read_alleles_range(snp_start, snp_stop, geno)

# it should be expected this will also have error
ThiagoPL commented 2 years ago

Hello,

I ran the code and this just print the number of sample (1480). I used the range 14865 to 14885 and 14800 to 14900 and the result was the same.

I got the SNP and the SNPs closer. I select just imputed and I changed the ID on PSAM. Also, my PSAM original has AGE, SEX and 10 PCs.

Thank you

Data_Debug.zip

KangchengHou commented 2 years ago

Indeed similar with your observation, I can not replicate the error with this sub-sampled data set. I wonder

(1) if you take the length of header into consideration when sub-sampling the data set. (2) if you can have a sub-sampled data set (e.g. divide the whole data set into multiple region) and run admix assoc in each of them to see at which region fails and share the corresponding anonymized data set.

My email is kangchenghou@gmail.com just in case you want to send some large file. Thanks a lot for your patience.

ThiagoPL commented 2 years ago

Hello,

I am sorry for the delay. I tried windows of 10,000 SNPs, and on the second window the problem happens (LogImputed.txt). I removed ALL genotyped SNPs, changed the IDs and the pheno file is odd/even assigned (BugImputed.zip). The problem happens even after these changes. I also got a problem on SNP1 on a file that worked to ATT and TRACTOR (LogSNP1.txt)

BugImputed.zip LogImputed.txt LogSNP1.txt

KangchengHou commented 2 years ago

Hi,

I appreciate your help on this.

First, I am able to replicate the error on SNP1, which is indeed a bug which is fixed.

Second, I am still unable to replicate the error about the reading part. Attached is the script and log I got.

wget https://github.com/KangchengHou/admix-kit/files/7865739/BugImputed.zip
unzip BugImputed.zip
admix assoc --pfile ./BugImputed --pheno ./BugImputed.pheon --pheno-col DISEASE --method ATT,TRACTOR,SNP1 --out AssocTest21 --family binary

log.txt

This makes me think the following:

  1. This may be library version problem, so could you try the following:
    
    pip uninstall pgenlib
    pip install -U cython numpy
    pip install -U git+https://github.com/KangchengHou/dask-pgen.git#egg=dask-pgen

pull and reinstall admix-kit


and retry running the script.

2. This may be related to computing environement, any chance you can also replicate this in another computer, say, your PC?
ThiagoPL commented 2 years ago

Hello,

I tried to re-install and the problem still happen. I open a ticket with the HPC administrator to install. I also trying in a CentOS to see if it is some kind of libraries version.

Thank you very much

ThiagoPL commented 2 years ago

Hello,

Good news: the desktop version is running well, without problems. Sorry for all the problems.

PS: is there any way to get beta, OR, etc?

Thank you, Thiago Peixoto Leal

KangchengHou commented 2 years ago

Great! I am in the middle to adding these effect sizes to the output. I should be able to follow up within two weeks.

ThiagoPL commented 2 years ago

Hello,

Sorry for bother you. I saw that the pgen lib was fixed and I tried to update to run on the HPC and when I tried to run I got:

admix assoc --pfile /home/peixott/beegfs/ADMIX-KIT/PFILES/LARGE_chr22_Biallelic_COVAR --pheno /home/peixott/beegfs/ADMIX-KIT/PFILES/LARGE_chr22_Biallelic_COVAR_pheno.txt --pheno-col DISEASE --method ATT,TRACTOR,SNP1 --out AssocTest22 --covar /home/peixott/beegfs/ADMIX-KIT/PFILES/LARGE_chr22_Biallelic_COVAR_covarNonNA.txt 2022-01-19 12:42.46 [info ] Received parameters: assoc --pfile=/home/peixott/beegfs/ADMIX-KIT/PFILES/LARGE_chr22_Biallelic_COVAR --pheno=/home/peixott/beegfs/ADMIX-KIT/PFILES/LARGE_chr22_Biallelic_COVAR_pheno.txt --pheno_col=DISEASE --out=AssocTest22 --covar=/home/peixott/beegfs/ADMIX-KIT/PFILES/LARGE_chr22_Biallelic_COVAR_covarNonNA.txt --method=('ATT', 'TRACTOR', 'SNP1') --family=quant --fast=True 2022-01-19 12:42.46 [info ] admix.Dataset: read local ancestry from /home/peixott/beegfs/ADMIX-KIT/PFILES/LARGE_chr22_Biallelic_COVAR.lanc 2022-01-19 12:42.46 [info ] admix.Dataset: n_anc is not provided, infered n_anc from the first 1,000 SNPs is 3. If this is not correct, provide n_anc when constructing admix.Dataset 2022-01-19 12:42.46 [info ] 140393 SNPs and 1480 individuals are loaded 2022-01-19 12:42.46 [warning ] admix.dataset.append_indiv_info: 1/1481 individuals in the new dataframe not in the dataset; These individuals will be ignored. 2022-01-19 12:42.46 [warning ] admix.dataset.append_indiv_info: 1/1481 individuals in the new dataframe not in the dataset; These individuals will be ignored. 2022-01-19 12:42.46 [info ] 12 covariates are loaded: AGE,SEX,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 2022-01-19 12:42.46 [info ] 140393 SNPs and 1480 individuals left after filtering for missing phenotype, or completely missing covariate 2022-01-19 12:42.46 [info ] Detected categorical columns: 2022-01-19 12:42.46 [info ] Added dummy variables: 2022-01-19 12:42.46 [info ] Performing association analysis with method ATT admix.assoc.marginal: 0%| | 0/138 [00:00<?, ?it/s] Traceback (most recent call last): File "/home/peixott/.local/bin/admix", line 33, in sys.exit(load_entry_point('admix-kit', 'console_scripts', 'admix')()) File "/mnt/beegfs/peixott/admix-kit/admix/cli/init.py", line 36, in cli fire.Fire() File "/home/peixott/.local/lib/python3.8/site-packages/fire/core.py", line 141, in Fire component_trace = _Fire(component, args, parsed_flag_args, context, name) File "/home/peixott/.local/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire component, remaining_args = _CallAndUpdateTrace( File "/home/peixott/.local/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace component = fn(*varargs, **kwargs) File "/mnt/beegfs/peixott/admix-kit/admix/cli/_assoc.py", line 113, in assoc dict_rls[m] = admix.assoc.marginal( File "/mnt/beegfs/peixott/admix-kit/admix/assoc/init.py", line 367, in marginal _block_test( File "/mnt/beegfs/peixott/admix-kit/admix/assoc/init.py", line 91, in _block_test tinygwas.linear_f_test(var, cov, pheno, var_size, test_vars, res) TypeError: linear_f_test(): incompatible function arguments. The following argument types are supported:

  1. (arg0: numpy.ndarray[numpy.float64[m, n]], arg1: numpy.ndarray[numpy.float64[m, n]], arg2: numpy.ndarray[numpy.float64[m, 1]], arg3: int, arg4: List[int], arg5: numpy.ndarray[numpy.float64[m, 1], flags.writeable], arg6: numpy.ndarray[numpy.float64[m, 1], flags.writeable]) -> None

Invoked with: array([[0., 0., 0., ..., 0., 0., 1.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]], dtype=float32), array([[ 1.0000000e+00, 3.8000000e+01, 2.0000000e+00, ..., 2.8862273e-02, -1.5665960e-03, 2.3455237e-02], [ 1.0000000e+00, 3.3000000e+01, 1.0000000e+00, ..., 3.3046694e-02, -1.5794820e-03, 2.4544395e-02], [ 1.0000000e+00, 4.5000000e+01, 1.0000000e+00, ..., 2.2041136e-02, 4.7022500e-04, 1.7670383e-02], ..., [ 1.0000000e+00, 3.3000000e+01, 1.0000000e+00, ..., -1.4849876e-02, 2.0219720e-03, -1.5322012e-02], [ 1.0000000e+00, 3.6000000e+01, 1.0000000e+00, ..., -8.6413450e-03, 1.8475460e-03, -5.1883060e-03], [ 1.0000000e+00, 5.2000000e+01, 2.0000000e+00, ..., -1.3342459e-02, -2.0870400e-04, -1.3670151e-02]]), array([0, 0, 0, ..., 0, 0, 0]), 1, array([0]), array([[0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.], ..., [0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.]])

KangchengHou commented 2 years ago

Hi,

Please update to the latest version following https://github.com/KangchengHou/admix-kit#update-to-latest

This is because i am also constantly updating other dependencies so you need to type a few lines.

On pgenlib, I am still not sure whether that will work in your case, it's worth trying though.

Also, I have added the effect sizes and SE for admix assoc, I will validate their accuracy in the next few days.

ThiagoPL commented 2 years ago

Hello,

I followed the procedure to update and I still getting that error message. I am testing on the computer that previously worked .

admix assoc --pfile ./PFILES/LARGE_chr22_Biallelic_COVAR --pheno ./PFILES/LARGE_chr22_Biallelic_COVAR_pheno.txt --pheno-col DISEASE --method ATT --out AssocTest22_ATT --family binary --covar ./PFILES/LARGE_chr22_Biallelic_COVAR_covarNonNA.txt 2022-01-21 11:53.54 [info ] Received parameters: assoc --pfile=./PFILES/LARGE_chr22_Biallelic_COVAR --pheno=./PFILES/LARGE_chr22_Biallelic_COVAR_pheno.txt --pheno_col=DISEASE --out=AssocTest22_ATT --covar=./PFILES/LARGE_chr22_Biallelic_COVAR_covarNonNA.txt --method=ATT --family=binary --fast=True 2022-01-21 11:53.54 [info ] admix.Dataset: read local ancestry from ./PFILES/LARGE_chr22_Biallelic_COVAR.lanc 2022-01-21 11:53.54 [info ] admix.Dataset: n_anc is not provided, infered n_anc from the first 1,000 SNPs is 3. If this is not correct, provide n_anc when constructing admix.Dataset 2022-01-21 11:53.54 [info ] 140393 SNPs and 1480 individuals are loaded 2022-01-21 11:53.54 [warning ] admix.dataset.append_indiv_info: 1/1481 individuals in the new dataframe not in the dataset; These individuals will be ignored. 2022-01-21 11:53.54 [warning ] admix.dataset.append_indiv_info: 1/1481 individuals in the new dataframe not in the dataset; These individuals will be ignored. 2022-01-21 11:53.54 [info ] 12 covariates are loaded: AGE,SEX,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 2022-01-21 11:53.54 [info ] 140393 SNPs and 1480 individuals left after filtering for missing phenotype, or completely missing covariate 2022-01-21 11:53.54 [info ] Detected categorical columns: 2022-01-21 11:53.54 [info ] Added dummy variables: 2022-01-21 11:53.54 [info ] Performing association analysis with method ATT admix.assoc.marginal: 0%| | 0/138 [00:00<?, ?it/s] Traceback (most recent call last): File "/home/peixott/.local/bin/admix", line 33, in sys.exit(load_entry_point('admix-kit', 'console_scripts', 'admix')()) File "/home/peixott/admix-kit/admix/cli/init.py", line 37, in cli fire.Fire() File "/home/peixott/.local/lib/python3.8/site-packages/fire/core.py", line 141, in Fire component_trace = _Fire(component, args, parsed_flag_args, context, name) File "/home/peixott/.local/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire component, remaining_args = _CallAndUpdateTrace( File "/home/peixott/.local/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace component = fn(*varargs, **kwargs) File "/home/peixott/admix-kit/admix/cli/_assoc.py", line 113, in assoc dict_rls[m] = admix.assoc.marginal( File "/home/peixott/admix-kit/admix/assoc/init.py", line 367, in marginal _block_test( File "/home/peixott/admix-kit/admix/assoc/init.py", line 105, in _block_test tinygwas.logistic_lrt( TypeError: logistic_lrt(): incompatible function arguments. The following argument types are supported:

  1. (arg0: numpy.ndarray[numpy.float64[m, n]], arg1: numpy.ndarray[numpy.float64[m, n]], arg2: numpy.ndarray[numpy.float64[m, 1]], arg3: int, arg4: List[int], arg5: int, arg6: float) -> numpy.ndarray[numpy.float64[m, 1]]

Invoked with: array([[0., 0., 0., ..., 1., 0., 1.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]], dtype=float32), array([[ 1.0000000e+00, 3.8000000e+01, 2.0000000e+00, ..., 2.8862273e-02, -1.5665960e-03, 2.3455237e-02], [ 1.0000000e+00, 3.3000000e+01, 1.0000000e+00, ..., 3.3046694e-02, -1.5794820e-03, 2.4544395e-02], [ 1.0000000e+00, 4.5000000e+01, 1.0000000e+00, ..., 2.2041136e-02, 4.7022500e-04, 1.7670383e-02], ..., [ 1.0000000e+00, 3.3000000e+01, 1.0000000e+00, ..., -1.4849876e-02, 2.0219720e-03, -1.5322012e-02], [ 1.0000000e+00, 3.6000000e+01, 1.0000000e+00, ..., -8.6413450e-03, 1.8475460e-03, -5.1883060e-03], [ 1.0000000e+00, 5.2000000e+01, 2.0000000e+00, ..., -1.3342459e-02, -2.0870400e-04, -1.3670151e-02]]), array([0, 0, 0, ..., 0, 0, 0]), 1, array([0]), array([[0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.], ..., [0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.]]), 100, 1e-06

KangchengHou commented 2 years ago

Seems the tinygwas module is not updated to the latest, could you redo

pip install -U git+https://github.com/bogdanlab/tinygwas.git#egg=tinygwas

and confirm tinygwas-0.2 is installed?

ThiagoPL commented 2 years ago

Hello,

After run this line the methodology worked again. Thanks