KangchengHou / admix-kit

Toolkit for analyzing genetics data from admixed populations
https://kangchenghou.github.io/admix-kit
22 stars 5 forks source link

Error in assoc-quant #13

Closed nrosewick closed 2 years ago

nrosewick commented 2 years ago

I follow the different steps in order to perform association analysis

I have a .lanc file for mydataset.chr1 generated from admix lanc

admix assoc-quant --pfile data/mydataset.chr1 --pheno samples_pheno.txt --pheno-col PHENO --method ATT --out test.assoc

samples_pheno.txt has the same samples order and has two columns. col1 (indiv) contains sample id and col2 (PHENO) contains phenotype (2 : cases ; 1 : controls )

An error occured :

admix.Dataset: `n_anc` is not provided, infered n_anc from the first 1,000 SNPs is 2. If this is not correct, provide `n_anc` when constructing admix.Dataset
Traceback (most recent call last):
  File "admix", line 392, in <module>
    fire.Fire()
  File "/home/nicolas/anaconda3/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/nicolas/anaconda3/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/nicolas/anaconda3/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "../bin/admix", line 382, in assoc_quant
    dict_rls[m] = admix.assoc.marginal(
  File "/home/nicolas/Documents/admix-kit/admix/assoc/__init__.py", line 260, in marginal
    model = sm.OLS(pheno, design).fit(disp=0)
  File "/home/nicolas/anaconda3/lib/python3.8/site-packages/statsmodels/regression/linear_model.py", line 872, in __init__
    super(OLS, self).__init__(endog, exog, missing=missing,
  File "/home/nicolas/anaconda3/lib/python3.8/site-packages/statsmodels/regression/linear_model.py", line 703, in __init__
    super(WLS, self).__init__(endog, exog, missing=missing,
  File "/home/nicolas/anaconda3/lib/python3.8/site-packages/statsmodels/regression/linear_model.py", line 190, in __init__
    super(RegressionModel, self).__init__(endog, exog, **kwargs)
  File "/home/nicolas/anaconda3/lib/python3.8/site-packages/statsmodels/base/model.py", line 237, in __init__
    super(LikelihoodModel, self).__init__(endog, exog, **kwargs)
  File "/home/nicolas/anaconda3/lib/python3.8/site-packages/statsmodels/base/model.py", line 77, in __init__
    self.data = self._handle_data(endog, exog, missing, hasconst,
  File "/home/nicolas/anaconda3/lib/python3.8/site-packages/statsmodels/base/model.py", line 101, in _handle_data
    data = handle_data(endog, exog, missing, hasconst, **kwargs)
  File "/home/nicolas/anaconda3/lib/python3.8/site-packages/statsmodels/base/data.py", line 672, in handle_data
    return klass(endog, exog=exog, missing=missing, hasconst=hasconst,
  File "/home/nicolas/anaconda3/lib/python3.8/site-packages/statsmodels/base/data.py", line 87, in __init__
    self._handle_constant(hasconst)
  File "/home/nicolas/anaconda3/lib/python3.8/site-packages/statsmodels/base/data.py", line 133, in _handle_constant
    raise MissingDataError('exog contains inf or nans')
statsmodels.tools.sm_exceptions.MissingDataError: exog contains inf or nans

Also when will the .lanc file will be used in the association as we do not provide it to the function ?

Thank you

KangchengHou commented 2 years ago

Thanks for the interest in this software.

On first error, it seems that there is some missing data in genotype or phenotype (which is not properly handled currently). I will work on fixing that.

Also when will the .lanc file will be used in the association as we do not provide it to the function ?

.lanc file will always be read if there is a .lanc, however will only be used when using association testing methods that require local ancestries, such as SNP1, and Tractor.

I will make it clear in the document.

nrosewick commented 2 years ago

So should I first filter the plink file to remove variants with missing genotype (--geno 0 option in plink). This will remove a significant fraction of the variants in my dataset though.

Thank you

nrosewick commented 2 years ago

Also question related to the .lanc file. How will the tool know where the .lanc is ? Maybe adding a --lanc parameter in the assoc-quant tool is a good idea in order to avoid ambiguity.

Thank you for the great work

KangchengHou commented 2 years ago

Yes i agree you shouldn't use --geno 0. I will implement the genotype imputation within the software, by imputing the missing genotype with average for each SNP. I'll try to work on this early next week.

Currently, i assumed the file structure is like: for every genotype file, there are four or more associated files

So local ancestry file must be associated with a [pfile]. This makes sure that the .lanc (which corresponds to a matrix) always match a corresponding genotype file. I believe this makes the files more structured, rather than assuming that users would know the correspondence. What do you think?

nrosewick commented 2 years ago

Is it not possible to perform regression only on the samples without missing genotype ? In general plink dataset are preprocessed to filter out SNP with more than 5 or 10% of samples with missing GT. Imputation may induce errors as GT can be very ancestry dependent.

Nicolas

Le jeu. 2 déc. 2021 à 18:10, Kangcheng Hou @.***> a écrit :

Yes i agree you shouldn't use --geno 0. I will implement the genotype imputation within the software, by imputing the missing genotype with average for each SNP. I'll try to work on this early next week.

Currently, i assumed the file structure is like: for every genotype file, there are four or more associated files

  • [pfile].pgen
  • [pfile].pfam
  • [pfile].pvar
  • [pfile].lanc

So local ancestry file must be associated with a [pfile]. This makes sure that the .lanc (which corresponds to a matrix) always match a corresponding genotype file. I believe this makes the files more structured, rather than assuming that users would know the correspondence. What do you think?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/KangchengHou/admix-kit/issues/13#issuecomment-984826033, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3PHXVCNBP7DOUSLCCV26DUO6R7DANCNFSM5JCDFEEQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

KangchengHou commented 2 years ago

Dear Nicolas,

Thanks for your suggestion. I have updated the assoc-quant to cope with potential missing genotype and phenotype (if there are missing genotype, the corresponding individuals will be dropped).

Since I updated the code a bit, do you mind following the install and Quick start (command line interface) process (i also add the toy covariate file) again and let me know how it works on your data? Thanks!

Kangcheng

nrosewick commented 2 years ago

Thank you for the fast update of your work.

I tried the new code on the toy dataset and I've an error at the assoc-quant step :

Any idea where I messed up ? Did I miss some package maybe ?

Traceback (most recent call last):
  File "../bin/admix", line 440, in <module>
    fire.Fire()
  File "/home/nicolas/anaconda3/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/nicolas/anaconda3/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/nicolas/anaconda3/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "../bin/admix", line 425, in assoc_quant
    dict_rls[m] = admix.assoc.marginal(
  File "/home/nicolas/Documents/erasme/projects/HECAM/admix-kit/admix/assoc/__init__.py", line 234, in marginal
    var[:, 0::3] = allele_per_anc[:, :, 0]
  File "/home/nicolas/anaconda3/lib/python3.8/site-packages/dask/array/core.py", line 1657, in __setitem__
    raise NotImplementedError(
NotImplementedError: Item assignment with <class 'tuple'> not supported
KangchengHou commented 2 years ago

Thanks for your patience. It should be the dask version problem. Could you run pip install -U "dask[array]" and try again? I also updated the requirements.txt file

nrosewick commented 2 years ago

Thank you. admix-kit is now working well :-)

I will process my whole dataset and keep you inform if I have some questions.

Thank you again for the great work :+1:

ThiagoPL commented 2 years ago

Thanks for the interest in this software.

On first error, it seems that there is some missing data in genotype or phenotype (which is not properly handled currently). I will work on fixing that.

Also when will the .lanc file will be used in the association as we do not provide it to the function ?

.lanc file will always be read if there is a .lanc, however will only be used when using association testing methods that require local ancestries, such as SNP1, and Tractor.

I will make it clear in the document.

  • [x] missing data in genotype or phenotype in association testing
  • [ ] add binary trait association, detect if case-control phenotype is detected
  • [ ] document use of .lanc in association testing

Hello, My name is Thiago Peixoto Leal and I work with Parkinson Disease in Latin American populations. I would like to know if you have a forecast of when the package will have its version for binary traits. Thank you and congratulations for the good work.

KangchengHou commented 2 years ago

Dear Thiago,

The binary trait association has been implemented, however not well tested, and i should be able to have an update in a month. I would recommend two possible routes for now:

  1. Use the current admix assoc with quantative traits mode, this will give you valid results, as in large sample size, using either logistic regression or linear regression will roughly give you similar results.
  2. As our paper https://www.nature.com/articles/s41588-021-00953-5 show that ATT, or methods do not model ancestry-specific effects are likely to be most powerful. You can perform association testing with plink --glm https://www.cog-genomics.org/plink/2.0/assoc (which is also super fast)
ThiagoPL commented 2 years ago

Dear Thiago,

The binary trait association has been implemented, however not well tested, and i should be able to have an update in a month. I would recommend two possible routes for now:

  1. Use the current admix assoc with quantative traits mode, this will give you valid results, as in large sample size, using either logistic regression or linear regression will roughly give you similar results.
  2. As our paper https://www.nature.com/articles/s41588-021-00953-5 show that ATT, or methods do not model ancestry-specific effects are likely to be most powerful. You can perform association testing with plink --glm https://www.cog-genomics.org/plink/2.0/assoc (which is also super fast)

Hello,

First of all, thank you very much for the fast answer.

I am interest on Ancestry Based methods because I work with a 3-way admixed population. In our previous work we used AM (https://onlinelibrary.wiley.com/doi/full/10.1002/ana.26153) and found new regions, but we did not replicate because we do not have good replication cohorts for AM. I also implemented a TRACTOR (using Python and R) and the results was really better than a regular GWAS (new regions, better p-values, etc).

Unfortunately our N is relatively small (1.5 k individuals) so I will probably wait for the implementation of Logistic methods on Admixt-kit. Just a suggestion: put the pheno and covar files in the test folder because there is none of these files there.

Again thank you very much.

Best,

Thiago Peixoto Leal

KangchengHou commented 2 years ago

Your results are indeed very interesting. And thanks for your suggestion. I will follow up soon on this.

KangchengHou commented 2 years ago

Hi @ThiagoPL

Please clone the latest code and reinstall as follows:

git clone https://github.com/KangchengHou/admix-kit
cd admix-kit
pip install -r requirements.txt 
pip install -U git+https://github.com/bogdanlab/tinygwas.git#egg=tinygwas
pip install -e .

for a relatively stable version of code (including logistic regression). And see https://kangchenghou.github.io/admix-kit/cli.html#association-testing for details of the parameters.

Feel free to let me know if you have questions or suggestions. I am happy to add more features to make the software easy and robust to use.

ThiagoPL commented 2 years ago

Hi @ThiagoPL

Please clone the latest code and reinstall as follows:

git clone https://github.com/KangchengHou/admix-kit
cd admix-kit
pip install -r requirements.txt 
pip install -U git+https://github.com/bogdanlab/tinygwas.git#egg=tinygwas
pip install -e .

for a relatively stable version of code (including logistic regression). And see https://kangchenghou.github.io/admix-kit/cli.html#association-testing for details of the parameters.

Feel free to let me know if you have questions or suggestions. I am happy to add more features to make the software easy and robust to use.

Hello @KangchengHou .

Thank you very much for the help. I will test here and I text you with the feedback.