cozygene / glint


ModuleNotFoundError: No module named 'argument_parser' #8

Closed AgazW closed 3 years ago

AgazW commented 3 years ago

Hello,

I was trying to run glint to infer ancestry from methylation EPIC chip data but ran into a problem. I am using conda 4.8.3, and the following command fails: python glint.py --datafile data_ancestry.txt --phenofile pheno_ancestry.txt --gsave

Validating all dependencies are installed...
You are now running Anaconda Python
All dependencies are installed
Traceback (most recent call last):
  File "glint.py", line 8, in <module>
    from utils import common
  File "C:\Users\ahwani\Downloads\GLINT_1.0.4\utils\__init__.py", line 1, in <module>
    from argument_parser import GlintArgumentParser
ModuleNotFoundError: No module named 'argument_parser'

I also tried Python 2.7 but ran into many problems installing cvxopt. Is the tool compatible with Python 2.7 only? Any help will be much appreciated.

E-R commented 3 years ago

glint is currently fully compatible with python 2.7 only and it will likely not work under python 3 without the right adjustments, so I would recommend trying to resolve the issue with cvxopt instead. If you haven't done so already, I suggest using a clean conda virtual environment with python 2.7 and then installing cvxopt.
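For context on why the traceback above appears under Python 3: the failing line, from argument_parser import GlintArgumentParser, sits inside the utils package and relies on an implicit relative import, which Python 2 allows and Python 3 does not. The sketch below reproduces this with a throwaway package; the package name utils_demo and the helper try_import are hypothetical and not part of glint.

```python
import importlib
import os
import sys
import tempfile


def try_import(init_line):
    """Create a temporary package `utils_demo` (hypothetical name) whose
    __init__.py contains `init_line`, import it, and report the outcome."""
    root = tempfile.mkdtemp()
    pkg = os.path.join(root, "utils_demo")
    os.mkdir(pkg)
    # A sibling module inside the package, mirroring utils/argument_parser.py.
    with open(os.path.join(pkg, "argument_parser.py"), "w") as f:
        f.write("class GlintArgumentParser(object):\n    pass\n")
    with open(os.path.join(pkg, "__init__.py"), "w") as f:
        f.write(init_line + "\n")
    sys.path.insert(0, root)
    importlib.invalidate_caches() if hasattr(importlib, "invalidate_caches") else None
    try:
        importlib.import_module("utils_demo")
        return "ok"
    except ImportError as exc:
        # ModuleNotFoundError (Python 3.6+) is a subclass of ImportError.
        return type(exc).__name__
    finally:
        sys.path.remove(root)
        # Forget the demo package so the next call starts fresh.
        for name in [n for n in sys.modules
                     if n == "utils_demo" or n.startswith("utils_demo.")]:
            del sys.modules[name]


# Implicit relative import, as in the traceback.
print(try_import("from argument_parser import GlintArgumentParser"))
# Explicit relative import, the form Python 3 expects.
print(try_import("from .argument_parser import GlintArgumentParser"))
```

Under Python 3 the first call reports ModuleNotFoundError and the second succeeds; under Python 2.7 both succeed. Converting every such import would be one of the "adjustments" needed for Python 3, which is why a Python 2.7 environment is the simpler route.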

Regardless, note that for capturing ancestry information you'd need to use the --epi flag. Please take a look at the glint tutorial, which shows a typical analysis pipeline including the use of --epi.

AgazW commented 3 years ago

Thank you very much for your quick response. I was able to run glint in a clean conda virtual environment using Python 2.7. I have a follow-up question about the output, which looks something like this:

# sampleid, Predicted_sex, epi1
204568260042_R02C01 1.0             6.5014296   
204568260042_R03C01 1.0             112.16207   
204568260042_R04C01 1.0             23.491959   
204568260042_R05C01 1.0             22.246256

As I understand it from the documentation, the variable epi1 holds the computed principal component (PC) scores, and these PCs can be used as covariates in downstream analysis. However, I am interested in getting the ancestry of each sample, for example African/European. Is it possible to get ancestry in that form? Or do you know of any resources where I can get that information?

Thanks

E-R commented 3 years ago

If your samples are coming from distinct populations (e.g., AF/EU) then I would expect that the first two Epistructure PCs (epi1, epi2) will reflect that by showing clear clustering into populations (assuming you appropriately adjusted for covariates; see the documentation). However, note that with no external information there will be no clear way of telling which cluster corresponds to which population. For that you can try adding reference samples to your analysis (i.e. methylation from labeled samples coming from the same populations in your study), which should allow you to properly label the clusters, much like what you'd do in genetics; note that you'd probably need to account for an additional covariate in this case, indicating which sample is a reference sample. The case of admixed populations (i.e. you're interested in the % of each population in each of the samples) is more tricky and would require a different model.
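The reference-sample idea can be sketched as follows. This is a minimal illustration and not part of glint: it assumes the labeled reference samples and the study samples were run through --epi together (so they share the same epi1/epi2 space), that the populations separate clearly, and that the relevant covariates were handled as described above. The function names and the toy coordinates are hypothetical.

```python
import math


def centroid(points):
    """Mean (epi1, epi2) position of a list of 2-D points."""
    n = float(len(points))
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def label_by_reference(study, reference):
    """Assign each study sample the label of the nearest reference centroid.

    study:     {sample_id: (epi1, epi2)}
    reference: {population_label: [(epi1, epi2), ...]}
    """
    centroids = {label: centroid(pts) for label, pts in reference.items()}
    labels = {}
    for sample_id, (x, y) in study.items():
        labels[sample_id] = min(
            centroids,
            key=lambda lab: math.hypot(x - centroids[lab][0],
                                       y - centroids[lab][1]),
        )
    return labels


# Toy coordinates for illustration only; these are not real glint output.
reference = {
    "AF": [(1.0, 1.0), (1.2, 0.8)],
    "EU": [(-1.0, -1.0), (-0.9, -1.1)],
}
study = {"s1": (0.9, 1.1), "s2": (-1.2, -0.8)}
print(label_by_reference(study, reference))
```

Nearest-centroid assignment is only one simple choice; it gives no measure of uncertainty and does not address admixed samples.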

AgazW commented 3 years ago

Thanks for this detailed answer. The data I want to check ancestry for is from microglia cells, and I don't have any reference samples for that. I also didn't use covariates (e.g., cell proportions) because the samples are not a mixture of cell types but a single cell type. I agree with you that the task I'm interested in is tricky.

E-R commented 3 years ago

Are you working with samples from distinct or admixed populations? If the former, do you see any clear clustering when you apply --epi and plot epi1 vs epi2? If so, it may be possible to label your samples without reference samples: if there are ancestry informative SNPs (i.e. SNPs that differentiate between populations with high probability, owing to a substantial difference in minor allele frequency) that almost perfectly explain some methylation sites, then you should be able to label the clusters accurately with high probability. SNPs that explain methylation sites this well do exist (see the supplementary files of the Epistructure paper), though I don't know whether any of them are also ancestry informative markers. Note that the Epistructure paper provides a list of methylation sites that are well explained by SNPs in whole-blood data, so in your case you'd need to assume that those links are not specific to blood.

AgazW commented 3 years ago

Thanks for the explanation. My samples are from a single distinct population, but I don't know which one. As I mentioned, the samples are microglia samples that were treated with different hormones in the lab, generating many samples with different conditions.

I looked at the PCs attached here, and there is no clear clustering. I was expecting it to be just one cluster. It may be due to some technical artifacts.

[Attached image: PCs_Plot]

E-R commented 3 years ago

If the samples are coming from a single population and you were expecting a single cluster then there is no apparent reason why you'd want to use Epistructure.

btw, it looks like the PC1, PC2 labels should be switched given that the Y axis demonstrates much larger variability.
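Principal components are conventionally ordered by decreasing variance, so the axis with the larger spread should be PC1. A quick way to check the labeling, sketched with hypothetical names and toy values rather than the scores from the plot:

```python
def variance(values):
    """Population variance of a list of numbers."""
    n = float(len(values))
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def order_by_variance(components):
    """Return component names sorted from largest to smallest variance.

    components: {name: [scores]}
    """
    return sorted(components,
                  key=lambda name: variance(components[name]),
                  reverse=True)


# Toy scores for illustration only; the axis with the larger spread comes first.
scores = {
    "axis_x": [1.0, -1.0, 0.5, -0.5],
    "axis_y": [10.0, -12.0, 7.0, -5.0],
}
print(order_by_variance(scores))
```

If the component plotted on the Y axis comes first in this ordering, it is the one that should carry the PC1 label.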

AgazW commented 3 years ago

The only reason for using Epistructure was to infer the ancestry of the samples. As the samples are from a single population, it seems Epistructure is not made for that. Thank you very much for your responses.