cozygene / glint


Issue generating glint files for downstream analysis #16

Open cewim opened 11 months ago

cewim commented 11 months ago

Hello,

I am attempting to run glint using conda 23.9.0 and Python 2.7.15. My data are from the EPIC chip v2, and the CpGs were annotated using this annotation package for the v2 chip.

I've successfully run glint on a methylation dataset from the EPIC chip v1 using Anaconda2 prompt and Python 2.7. Now, I'm running into issues simply generating the glint files needed for this new analysis.

(Please forgive the hard-coded absolute paths to the Python, glint, and data files; this was the only way I could originally get glint to run.)

The code I'm using is:

C:\Users\MyUser\AppData\Local\anaconda3\envs\python2_glint\python C:\Users\MyUser\glint\glint.py --datafile C:\Users\MyUser\glint\datafileSafe.txt --phenofile C:\Users\MyUser\glint\phenotypes.txt --covarfile C:\Users\MyUser\glint\covariates.txt --gsave

And the output I receive is:

Validating all dependencies are installed...
You are now running Anaconda Python
All dependencies are installed
INFO      >>> python C:\Users\MyUser\glint\glint.py --datafile C:\Users\MyUser\glint\datafileSafe.txt --phenofile C:\Users\MyUser\glint\phenotypes.txt --covarfile C:\Users\MyUser\glint\covariates.txt --gsave
INFO      Starting GLINT...
INFO      Validating arguments...
INFO      Loading file C:\Users\MyUser\glint\datafileSafe.txt...
INFO      Switching to space delimited matrix...
WARNING   C:\Users\MyUser\glint\utils\common.py:121: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  first_col = DataFrame.as_matrix(read_csv(filepath, dtype=str, delimiter=delimiter, usecols=[0], header=None))

WARNING   C:\Users\MyUser\glint\utils\common.py:177: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  data = DataFrame.as_matrix(data)

INFO      Checking for missing values in the data file...
INFO      Validating phenotypes file...
INFO      Loading file C:\Users\MyUser\glint\phenotypes.txt...
INFO      Switching to space delimited matrix...
INFO      New phenotypes were found: Analyte.
INFO      Validating covariates file...
INFO      Loading file C:\Users\MyUser\glint\covariates.txt...
INFO      Switching to space delimited matrix...
INFO      New covariates were found: Sample_Plate, Bas, Bmem, Bnv, CD4mem, CD4nv, CD8mem, CD8nv, Eos, Mono, Neu, NK.
INFO      Loading information about methylation sites...
INFO      Searching for relevant methylation sites information...
Traceback (most recent call last):
  File "C:\Users\MyUser\glint\glint.py", line 312, in <module>
    parser.run()
  File "C:\Users\MyUser\glint\glint.py", line 295, in run
    self.meth_parser.save(output_perfix = prefix) #save after all preprocessing  add epi and refactor covars
  File "C:\Users\MyUser\glint\parsers\methylation_data_parser.py", line 202, in save
    self.module.save_serialized_data(output_perfix)
  File "C:\Users\MyUser\glint\modules\methylation_data.py", line 490, in save_serialized_data
    self.save_sites_and_samples(prefix)
  File "C:\Users\MyUser\glint\modules\methylation_data.py", line 432, in save_sites_and_samples
    sites_info = sitesinfo.SitesInfoGenerator(self.cpgnames)
  File "C:\Users\MyUser\glint\utils\sitesinfo.py", line 55, in __init__
    categories)
  File "C:\Users\MyUser\glint\utils\sitesinfo.py", line 13, in __init__
    self.positions = positions.astype(int)
ValueError: invalid literal for int() with base 10: ''

I am not sure how to interpret this error.

I have done the following:

Could you please help me figure out why this is happening?

I'm a new glint/Python user and would appreciate your patience with me on this issue.

Thank you for your time and for monitoring this page!

E-R commented 11 months ago

The error tells us that at least one probe position was an empty string and therefore it could not be interpreted as an int position. The position information used by glint is extracted from HumanMethylationSites and there are no empty strings there (there are some nan values though). Have you changed the HumanMethylationSites file?
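
To illustrate the failure mode, here is a minimal sketch in plain Python. It assumes only what the traceback shows: a position value of `''` reached an integer conversion. The position values and the helper `find_empty_positions` are hypothetical, not part of glint.

```python
def find_empty_positions(positions):
    """Return the indices of entries that are empty or whitespace-only."""
    return [i for i, p in enumerate(positions) if str(p).strip() == ""]

# Hypothetical position strings; the second entry is empty.
positions = ["100", "", "250"]

try:
    converted = [int(p) for p in positions]
    error_message = None
except ValueError as err:
    # int('') is rejected, which is the same failure reported in the traceback.
    error_message = str(err)

empty_idx = find_empty_positions(positions)
print(error_message)
print(empty_idx)
```

Running a check like this on the positions looked up for your CpGs would show which probes end up with an empty position.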

cewim commented 11 months ago

Understood. No, I have not changed the HumanMethylationSites file at all. And I verified that none of the CpG sites listed in the rownames of my beta matrix were empty strings.

Could the issue be with the annotation file I'm using for the EPIC chip v2? I am not aware of any other annotation packages available for v2.

Edited to add: After removing the suffixes from the CpG sites, there are duplicated CpGs in the row names. However, the error happens regardless of whether the suffixes are attached.
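
For reference, the suffix removal and duplicate check can be sketched as follows. This assumes EPIC v2 probe IDs carry a suffix of the form `_TC21` or `_BC11` (underscore, uppercase letters, digits); the example IDs and the helpers `strip_suffix` and `duplicated_ids` are hypothetical.

```python
import re
from collections import Counter

# Assumption: the v2 suffix is an underscore, uppercase letters, then digits.
SUFFIX = re.compile(r"_[A-Z]+[0-9]+$")

def strip_suffix(probe_id):
    """Drop the trailing suffix, if any, from a probe ID."""
    return SUFFIX.sub("", probe_id)

def duplicated_ids(probe_ids):
    """Return the stripped IDs that occur more than once, sorted."""
    counts = Counter(strip_suffix(p) for p in probe_ids)
    return sorted(cpg for cpg, n in counts.items() if n > 1)

# Hypothetical row names from a beta matrix.
probe_ids = ["cg00000029_TC21", "cg00000109_TC21", "cg00000109_TC22"]
dups = duplicated_ids(probe_ids)
print(dups)
```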

E-R commented 11 months ago

Can you try running the gsave command only on CpGs that are at the intersection between your data and the CpGs with non-nan positions in the HumanMethylationSites file? Does it work? If so, does including a single CpG that is not in HumanMethylationSites raise the same position conversion error?
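
A rough sketch of that intersection, assuming the site names and positions from HumanMethylationSites have already been read into a dictionary. How that file is parsed is not shown here, and the names `site_positions`, `has_valid_position`, and `cpgs_to_keep`, as well as all values, are hypothetical.

```python
import math

def has_valid_position(pos):
    """True if pos is not missing, not empty, and not nan."""
    if pos is None:
        return False
    if isinstance(pos, float):
        return not math.isnan(pos)
    text = str(pos).strip()
    return text != "" and text.lower() != "nan"

def cpgs_to_keep(data_cpgs, site_positions):
    """Keep, in input order, the CpGs annotated with a usable position."""
    return [c for c in data_cpgs if has_valid_position(site_positions.get(c))]

# Hypothetical annotation: one usable position, one nan, one empty string.
site_positions = {"cg00000029": "100", "cg00000108": "nan", "cg00000109": ""}
# Hypothetical data row names; the last one is absent from the annotation.
data_cpgs = ["cg00000029", "cg00000108", "cg00000109", "cg99999999"]

kept = cpgs_to_keep(data_cpgs, site_positions)
print(kept)
```

The rows of the data file would then be subset to `kept` and written to a new file, which is passed to `--datafile` together with `--gsave`. For the second test, add back a single CpG that is missing from the annotation and rerun.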

cewim commented 11 months ago

Thank you for your response. Yes, I will try this and get back to you ASAP.

cewim commented 11 months ago

Thank you for your patience. I subset my data matrix to include only CpGs from the HumanMethylationSites file that did not have "nan" values for UCSC_RefGene_Name and Relation_to_UCSC_CpG_Island. I was able to successfully run glint using this pared-down data file.

I wonder what this indicates to you and also why I am able to use data from the EPIC chip v1 without subsetting to CpGs with non-nan positions in the HumanMethylationSites file.

E-R commented 11 months ago

HumanMethylationSites includes CpGs from EPIC too.