bulik / ldsc

LD Score Regression (LDSC)
GNU General Public License v3.0
617 stars 335 forks source link

Error parsing .annot file #435

Open reshu23 opened 2 months ago

reshu23 commented 2 months ago

Dear Team, I am calculating Partition heritability , but I am getting error of "parsing annotation error". Can you please help me with this? Here is my command: GWAS trait: PASS.Bipolar_Disorder_ALL.Mullins2021


Beginning analysis at Wed May 29 08:25:24 2024 Reading summary statistics from Database_unzip/sumstats_107/PASS.Bipolar_Disorder_ALL.Mullins2021.sumstats.gz ... Read summary statistics for 1169835 SNPs. Reading reference panel LD Score from Database_unzip/1000G_Phase3_baselineLD_v2.2_ldscores/baselineLD.,hg19_C1_Undiff/hg19_C1_Undiff.[1-22] ... (ldscore_fromlist) Read reference panel LD Scores for 1190321 SNPs. Removing partitioned LD Scores with zero variance. Reading regression weight LD Score from Database_unzip/1000G_Phase3_weights_hm3_no_MHC/weights.hm3_noMHC.[1-22] ... (ldscore_fromlist) Read regression weight LD Scores for 1187349 SNPs. After merging with reference panel LD, 1169835 SNPs remain. After merging with regression SNP LD, 1167015 SNPs remain. Removed 0 SNPs with chi^2 > 413.466 (1167015 SNPs remain) Printing covariance matrix of the estimates to EnrichmentResult/PASS.Bipolar_Disorder_ALL.Mullins2021/baselineLDv2.2/PASS.Bipolar_Disorder_ALL.Mullins2021_hg19_C1_Undiff_baselineLDv2.2.cov. Printing block jackknife delete values to EnrichmentResult/PASS.Bipolar_Disorder_ALL.Mullins2021/baselineLDv2.2/PASS.Bipolar_Disorder_ALL.Mullins2021_hg19_C1_Undiff_baselineLDv2.2.delete. Printing partitioned block jackknife delete values to EnrichmentResult/PASS.Bipolar_Disorder_ALL.Mullins2021/baselineLDv2.2/PASS.Bipolar_Disorder_ALL.Mullins2021_hg19_C1_Undiff_baselineLDv2.2.part_delete. Total Observed scale h2: 0.0706 (0.003) Categories: baseL2_0 Coding_UCSCL2_0 Coding_UCSC.flanking.500L2_0 Conserved_LindbladTohL2_0 Conserved_LindbladToh.flanking.500L2_0 CTCF_HoffmanL2_0 CTCF_Hoffman.flanking.500L2_0 DGF_ENCODEL2_0 DGF_ENCODE.flanking.500L2_0 DHS_peaks_TrynkaL2_0 DHS_TrynkaL2_0 DHS_Trynka.flanking.500L2_0 Enhancer_AnderssonL2_0 Enhancer_Andersson.flanking.500L2_0 Enhancer_HoffmanL2_0 Enhancer_Hoffman.flanking.500L2_0 FetalDHS_TrynkaL2_0 FetalDHS_Trynka.flanking.500L2_0 H3K27ac_HniszL2_0 H3K27ac_Hnisz.flanking.500L2_0 H3K27ac_PGC2L2_0 H3K27ac_PGC2.flanking.500L2_0 H3K4me1_peaks_TrynkaL2_0 H3K4me1_TrynkaL2_0 H3K4me1_Trynka.flanking.500L2_0 H3K4me3_peaks_TrynkaL2_0 H3K4me3_TrynkaL2_0 H3K4me3_Trynka.flanking.500L2_0 H3K9ac_peaks_TrynkaL2_0 H3K9ac_TrynkaL2_0 H3K9ac_Trynka.flanking.500L2_0 Intron_UCSCL2_0 Intron_UCSC.flanking.500L2_0 PromoterFlanking_HoffmanL2_0 PromoterFlanking_Hoffman.flanking.500L2_0 Promoter_UCSCL2_0 Promoter_UCSC.flanking.500L2_0 Repressed_HoffmanL2_0 Repressed_Hoffman.flanking.500L2_0 SuperEnhancer_HniszL2_0 SuperEnhancer_Hnisz.flanking.500L2_0 TFBS_ENCODEL2_0 TFBS_ENCODE.flanking.500L2_0 Transcr_HoffmanL2_0 Transcr_Hoffman.flanking.500L2_0 TSS_HoffmanL2_0 TSS_Hoffman.flanking.500L2_0 UTR_3_UCSCL2_0 UTR_3_UCSC.flanking.500L2_0 UTR_5_UCSCL2_0 UTR_5_UCSC.flanking.500L2_0 WeakEnhancer_HoffmanL2_0 WeakEnhancer_Hoffman.flanking.500L2_0 GERP.NSL2_0 GERP.RSsup4L2_0 MAFbin1L2_0 MAFbin2L2_0 MAFbin3L2_0 MAFbin4L2_0 MAFbin5L2_0 MAFbin6L2_0 MAFbin7L2_0 MAFbin8L2_0 MAFbin9L2_0 MAFbin10L2_0 MAF_Adj_Predicted_Allele_AgeL2_0 MAF_Adj_LLD_AFRL2_0 Recomb_Rate_10kbL2_0 Nucleotide_Diversity_10kbL2_0 Backgrd_Selection_StatL2_0 CpG_Content_50kbL2_0 MAF_Adj_ASMCL2_0 GTEx_eQTL_MaxCPPL2_0 BLUEPRINT_H3K27acQTL_MaxCPPL2_0 BLUEPRINT_H3K4me1QTL_MaxCPPL2_0 BLUEPRINT_DNA_methylation_MaxCPPL2_0 synonymousL2_0 non_synonymousL2_0 Conserved_Vertebrate_phastCons46wayL2_0 Conserved_Vertebrate_phastCons46way.flanking.500L2_0 Conserved_Mammal_phastCons46wayL2_0 Conserved_Mammal_phastCons46way.flanking.500L2_0 Conserved_Primate_phastCons46wayL2_0 Conserved_Primate_phastCons46way.flanking.500L2_0 BivFlnkL2_0 BivFlnk.flanking.500L2_0 Human_Promoter_VillarL2_0 Human_Promoter_Villar.flanking.500L2_0 Human_Enhancer_VillarL2_0 Human_Enhancer_Villar.flanking.500L2_0 Ancient_Sequence_Age_Human_PromoterL2_0 Ancient_Sequence_Age_Human_Promoter.flanking.500L2_0 Ancient_Sequence_Age_Human_EnhancerL2_0 Ancient_Sequence_Age_Human_Enhancer.flanking.500L2_0 Human_Enhancer_Villar_Species_Enhancer_CountL2_0 Human_Promoter_Villar_ExACL2_0 Human_Promoter_Villar_ExAC.flanking.500L2_0 L2_1 Lambda GC: 1.4459 Mean Chi^2: 1.5878 Intercept: 1.0393 (0.0102) Ratio: 0.0668 (0.0173) Reading annot matrix from Database_unzip/1000G_Phase3_baselineLD_v2.2_ldscores/baselineLD.,hg19_C1_Undiff/hg19_C1_Undiff.[1-22] ... (annot) Error parsing .annot file. Analysis finished at Wed May 29 08:25:54 2024 Total time elapsed: 30.1s Traceback (most recent call last): File "/scratch/prj/working/Projects/ATACseq/LDSC/python3_ldsc/./ldsc.py", line 645, in sumstats.estimate_h2(args, log) File "/scratch/prj/working/Projects/ATACseq/LDSC/python3_ldsc/ldscore/sumstats.py", line 369, in estimate_h2 overlap_matrix, M_tot = _read_annot(args, log) ^^^^^^^^^^^^^^^^^^^^^^ File "/scratch/prj/working/Projects/ATACseq/LDSC/python3_ldsc/ldscore/sumstats.py", line 95, in _read_annot overlap_matrix, M_tot = _read_chr_split_files(args.ref_ld_chr, args.ref_ld, log, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/scratch/prj/working/Projects/ATACseq/LDSC/python3_ldsc/ldscore/sumstats.py", line 152, in _read_chr_split_files out = parsefunc(_splitp(chr_arg), _N_CHR, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/scratch/prj/working/Projects/ATACseq/LDSC/python3_ldsc/ldscore/parse.py", line 190, in annot chrs = get_present_chrs(fh, num+1) ^^ UnboundLocalError: cannot access local variable 'fh' where it is not associated with a value

aksarkar commented 2 months ago

@reshu23 It means that either ldsc cannot find the files Database_unzip/1000G_Phase3_baselineLD_v2.2_ldscores/hg19_C1_Undiff/hg19_C1_Undiff.1.annot through Database_unzip/1000G_Phase3_baselineLD_v2.2_ldscores/hg19_C1_Undiff/hg19_C1_Undiff.22.annot or the files Database_unzip/1000G_Phase3_baselineLD_v2.2_ldscores/baselineLD.1.annot through Database_unzip/1000G_Phase3_baselineLD_v2.2_ldscores/baselineLD.22.annot.

When you specify --overlap-annot, you must make the annotation files available to read.

reshu23 commented 2 months ago

Thank you. But .annot files are present in the same directory of ldscore: Here is my command: As you can see parameter: --ref-ld-chr Database_unzip/1000G_Phase3_baselineLD_v2.2_ldscores/baselineLD.,hg19_C1_Undiff/hg19_C1_Undiff.

These 2 folders contain all LD scores as well as annot files in gz format. I have just provided prefix name by assuming that, ldsc will automatically append .annot along with chromosome number and look for files in the directory. This is the problem, I am unable to understand.

I do not think, I need to use --annot parameter for heritability calculation ./ldsc.py --out EnrichmentResult/PASS.Bipolar_Disorder_ALL.Mullins2021/baselineLDv2.2/PASS.Bipolar_Disorder_ALL.Mullins2021_hg19_C1_Undiff_baselineLDv2.2 --h2 Database_unzip/sumstats_107/PASS.Bipolar_Disorder_ALL.Mullins2021.sumstats.gz --ref-ld-chr Database_unzip/1000G_Phase3_baselineLD_v2.2_ldscores/baselineLD.,hg19_C1_Undiff/hg19_C1_Undiff. --w-ld-chr Database_unzip/1000G_Phase3_weights_hm3_no_MHC/weights.hm3_noMHC. --overlap-annot --print-coefficients --frqfile-chr Database_unzip/1000G_Phase3_frq/1000G.EUR.QC. --print-cov --print-delete-vals

aksarkar commented 2 months ago

The code appends the chromosome number followed by .annot.gz to the prefix specified on the command line.

I suspect the trailing . in --ref-ld-chr Database_unzip/1000G_Phase3_baselineLD_v2.2_ldscores/baselineLD.,hg19_C1_Undiff/hg19_C1_Undiff. is the issue.

reshu23 commented 2 months ago

Thank you. I removed trailing character, but as I expected, it could not read ldscore files(error). As it does not include dot with chromosome number in prefix of filename . So, I do not think, trailing character is problem. May be some problem with _splitp() function. in sumstats.py

Reading reference panel LD Score from Database_unzip/1000G_Phase3_baselineLD_v2.2_ldscores/baselineLD,hg19_C1_Undiff/hg19_C1_Undiff[1-22] ... (ldscore_fromlist) Analysis finished at Fri May 31 05:34:34 2024 Total time elapsed: 1.33s Traceback (most recent call last): File "/scratch/prj/working/Projects/ATACseq/LDSC/python3_ldsc/./ldsc.py", line 645, in sumstats.estimate_h2(args, log) File "/scratch/prj/working/Projects/ATACseq/LDSC/python3_ldsc/ldscore/sumstats.py", line 325, in estimate_h2 M_annot, w_ld_cname, ref_ld_cnames, sumstats, novar_cols = _read_ld_sumstats( ^^^^^^^^^^^^^^^^^^ File "/scratch/prj/working/Projects/ATACseq/LDSC/python3_ldsc/ldscore/sumstats.py", line 243, in _read_ld_sumstats ref_ld = _read_ref_ld(args, log)

reshu23 commented 2 months ago

I am talking about following section in sumstats.py def _splitp(fstr): flist = fstr.split(',') flist = [os.path.expanduser(os.path.expandvars(x)) for x in flist] return flist ############################## def _read_annot(args, log): '''Read annot matrix.''' try: if args.ref_ld is not None: overlap_matrix, M_tot = _read_chr_split_files(args.ref_ld_chr, args.ref_ld, log, 'annot matrix', ps.annot, frqfile=args.frqfile) elif args.ref_ld_chr is not None: overlap_matrix, M_tot = _read_chr_split_files(args.ref_ld_chr, args.ref_ld, log, 'annot matrix', ps.annot, frqfile=args.frqfile_chr) except Exception: log.log('Error parsing .annot file.') raise

return overlap_matrix, M_tot

################################### def _read_chr_split_files(chr_arg, not_chr_arg, log, noun, parsefunc, kwargs): '''Read files split across 22 chromosomes (annot, ref_ld, w_ld).''' try: if not_chr_arg: log.log('Reading {N} from {F} ... ({p})'.format(N=noun, F=not_chr_arg, p=parsefunc.name)) out = parsefunc(_splitp(not_chr_arg), kwargs) elif chr_arg: f = ps.sub_chr(chr_arg, '[1-22]') log.log('Reading {N} from {F} ... ({p})'.format(N=noun, F=f, p=parsefunc.name)) out = parsefunc(_splitp(chr_arg), _N_CHR, **kwargs)

aksarkar commented 1 month ago

This is a bug in the code since Python changed the scoping of variables bound in list comprehensions

https://github.com/bulik/ldsc/blob/master/ldscore/parse.py#L190

You will need to modify the source code and replace fh with fh_list[0]

reshu23 commented 1 month ago

Thank you. I have changed code: line number 190 parse.py chrs = get_present_chrs(fh_list[0], num+1)

It is working now.Problem is solved.