NicoleYY77 commented 8 months ago

Hi, I'm trying to use the public data as mentioned in the supplementary file: H3K27ac-ChIP: ENCFF384ZZM and DHS: wgEncodeUwDnaseK562AlnRep1.bam, wgEncodeUwDnaseK562AlnRep2.bam. And I also change the reference file to hg19 before running snakemake -j1.

Feature H3K27ac completed but I got error with DHS: Generating /lustre03/project/6007512/nicoley/EPI_prediction/ABCmodel/ABC-Enhancer-Gene-Prediction/results/K562_hg19/Neighborhoods/Genes.DHS.wgEncodeUwDnaseK562AlnRep1.bam.CountReads.bedgraph Counting coverage for Genes.DHS.wgEncodeUwDnaseK562AlnRep1.bam Running command: awk 'BEGIN {FS=OFS="\t"} (substr($1, length($1)) == "X" || substr($1, length($1)) == "Y") { $4 *= 2 } 1' /lustre03/project/6007512/nicoley/EPI_prediction/ABCmodel/ABC-Enhancer-Gene-Prediction/results/K562_hg19/Neighborhoods/Genes.DHS.wgEncodeUwDnaseK562AlnRep1.bam.CountReads.bedgraph > /lustre03/project/6007512/nicoley/EPI_prediction/ABCmodel/ABC-Enhancer-Gene-Prediction/results/K562_hg19/Neighborhoods/Genes.DHS.wgEncodeUwDnaseK562AlnRep1.bam.CountReads.bedgraph.tmp && mv /lustre03/project/6007512/nicoley/EPI_prediction/ABCmodel/ABC-Enhancer-Gene-Prediction/results/K562_hg19/Neighborhoods/Genes.DHS.wgEncodeUwDnaseK562AlnRep1.bam.CountReads.bedgraph.tmp /lustre03/project/6007512/nicoley/EPI_prediction/ABCmodel/ABC-Enhancer-Gene-Prediction/results/K562_hg19/Neighborhoods/Genes.DHS.wgEncodeUwDnaseK562AlnRep1.bam.CountReads.bedgraph Traceback (most recent call last): File "/lustre03/project/6007512/nicoley/EPI_prediction/ABCmodel/ABC-Enhancer-Gene-Prediction/workflow/scripts/run.neighborhoods.py", line 209, in <module> main(args) File "/lustre03/project/6007512/nicoley/EPI_prediction/ABCmodel/ABC-Enhancer-Gene-Prediction/workflow/scripts/run.neighborhoods.py", line 204, in main processCellType(args) File "/lustre03/project/6007512/nicoley/EPI_prediction/ABCmodel/ABC-Enhancer-Gene-Prediction/workflow/scripts/run.neighborhoods.py", line 171, in processCellType annotate_genes_with_features( File "/lustre03/project/6007512/nicoley/EPI_prediction/ABCmodel/ABC-Enhancer-Gene-Prediction/workflow/scripts/neighborhoods.py", line 126, in annotate_genes_with_features genes = count_features_for_bed( ^^^^^^^^^^^^^^^^^^^^^^^ File "/lustre03/project/6007512/nicoley/EPI_prediction/ABCmodel/ABC-Enhancer-Gene-Prediction/workflow/scripts/neighborhoods.py", line 512, in count_features_for_bed df = count_single_feature_for_bed( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/lustre03/project/6007512/nicoley/EPI_prediction/ABCmodel/ABC-Enhancer-Gene-Prediction/workflow/scripts/neighborhoods.py", line 567, in count_single_feature_for_bed total_counts = count_total(feature_bam) ^^^^^^^^^^^^^^^^^^^^^^^^ File "/lustre03/project/6007512/nicoley/EPI_prediction/ABCmodel/ABC-Enhancer-Gene-Prediction/workflow/scripts/neighborhoods.py", line 703, in count_total total_counts = count_bam_mapped(infile) ^^^^^^^^^^^^^^^^^^^^^^^^ File "/lustre03/project/6007512/nicoley/EPI_prediction/ABCmodel/ABC-Enhancer-Gene-Prediction/workflow/scripts/neighborhoods.py", line 670, in count_bam_mapped raise ValueError("Error counting BAM file: count <= 0") ValueError: Error counting BAM file: count <= 0

I checked macs2_peaks.narrowPeak.sorted.averageAccessibility.Counts.bed file and it does have lots of 0 counts, I'm confused that why this happen? the examples run successfully, did I miss any steps? and how should I fix this problem? Really appreciate for any help!!

atancoder commented 7 months ago

can you share your entire config file? One thing I can't tell is if you had changed the genome size parameter mentioned here: https://abc-enhancer-gene-prediction.readthedocs.io/en/latest/usage/getting_started.html#genome-builds

NicoleYY77 commented 7 months ago

can you share your entire config file? One thing I can't tell is if you had changed the genome size parameter mentioned here: https://abc-enhancer-gene-prediction.readthedocs.io/en/latest/usage/getting_started.html#genome-builds

Thank you! Here's my entire config file:

`# working directory is ABC-Enhancer-Gene-Prediction

INPUT DATA

Add your inputs here

biosamplesTable: "config/config_K562.tsv" # replace with your own config/config_biosamples.tsv

OUTPUT DATA

results_dir: "results/"

REFERENCE FILES

ref: chrom_sizes: "reference/hg19/chr_sizes" regions_blocklist: "reference/hg19/wgEncodeHg19ConsensusSignalArtifactRegions.bed" ubiquitous_genes: "reference/UbiquitouslyExpressedGenes.txt" genes: "reference/hg19/hg19_ensembl.bed" genome_tss: "reference/hg19/hg19_ensembl_TSS.bed" qnorm: "reference/EnhancersQNormRef.K562.txt" abc_thresholds: "reference/abc_thresholds.tsv"

RULE SPECIFIC PARAMS

params_macs: pval: 0.1 genome_size: hs # shortcuts: hs: human ; mm: mice threads: 10

These parameters are used to run the Run Candidate Regions portion of the abc code

params_candidate: peakExtendFromSummit: 250 nStrongestPeaks: 150000

These parameters are used to run the Run Neighborhoods portion of the abc code

params_neighborhoods: use_qnorm: True

These parameters are used to run the Predictions portion of the abc code

params_predict: flags: "--scale_hic_using_powerlaw" hic_gamma: 1.024238616787792 # avg hic gamma hic_scale: 5.9594510043736655 # avg hic scale hic_pseudocount_distance: 5000 # powerlaw at this distance is added to the contact for all predictions

params_filter_predictions: score_column: 'ABC.Score' threshold: null # null => Automatic determination based on input include_self_promoter: True only_expressed_genes: False

INTERNAL USE ONLY

ABS path to prepend to reference and script files

Only relevant when using ABC as a submodule

ABC_DIR_PATH: ""`

The genome size parameter for hg19 is different?

atancoder commented 7 months ago

Genome size should be fine. The repo wasn't as well supported for hg19, so some of the reference files were missing (i noticed you generated the ensemble versions of the gene TSS). I've updated the reference files for hg19 in the latest release. The 0 counts looks relevant. Not sure why you're getting that but from our end, i was able to run ABC with the following DNase seq on hg19 without any issues: https://www.encodeproject.org/files/ENCFF001DOX/.

NicoleYY77 commented 7 months ago

Genome size should be fine. The repo wasn't as well supported for hg19, so some of the reference files were missing (i noticed you generated the ensemble versions of the gene TSS). I've updated the reference files for hg19 in the latest release. The 0 counts looks relevant. Not sure why you're getting that but from our end, i was able to run ABC with the following DNase seq on hg19 without any issues: https://www.encodeproject.org/files/ENCFF001DOX/.

Thanks for the update! I didn't figure out why the 0 counts happened, but I merged the 2replicates of the Dnase-seq files (wgEncodeUwDnaseK562AlnRep1 &2.bam, as shown in supplementary table4) by samtools and that fixed the problem. However, after running all the code, the candidate enhancers are 151,894, not the same as 162,181 in the paper... So I'm not sure whether merging the replicate bam files is correct...

I also tried running the following Benchmark analysis following this: https://github.com/EngreitzLab/CRISPR_comparison/blob/main/README.md, where I used the data from Supplementary table6a as the experimental dataset for comparison with the prediction, but couldn't replicate the results in the paper... Did I miss any steps here or understand anything incorrectly?

atancoder commented 7 months ago

If your goal is to replicate the results exactly, you'd have to use the older version of the repo: https://github.com/broadinstitute/ABC-Enhancer-Gene-Prediction/tree/master. Just beware that you'll have to debug any issues on your own as we no longer support that version.

NicoleYY77 commented 7 months ago

If your goal is to replicate the results exactly, you'd have to use the older version of the repo: https://github.com/broadinstitute/ABC-Enhancer-Gene-Prediction/tree/master. Just beware that you'll have to debug any issues on your own as we no longer support that version.

Thank you!

broadinstitute / ABC-Enhancer-Gene-Prediction

ValueError("Error counting BAM file: count <= 0") ValueError: Error counting BAM file: count <= 0 when trying to replicate the ABC score results of K562 #207

INPUT DATA

Add your inputs here

OUTPUT DATA

REFERENCE FILES

RULE SPECIFIC PARAMS

These parameters are used to run the Run Candidate Regions portion of the abc code

These parameters are used to run the Run Neighborhoods portion of the abc code

These parameters are used to run the Predictions portion of the abc code

INTERNAL USE ONLY

ABS path to prepend to reference and script files

Only relevant when using ABC as a submodule