broadinstitute / ABC-Enhancer-Gene-Prediction

Cell type specific enhancer-gene predictions using ABC model (Fulco, Nasser et al, Nature Genetics 2019)
MIT License
203 stars 62 forks source link

Fast count method failed to count: ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam #25

Closed neekonsu closed 4 years ago

neekonsu commented 4 years ago

Hi there!

I am working on a pipeline wrapper for this repository at https://github.com/neekonsu/alphabet in order to facilitate the use of this analytical tool. I am currently running into some errors trying to run the prescribed commands from the readme, and I would be hugely appreciative of any help sorting them out. The following is the printout when running the pipeline on the example data herein. I have omitted the MACS2 output since it seems to be functioning properly. If you would like to try this out for yourselves, I have dockerized my testing environment, and it is located at https://hub.docker.com/repository/docker/neekonsu/alphabet/general if you would like to try out the pipeline.

In order to replicate my results, open the container interactively and type ./alphabet/run.sh into the terminal. Just type "default" or click enter for each of the 12 prompts until MACS2 begins, and good luck!

Thanks,

Neekon Saadat

chr22   51188356        51188653        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10529       14      .       1.54202 1.44043 0.28931 92
chr22   51189078        51189375        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10530       13      .       1.50750 1.36602 0.28931 42
chr22   51194793        51195090        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10531       15      .       1.90576 1.59073 0.37010 267
chr22   51195166        51196145        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10532       86      .       5.71728 8.62799 6.67055 437
chr22   51198355        51198930        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10533       46      .       3.81152 4.69934 2.94552 203
chr22   51201722        51202026        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10534       19      .       2.07541 1.97506 0.61642 151
chr22   51202508        51202805        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10535       11      .       1.41263 1.18152 0.28931 148
chr22   51204400        51204697        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10536       20      .       1.74140 2.00030 0.63284 148
chr22   51209309        51209606        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10537       11      .       1.41263 1.18152 0.28931 5
chr22   51209909        51210206        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10538       11      .       1.41263 1.18152 0.28931 148
chr22   51210440        51211000        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10539       20      .       2.11894 2.05549 0.66806 279
chr22   51211395        51211692        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10540       11      .       1.41263 1.18152 0.28931 148
chr22   51213441        51214600        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10541       51      .       4.15081 5.12941 3.34652 376
chr22   51221767        51222521        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10542       216     .       9.50236 21.64822        19.31868        413
chr22   51225284        51225825        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10543       15      .       1.85673 1.53513 0.34175 205
chr22   51230584        51234650        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10544       24      .       1.83638 2.42698 0.94401 3231
chr22   51239353        51239680        wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peak_10545       96      .       6.46412 9.69035 7.69163 164

CondaValueError: prefix already exists: /root/anaconda/envs/environment

# conda environments:
#
base                  *  /root/anaconda
environment              /root/anaconda/envs/environment

Running: awk 'FNR==NR {x2[$1] = $0; next} $1 in x2 {print x2[$1]}' ./example_chr22/reference/chr22 <(samtools view -H ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam | grep SQ | cut -f 2 | cut -c 4- )  > ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed.temp_sort_order
Running: bedtools sort -faidx ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed.temp_sort_order -i ./example_chr22/ABC_output/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted | bedtools coverage -g ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed.temp_sort_order -counts -sorted -a stdin -b ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam | awk '{print $1 "\t" $2 "\t" $3 "\t" $NF}'  | bedtools sort -faidx ./example_chr22/reference/chr22 -i stdin > ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed; rm ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed.temp_sort_order
Fast count method failed to count: ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam

Error: The requested file (./example_chr22/ABC_output/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted) could not be opened. Error message: (No such file or directory). Exiting!

Trying bamtobed method ...

Running: bedtools bamtobed -i ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam | cut -f 1-3 | bedtools intersect -wa -a stdin -b ./example_chr22/reference/chr22.bed | bedtools sort -i stdin -faidx ./example_chr22/reference/chr22 | bedtools coverage -g ./example_chr22/reference/chr22 -counts -sorted -a ./example_chr22/ABC_output/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted -b stdin | awk '{print $1 "\t" $2 "\t" $3 "\t" $NF}' > ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed
No columns to parse from file
b'Error: Unable to open file ./example_chr22/ABC_output/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted. Exiting.\n'
BEDTools failed to count file: ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam

Error: Unable to open file ./example_chr22/ABC_output/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted. Exiting.

Running: bedtools sort -i ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed -faidx ./example_chr22/reference/chr22 | bedtools merge -i stdin -c 4 -o max | sort -nr -k 4 | head -n 3000 |bedtools intersect -b stdin -a ./example_chr22/ABC_output/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted -wa |awk '{print $1 "\t" $2 + $10 "\t" $2 + $10}' |bedtools slop -i stdin -b 250 -g ./example_chr22/reference/chr22 |bedtools sort -i stdin -faidx ./example_chr22/reference/chr22 |bedtools merge -i stdin | bedtools intersect -v -wa -a stdin -b ./src/../reference/wgEncodeHg19ConsensusSignalArtifactRegions.bed | cut -f 1-3 | (bedtools intersect -a ./example_chr22/reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.chr22.bed -b ./example_chr22/reference/chr22.bed -wa | cut -f 1-3 && cat) |bedtools sort -i stdin -faidx ./example_chr22/reference/chr22 | bedtools merge -i stdin > ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.candidateRegions.bed
. . . Moving on to next stage, please wait . . . 
Namespace(ATAC='', DHS='dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.chr22.bam,dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.chr22.bam', H3K27ac='dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Chromatin/ENCFF384ZZM.chr22.bam', candidate_enhancer_regions='./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.candidateRegions.bed', cellType='K562', chrom_sizes='./example_chr22/reference/chr22', default_accessibility_feature=None, enhancer_class_override=None, expression_table='dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Expression/K562.ENCFF934YBO.TPM.txt', gene_name_annotations='symbol', genes='./example_chr22/reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.chr22.bed', genes_for_class_assignment=None, outdir='./example_chr22/ABC_output/Neighborhoods/', primary_gene_identifier='symbol', qnorm=None, skip_gene_counts=False, skip_rpkm_quantile=False, supplementary_features=None, tss_slop_for_class_assignment=500, ubiquitously_expressed_genes='./src/../reference/UbiquitouslyExpressedGenesHG19.txt', use_secondary_counting_method=False)
Using gene expression from files: ['dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Expression/K562.ENCFF934YBO.TPM.txt'] 

[Errno 2] File dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Expression/K562.ENCFF934YBO.TPM.txt does not exist: 'dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Expression/K562.ENCFF934YBO.TPM.txt'
Traceback (most recent call last):
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 39, in load_genes
    expr = pd.read_table(expression_table, names=[primary_id, name + '.Expression'])
  File "/root/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/root/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/root/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "/root/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/root/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 1891, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 374, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 673, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Expression/K562.ENCFF934YBO.TPM.txt does not exist: 'dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Expression/K562.ENCFF934YBO.TPM.txt'
Failed on dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Expression/K562.ENCFF934YBO.TPM.txt
Running command: bedtools sort -faidx ./example_chr22/reference/chr22 -i ./example_chr22/ABC_output/Neighborhoods/GeneList.TSS1kb.bed > ./example_chr22/ABC_output/Neighborhoods/GeneList.TSS1kb.bed.sorted; mv ./example_chr22/ABC_output/Neighborhoods/GeneList.TSS1kb.bed.sorted ./example_chr22/ABC_output/Neighborhoods/GeneList.TSS1kb.bed
Loading coverage from pre-calculated file for Genes.H3K27ac.ENCFF384ZZM.chr22.bam
Usage: samtools idxstats [options] <in.bam>
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -@, --threads INT
               Number of additional threads to use [0]
Traceback (most recent call last):
  File "./src/run.neighborhoods.py", line 97, in <module>
    main(args)
  File "./src/run.neighborhoods.py", line 93, in main
    processCellType(args)
  File "./src/run.neighborhoods.py", line 74, in processCellType
    outdir = args.outdir)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 89, in annotate_genes_with_features
    genes = count_features_for_bed(genes, bounds_bed, genome_sizes, features, outdir, "Genes", force=force, use_fast_count=use_fast_count)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 415, in count_features_for_bed
    df = count_single_feature_for_bed(df, bed_file, genome_sizes, feature_bam, feature, directory, filebase, skip_rpkm_quantile, force, use_fast_count)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 438, in count_single_feature_for_bed
    total_counts = count_total(feature_bam)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 528, in count_total
    total_counts = count_bam_mapped(infile)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 503, in count_bam_mapped
    data = check_output(command, shell=True)
  File "/root/anaconda/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/root/anaconda/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'samtools idxstats dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Chromatin/ENCFF384ZZM.chr22.bam' returned non-zero exit status 1.
exit status 1
. . . Moving on to next stage, please wait . . . 
reading genes
reading enhancers
Making predictions for chromosome: chr22
Making putative predictions table...
Using: ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/HiC/raw/chr22/chr22.VCobserved.gz
Begin HiC
Loading HiC
Traceback (most recent call last):
  File "./src/predict.py", line 148, in <module>
    main()
  File "./src/predict.py", line 107, in main
    this_chr = make_predictions(chromosome, this_enh, this_genes, args)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/predictor.py", line 17, in make_predictions
    pred = add_hic_to_enh_gene_table(enhancers, genes, pred, hic_file, hic_norm_file, hic_is_vc, chromosome, args)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/predictor.py", line 66, in add_hic_to_enh_gene_table
    gamma = args.hic_gamma)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/hic.py", line 42, in load_hic
    HiC_sparse_mat = hic_to_sparse(hic_file, hic_norm_file, hic_resolution)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/hic.py", line 150, in hic_to_sparse
    header=None, engine='c', memory_map=True)
  File "/root/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/root/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/root/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "/root/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/root/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 1891, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 374, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 613, in pandas._libs.parsers.TextReader._setup_parser_source
  File "/root/anaconda/lib/python3.7/gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
NotADirectoryError: [Errno 20] Not a directory: './example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/HiC/raw/chr22/chr22.VCobserved.gz'
exit status 1
jnasser3 commented 4 years ago

Hi Neekon, thanks for reaching out.

This may be a file name/path issue. The relevant error from your output is:

Error: The requested file (./example_chr22/ABC_output/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted) could not be opened. Error message: (No such file or directory). Exiting!

Could you confirm your pipeline outputs a narrowPeak file with this exact name and path? In the README example, this file is located in the Peaks/ subdirectory, so perhaps that is the issue.

neekonsu commented 4 years ago

Hey @jnasser3, thanks for your speedy response! :)

I overlooked the missing /Peaks/ in that printout, and indeed there was a typo in one of my shell files that omitted the directory from the makeCandidateRegions.py call.

I gave the program another go, and interestingly, that mistake doesn't seem to be what was holding up the pipeline. Take a look at the new printout, which largely overlaps with the last.

My thoughts: Looking in ABC-Output/Peaks, my pipeline seems to be generating files that begin with the shell variable symbol, eg: '$FILEBASENAME.bam.{...}', so there may be some ill syntax at play in my shell scripts, I am currently working on that.

After examining the files that makeCandidateRegions.py's subprocesses failed to analyse, it seems that my shell syntax errors are causing the bedtools and macs2 output to be written elsewhere than where makeCandidateRegions.py expects it, so I imagine my errors are really just borne (no pun intended) from the shell problems. I'll keep you posted just in case this doesn't solve the issues, but I don't get the feeling that this is an ABC-inherent bug.

Running: awk 'FNR==NR {x2[$1] = $0; next} $1 in x2 {print x2[$1]}' ./example_chr22/reference/chr22 <(samtools view -H ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam | grep SQ | cut -f 2 | cut -c 4- )  > ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed.temp_sort_order
Running: bedtools sort -faidx ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed.temp_sort_order -i ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted | bedtools coverage -g ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed.temp_sort_order -counts -sorted -a stdin -b ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam | awk '{print $1 "\t" $2 "\t" $3 "\t" $NF}'  | bedtools sort -faidx ./example_chr22/reference/chr22 -i stdin > ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed; rm ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed.temp_sort_order
Fast count method failed to count: ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam

Trying bamtobed method ...

Running: bedtools bamtobed -i ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam | cut -f 1-3 | bedtools intersect -wa -a stdin -b ./example_chr22/reference/chr22.bed | bedtools sort -i stdin -faidx ./example_chr22/reference/chr22 | bedtools coverage -g ./example_chr22/reference/chr22 -counts -sorted -a ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted -b stdin | awk '{print $1 "\t" $2 "\t" $3 "\t" $NF}' > ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed
No columns to parse from file
b''
Traceback (most recent call last):
  File "./src/makeCandidateRegions.py", line 65, in <module>
    main(args)
  File "./src/makeCandidateRegions.py", line 61, in main
    processCellType(args)
  File "./src/makeCandidateRegions.py", line 48, in processCellType
    outdir = args.outDir)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/peaks.py", line 27, in make_candidate_regions_from_summits
    run_count_reads(accessibility_file, raw_counts_outfile, macs_peaks, genome_sizes, use_fast_count=True)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 302, in run_count_reads
    count_bam(target, bed_file, output, genome_sizes=genome_sizes, use_fast_count=use_fast_count)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 368, in count_bam
    if ("terminated" not in err) and ("Error" not in err) and ("ERROR" not in err) and any(data):
UnboundLocalError: local variable 'data' referenced before assignment
exit status 1
. . . Moving on to next stage, please wait . . . 
Namespace(ATAC='', DHS='dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.chr22.bam,dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.chr22.bam', H3K27ac='dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Chromatin/ENCFF384ZZM.chr22.bam', candidate_enhancer_regions='./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.candidateRegions.bed', cellType='K562', chrom_sizes='./example_chr22/reference/chr22', default_accessibility_feature=None, enhancer_class_override=None, expression_table='dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Expression/K562.ENCFF934YBO.TPM.txt', gene_name_annotations='symbol', genes='./example_chr22/reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.chr22.bed', genes_for_class_assignment=None, outdir='./example_chr22/ABC_output/Neighborhoods/', primary_gene_identifier='symbol', qnorm=None, skip_gene_counts=False, skip_rpkm_quantile=False, supplementary_features=None, tss_slop_for_class_assignment=500, ubiquitously_expressed_genes='./src/../reference/UbiquitouslyExpressedGenesHG19.txt', use_secondary_counting_method=False)
Using gene expression from files: ['dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Expression/K562.ENCFF934YBO.TPM.txt'] 

[Errno 2] File dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Expression/K562.ENCFF934YBO.TPM.txt does not exist: 'dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Expression/K562.ENCFF934YBO.TPM.txt'
Traceback (most recent call last):
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 39, in load_genes
    expr = pd.read_table(expression_table, names=[primary_id, name + '.Expression'])
  File "/usr/bin/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/bin/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/usr/bin/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "/usr/bin/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/bin/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 1891, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 374, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 673, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Expression/K562.ENCFF934YBO.TPM.txt does not exist: 'dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Expression/K562.ENCFF934YBO.TPM.txt'
Failed on dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Expression/K562.ENCFF934YBO.TPM.txt
Running command: bedtools sort -faidx ./example_chr22/reference/chr22 -i ./example_chr22/ABC_output/Neighborhoods/GeneList.TSS1kb.bed > ./example_chr22/ABC_output/Neighborhoods/GeneList.TSS1kb.bed.sorted; mv ./example_chr22/ABC_output/Neighborhoods/GeneList.TSS1kb.bed.sorted ./example_chr22/ABC_output/Neighborhoods/GeneList.TSS1kb.bed
Loading coverage from pre-calculated file for Genes.H3K27ac.ENCFF384ZZM.chr22.bam
Usage: samtools idxstats [options] <in.bam>
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -@, --threads INT
               Number of additional threads to use [0]
Traceback (most recent call last):
  File "./src/run.neighborhoods.py", line 97, in <module>
    main(args)
  File "./src/run.neighborhoods.py", line 93, in main
    processCellType(args)
  File "./src/run.neighborhoods.py", line 74, in processCellType
    outdir = args.outdir)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 89, in annotate_genes_with_features
    genes = count_features_for_bed(genes, bounds_bed, genome_sizes, features, outdir, "Genes", force=force, use_fast_count=use_fast_count)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 415, in count_features_for_bed
    df = count_single_feature_for_bed(df, bed_file, genome_sizes, feature_bam, feature, directory, filebase, skip_rpkm_quantile, force, use_fast_count)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 438, in count_single_feature_for_bed
    total_counts = count_total(feature_bam)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 528, in count_total
    total_counts = count_bam_mapped(infile)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 503, in count_bam_mapped
    data = check_output(command, shell=True)
  File "/usr/bin/anaconda/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/usr/bin/anaconda/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'samtools idxstats dirname ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/Chromatin/ENCFF384ZZM.chr22.bam' returned non-zero exit status 1.
exit status 1
. . . Moving on to next stage, please wait . . . 
reading genes
reading enhancers
Making predictions for chromosome: chr22
Making putative predictions table...
Using: ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/HiC/raw/chr22/chr22.VCobserved.gz
Begin HiC
Loading HiC
Traceback (most recent call last):
  File "./src/predict.py", line 148, in <module>
    main()
  File "./src/predict.py", line 107, in main
    this_chr = make_predictions(chromosome, this_enh, this_genes, args)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/predictor.py", line 17, in make_predictions
    pred = add_hic_to_enh_gene_table(enhancers, genes, pred, hic_file, hic_norm_file, hic_is_vc, chromosome, args)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/predictor.py", line 66, in add_hic_to_enh_gene_table
    gamma = args.hic_gamma)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/hic.py", line 42, in load_hic
    HiC_sparse_mat = hic_to_sparse(hic_file, hic_norm_file, hic_resolution)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/hic.py", line 150, in hic_to_sparse
    header=None, engine='c', memory_map=True)
  File "/usr/bin/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/bin/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/usr/bin/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "/usr/bin/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/bin/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 1891, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 374, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 613, in pandas._libs.parsers.TextReader._setup_parser_source
  File "/usr/bin/anaconda/lib/python3.7/gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
NotADirectoryError: [Errno 20] Not a directory: './example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam/HiC/raw/chr22/chr22.VCobserved.gz'
exit status 1
neekonsu commented 4 years ago

@jnasser3 Hi again!

So I have made some progress, and there were indeed some more sneaky formatting errors where I was passing variables between stages of the pipeline.

There now only remains one error that I cannot explain, and it is in the makeCandidateRegions.py printout:

Verifying arguments for 'makeCandidateRegions.py'
——————————————————————
./src/makeCandidateRegions.py
src/makeCandidateRegions.py
——————————————————————
./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted
example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted
——————————————————————
./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam
example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam
——————————————————————
./example_chr22/ABC_output/Peaks/
example_chr22/ABC_output/Peaks/
——————————————————————
./example_chr22/reference/chr22
example_chr22/reference/chr22
——————————————————————
./reference/wgEncodeHg19ConsensusSignalArtifactRegions.bed
reference/wgEncodeHg19ConsensusSignalArtifactRegions.bed
——————————————————————
./example_chr22/reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.chr22.bed
example_chr22/reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.chr22.bed
——————————————————————
Running: awk 'FNR==NR {x2[$1] = $0; next} $1 in x2 {print x2[$1]}' ./example_chr22/reference/chr22 <(samtools view -H ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam | grep SQ | cut -f 2 | cut -c 4- )  > ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed.temp_sort_order
Running: bedtools sort -faidx ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed.temp_sort_order -i ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted | bedtools coverage -g ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed.temp_sort_order -counts -sorted -a stdin -b ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam | awk '{print $1 "\t" $2 "\t" $3 "\t" $NF}'  | bedtools sort -faidx ./example_chr22/reference/chr22 -i stdin > ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed; rm ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed.temp_sort_order
Fast count method failed to count: ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam

Trying bamtobed method ...

Running: bedtools bamtobed -i ./example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam | cut -f 1-3 | bedtools intersect -wa -a stdin -b ./example_chr22/reference/chr22.bed | bedtools sort -i stdin -faidx ./example_chr22/reference/chr22 | bedtools coverage -g ./example_chr22/reference/chr22 -counts -sorted -a ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted -b stdin | awk '{print $1 "\t" $2 "\t" $3 "\t" $NF}' > ./example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed
No columns to parse from file
b''
Traceback (most recent call last):
  File "./src/makeCandidateRegions.py", line 65, in <module>
    main(args)
  File "./src/makeCandidateRegions.py", line 61, in main
    processCellType(args)
  File "./src/makeCandidateRegions.py", line 48, in processCellType
    outdir = args.outDir)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/peaks.py", line 27, in make_candidate_regions_from_summits
    run_count_reads(accessibility_file, raw_counts_outfile, macs_peaks, genome_sizes, use_fast_count=True)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 302, in run_count_reads
    count_bam(target, bed_file, output, genome_sizes=genome_sizes, use_fast_count=use_fast_count)
  File "/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 368, in count_bam
    if ("terminated" not in err) and ("Error" not in err) and ("ERROR" not in err) and any(data):
UnboundLocalError: local variable 'data' referenced before assignment
exit status 1

Notice how bedtools bamtobed ... returns: No columns to parse from file in reference to ABC-Enhancer-Gene-Prediction/example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam.

Looking at the output of file on wgEncodeUwDnaseK562AlnRep1.chr22.bam, we see that this file is in a compressed format:

file ABC-Enhancer-Gene-Prediction/example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam
ABC-Enhancer-Gene-Prediction/example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam: gzip compressed data, extra field, original size 0

The help page for bedtools bamtobed doesn't indicate that the program is equipped to decompress input files, so I wonder if this issue could be more widespread than my pipeline development. I will try decompressing and seeing what makeCandidateRegions.py says, I appreciate your help.

Please let me know if the input files in the example data are meant to be compressed, or if I should decompress them before use.

Thank you,

Neekon Saadat

jnasser3 commented 4 years ago

you shouldn't have to alter (decompress etc) the .bam files - it should just work out of the box.

What is the actual bedtools output/error if you just run the bedtools commands alone? Can you determine at what point in the bedtools commands the error is occuring?

neekonsu commented 4 years ago

@jnasser3 This is a little strange:

(base) root@d788ddf57c51:/go/src/github.com/neekonsu# cd ABC-Enhancer-Gene-Prediction/

(base) root@d788ddf57c51:/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction# awk 'FNR==NR {x2[$1] = $0; next} $1 in x2 {print x2[$1]}' \
example_chr22/reference/chr22 <(samtools view -H example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam \
| grep SQ | cut -f 2 | cut -c 4- )  > \
example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed.temp_sort_order

(base) root@d788ddf57c51:/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction# bedtools sort -faidx example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed.temp_sort_order \
-i example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted \
| bedtools coverage -g example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed.temp_sort_order -counts -sorted -a stdin \
-b example_chr22/input_data/Chromatin/wgEncodeUwDnaseK562AlnRep1.chr22.bam \
| awk '{print $1 "\t" $2 "\t" $3 "\t" $NF}'  | bedtools sort -faidx example_chr22/reference/chr22 -i stdin > \
example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed; \
rm example_chr22/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.chr22.macs2_peaks.narrowPeak.sorted.wgEncodeUwDnaseK562AlnRep1.chr22.bam.Counts.bed.temp_sort_order

(base) root@d788ddf57c51:/go/src/github.com/neekonsu/ABC-Enhancer-Gene-Prediction# 

It seems as though the commands that makeCandidateRegions.py deploys function properly when I execute them manually. My intuition would tell me to look for a discrepancy between what I plugged into the shell just now and what makeCandidateRegions.py printed out in the pipeline, however I actually copied these commands from makeCandidateRegions.py's printout in the first place, so they should be the exact same. Maybe some terminal-voodoo magic has gotten grip on the pipeline? :)

neekonsu commented 4 years ago

By the way, since the time I opened this issue, I have rooted out all of the incorrect filepaths that I was previously passing to the pipeline. Now that the paths are no longer the issue, I wonder what could be causing the bedtools errors. My hunch is that there is some subtle technicality with how file permissions or command execution works in python that is causing the same command to fail in the script yet work when executed manually.

thouis commented 4 years ago

The lack of error output and weirdness around bash might point to more memory limitation problems. I had a lot of trouble with this recently, as well.

You could edit the python to add a print statement for the exit code to help figure that out. I believe if it's killed by the out-of-memory it will have an exit code of 9.

One way to test this might be to aggressively downsample the .bam before running the python code.

On Mon, Jul 27, 2020 at 1:15 PM Neekon Saadat notifications@github.com wrote:

By the way, since the time I opened this issue, I have rooted out all of the incorrect filepaths that I was previously passing to the pipeline. Now that the paths are no longer the issue, I wonder what could be causing the bedtools errors. My hunch is that there is some subtle technicality with how file permissions or command execution works in python that is causing the same command to fail in the script yet work when executed manually.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/ABC-Enhancer-Gene-Prediction/issues/25#issuecomment-664525607, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADTPUY7GRAOYICCTXTJMQLR5WY2JANCNFSM4PD3T2QQ .

jnasser3 commented 4 years ago

@thouis the errors here are only cropping up on the example bam which is chr22 only - so its already very small

@neekonsu ok, I'm glad it works on the command line. Not quite sure what the issue is since I'm not familiar with how this pipelining code works. You should also check to make sure the pipeline 'sees' the .bai index file, which is not an input to the function but needs to be there (perhaps not for this bedtools command, but for subsequent parts of the pipeline)

neekonsu commented 4 years ago

@thouis @jnasser3 I figured it out by poking around through the sourcecode that the count_bam function in neighborhoods.py had referenced an undefined variable data in 368, which I assumed by context to mean stderrdata instead. After changing the code, this segment of the pipeline began functioning properly, however a new function in part 2 of the pipeline (referring to how you broke it up in the readme) ran into a new, but much more debuggable problem. I detail this problem in a new issue.

Thank you for your patience and help, much appreciated.

thouis commented 4 years ago

I'm guessing this would be fixed by merging: https://github.com/broadinstitute/ABC-Enhancer-Gene-Prediction/pull/24