biocore-ntnu / epic

(DEPRECATED) epic: diffuse domain ChIP-Seq caller based on SICER
http://bioepic.readthedocs.io
MIT License

Add instructions on how to add a new genome #22

Closed balwierz closed 8 years ago

balwierz commented 8 years ago

I tried running epic on zebrafish data. For this I created the file /usr/local/lib/python2.7/dist-packages/epic/scripts/chromsizes/danRer7.chromsizes containing the chromosome sizes and ran epic on paired-end data:

epic --treatment myReads.bedpe.bz2 \
    --control myInput.bedpe.bz2 \
    --number-cores 8 \
    --genome danRer7 \
    --effective_genome_length 0.9 \
    --paired-end
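
For reference, the chromsizes file here is assumed to follow the usual UCSC two-column layout (tab-separated chromosome name and length in bp); whether epic expects exactly this layout is an assumption. A minimal sketch of reading such a file:

# Minimal sketch of loading a UCSC-style chromsizes file
# (two tab-separated columns: chromosome name, length in bp).
# The exact format epic expects is assumed, not confirmed.
def read_chromsizes(path):
    sizes = {}
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue
            chromosome, size = line.split()[:2]
            sizes[chromosome] = int(size)
    return sizes

# usage: sizes = read_chromsizes("danRer7.chromsizes")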

But I ran into a problem I don't understand:


Merging ChIP and Input data. (File: helper_functions, Log level: INFO, Time: Tue, 05 Jul 2016 17:32:56 )
0.9 effective_genome_length (File: compute_background_probabilites, Log level: DEBUG, Time: Tue, 05 Jul 2016 17:33:01 )
200 window size (File: compute_background_probabilites, Log level: DEBUG, Time: Tue, 05 Jul 2016 17:33:01 )
0 total chip count (File: compute_background_probabilites, Log level: DEBUG, Time: Tue, 05 Jul 2016 17:33:01 )
0.0 average_window_readcount (File: compute_background_probabilites, Log level: DEBUG, Time: Tue, 05 Jul 2016 17:33:01 )
1 island_enriched_threshold (File: compute_background_probabilites, Log level: DEBUG, Time: Tue, 05 Jul 2016 17:33:01 )
4.0 gap_contribution (File: compute_background_probabilites, Log level: DEBUG, Time: Tue, 05 Jul 2016 17:33:01 )
1.0 boundary_contribution (File: compute_background_probabilites, Log level: DEBUG, Time: Tue, 05 Jul 2016 17:33:01 )
Traceback (most recent call last):
  File "/usr/local/bin/epic", line 164, in <module>
    run_epic(args)
  File "/usr/local/lib/python2.7/dist-packages/epic/run/run_epic.py", line 45, in run_epic
    compute_background_probabilities(nb_chip_reads, args)
  File "/usr/local/lib/python2.7/dist-packages/epic/statistics/compute_background_probabilites.py", line 48, in compute_background_probabilities
    boundary_contribution, genome_length_in_bins)
  File "/usr/local/lib/python2.7/dist-packages/epic/statistics/compute_score_threshold.py", line 24, in compute_score_threshold
    current_scaled_score = int(round(score / BIN_SIZE))
OverflowError: cannot convert float infinity to integer

Could you please let me know how to proceed?

Piotr

endrebak commented 8 years ago

Thanks for trying epic. I'll write docs eventually, but for now I'll just add that data for you manually. Are there any genomes besides danRer7 you'd like?

endrebak commented 8 years ago

Added the genome to epic/scripts/genome.snakefile, then ran the command snakemake -j 25 --resources instances=1 -s genome.snakefile.

Note that it needs a boatload of RAM to compute the EGS (effective genome size), so I am doing it on our server. I'll ping you when I have updated epic with the new data; tomorrow at the latest.
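
To illustrate why this is memory-hungry, here is a toy sketch (not epic's actual implementation): the effective genome fraction is roughly the share of positions whose read-length k-mer occurs exactly once in the genome, and a naive computation holds every k-mer in memory at once.

from collections import Counter

def effective_genome_fraction(genome, read_length):
    # Toy illustration only, not epic's implementation: count the
    # fraction of positions whose read_length-mer is unique in the
    # genome. Keeping every k-mer in memory is what eats the RAM.
    kmers = [genome[i:i + read_length]
             for i in range(len(genome) - read_length + 1)]
    counts = Counter(kmers)
    return sum(1 for k in kmers if counts[k] == 1) / float(len(kmers))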

balwierz commented 8 years ago

Thanks. danRer7 and danRer10

endrebak commented 8 years ago

In your log output you got 0 ChIP counts. Strange...
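
That zero is presumably what triggers the OverflowError: with an average window readcount of 0.0, the SICER-style background score blows up to infinity before it reaches compute_score_threshold. The failing line can be reproduced in isolation (the BIN_SIZE value below is a placeholder, not necessarily epic's constant):

# Reproducing the final failing line in isolation. BIN_SIZE is a
# placeholder; the point is that an infinite score cannot be
# converted to an integer.
BIN_SIZE = 0.001

score = float("inf")  # what a 0.0 average window readcount leads to upstream
current_scaled_score = int(round(score / BIN_SIZE))
# OverflowError: cannot convert float infinity to integer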

Anyway, 0.1.1 is out on PyPI with danRer7 and danRer10 added.

Keeping this issue open as a reminder.

balwierz commented 8 years ago

Hi endrebak, thanks for adding the zebrafish genomes! I am back from holidays and tried epic 0.1.1 on my data, but I am getting exactly the same error as above.

Merging ChIP and Input data. (File: helper_functions, Log level: INFO, Time: Fri, 05 Aug 2016 15:21:52 )
1240914652.0 effective_genome_length (File: compute_background_probabilites, Log level: DEBUG, Time: Fri, 05 Aug 2016 15:21:52 )
200 window size (File: compute_background_probabilites, Log level: DEBUG, Time: Fri, 05 Aug 2016 15:21:52 )
0 total chip count (File: compute_background_probabilites, Log level: DEBUG, Time: Fri, 05 Aug 2016 15:21:52 )
0.0 average_window_readcount (File: compute_background_probabilites, Log level: DEBUG, Time: Fri, 05 Aug 2016 15:21:52 )
1 island_enriched_threshold (File: compute_background_probabilites, Log level: DEBUG, Time: Fri, 05 Aug 2016 15:21:52 )
4.0 gap_contribution (File: compute_background_probabilites, Log level: DEBUG, Time: Fri, 05 Aug 2016 15:21:52 )
1.0 boundary_contribution (File: compute_background_probabilites, Log level: DEBUG, Time: Fri, 05 Aug 2016 15:21:52 )
Traceback (most recent call last):
  File "/usr/local/bin/epic", line 165, in <module>
    run_epic(args)
  File "/usr/local/lib/python2.7/dist-packages/epic/run/run_epic.py", line 45, in run_epic
    compute_background_probabilities(nb_chip_reads, args)
  File "/usr/local/lib/python2.7/dist-packages/epic/statistics/compute_background_probabilites.py", line 49, in compute_background_probabilities
    boundary_contribution, genome_length_in_bins)
  File "/usr/local/lib/python2.7/dist-packages/epic/statistics/compute_score_threshold.py", line 24, in compute_score_threshold
    current_scaled_score = int(round(score / BIN_SIZE))
OverflowError: cannot convert float infinity to integer

What worries me are the 0 total chip count and the 0.0 average window readcount. It seems that for some reason the BEDPE files were not parsed correctly. They are bz2-compressed (with a .bz2 extension) and look as follows:

chr3    40932504        40932603        chr3    40932590        40932623        HWI-D00467:172:C8VJCANXX:2:1101:1145:20094      42      +       -
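
Those are the standard bedtools BEDPE columns; a quick sanity-check parse of that record (field names follow the BEDPE spec):

# Parsing the BEDPE record above into the standard bedtools columns:
# chrom1 start1 end1 chrom2 start2 end2 name score strand1 strand2
line = ("chr3\t40932504\t40932603\tchr3\t40932590\t40932623\t"
        "HWI-D00467:172:C8VJCANXX:2:1101:1145:20094\t42\t+\t-")
(chrom1, start1, end1, chrom2, start2, end2,
 name, score, strand1, strand2) = line.split("\t")
assert chrom1 == chrom2  # epic's pipeline expects both mates on one chromosome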

I tried removing --number-cores, but it didn't help. I checked the code of count_reads_in_windows and can confirm that I have bzgrep, perl, List::Util and the standard Unix commands used in this function installed.

Now I see that _count_reads_in_windows_paired_end doesn't try to use bzgrep. I will try fixing the code.

Edit: it seems the bzgrep fix helped; epic hasn't crashed so far.
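
The fix boils down to picking the grep flavor from the file suffix, which the _options helper apparently already does for the single-end path; an illustrative sketch (the function name is mine, not epic's):

# Sketch of the fix: choose the grep flavor from the file suffix so
# compressed BEDPE files are searched correctly. Illustrative only;
# epic's _options helper does something along these lines.
def grep_command_for(bed_file):
    if bed_file.endswith(".bz2"):
        return "bzgrep"
    if bed_file.endswith(".gz"):
        return "zgrep"
    return "grep"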

endrebak commented 8 years ago

Thanks! I had no BEDPE files to test on, so I had to make my own. I'll try to fix this next week 👍

I should have seen that your data was bzip2-compressed.

endrebak commented 8 years ago

This should be fixed in 0.1.2, which is out on PyPI, thanks. Keeping this open since the original issue was about something else (and much less critical).

balwierz commented 8 years ago

Thanks. I think there is one more change needed:

--- windows/count/count_reads_in_windows.py.old 2016-08-08 13:42:40.183875265 +0100
+++ windows/count/count_reads_in_windows.py 2016-08-08 13:43:01.957903095 +0100
@@ -120,7 +120,7 @@
     grep, duplicate_handling = _options(bed_file, keep_duplicates)

     command = """
-    grep -E "^{chromosome}\\b.*{chromosome}\\b.*" {bed_file} | # Both chromos must be equal; no chimeras (?)
+    {grep} -E "^{chromosome}\\b.*{chromosome}\\b.*" {bed_file} | # Both chromos must be equal; no chimeras (?)
     cut -f 1-6  | sort -k2,3n -k4,5n | {duplicate_handling} # get chr start end chr start end for PE; sort on location
     LC_ALL=C perl -a -ne 'use List::Util qw[min max]; $start = min($F[1], $F[2]); $end = max($F[4], $F[5]); $middle = $start + int(($end - $start)/2); $bin = $middle - $middle % 200; print "$F[0] $bin\\n"' | # Find bin of midpoint between start and ends
     uniq -c |
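
For anyone reading that pipeline: the Perl one-liner bins each fragment by the midpoint of its outermost coordinates. The same logic in Python (field positions assumed from the cut -f 1-6 output):

# Equivalent of the pipeline's Perl one-liner: bin a paired-end
# fragment by the midpoint between its outermost coordinates.
def fragment_bin(start1, end1, start2, end2, bin_size=200):
    start = min(start1, end1)   # Perl's min($F[1], $F[2])
    end = max(start2, end2)     # Perl's max($F[4], $F[5])
    middle = start + (end - start) // 2
    return middle - middle % bin_size

# With the BEDPE record shown earlier:
print(fragment_bin(40932504, 40932603, 40932590, 40932623))  # -> 40932400
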
endrebak commented 8 years ago

God, I am a moron. Thanks!

PS: that code might seem clunky, but it was much faster than all the other, saner alternatives I tried.

endrebak commented 8 years ago

Should be fixed in 0.1.14 out on PyPI now. Thanks again.