dariober / SICERpy

Python wrapper around the popular ChIP-Seq peak caller SICER
15 stars 3 forks source link

get_total_tag_counts_bam gives different result than give_total_tag_counts_bed_graph #10

Closed endrebak closed 5 years ago

endrebak commented 5 years ago

This is on the test data. I do not know if this is a bug in the original SICER, as I find it too unpleasant to get the original running to bother 😳 Sorry.

With a remove duplicates of 1, get_total_tag_counts_bam gives 9924, give_total_tag_counts_bed_graph 9904.

# find_islands_in_pr.py
Total read count: 9904.0
# associate_tags_with_chip_and_control_w_fc_q_bam.py
chip_library_size  9924 # added print statement to line 61

This can have a teensy effect on the p-values of ChIP-seq data with many duplicates. I guess the easiest thing is just to add a warning about this instead of fixing it.

endrebak commented 5 years ago

This seems to be a bug in the original - agree/disagree? I cannot see any reason why you would want to use a different number for the lib size. Will dive deeper into this tomorrow :)

Find candidate islands exhibiting clustering ...
/mnt/work/endrebak/software/anaconda/envs/py27/bin/python /home/endrebak/code/epic_paper/SICER/src/find_islands_in_pr.py -s hg19 -b /mnt/scratch/projects/epic_bencmarks/data/sicer_results/hiya/ex/test-W200.graph -w 200 -g 600 -t 0.74 -e 1000 -f /mnt/scratch/projects/epic_bencmarks/data/sicer_results/hiya/ex/test-W200-G600.scoreisland
Species:  hg19
Window_size:  200
Gap size:  600
E value is: 1000.0
*** Total read count: 9904.0 ***
Genome Length:  3095693983
Effective genome Length:  2290813547
Window average: 0.000864670982322
Window pvalue: 0.2
Minimum num of tags in a qualified window:  1
Generate the enriched probscore summary graph and filter the summary graph to get rid of ineligible windows
Determine the score threshold from random background
The score threshold is:  7.055
Make and write islands
    chrY does not have any islands meeting the required significance
    chr21 does not have any islands meeting the required significance
Total number of islands:  166

Calculate significance of candidate islands using the control library ...
/mnt/work/endrebak/software/anaconda/envs/py27/bin/python /home/endrebak/code/epic_paper/SICER/src/associate_tags_with_chip_and_control_w_fc_q.py -s hg19  -a /mnt/scratch/projects/epic_bencmarks/data/sicer_results/hiya/ex/test-1-removed.bed -b /mnt/scratch/projects/epic_bencmarks/data/sicer_results/hiya/ex/control-1-removed.bed -d /mnt/scratch/projects/epic_bencmarks/data/sicer_results/hiya/ex/test-W200-G600.scoreisland -f 150 -t 0.74 -o /mnt/scratch/projects/epic_bencmarks/data/sicer_results/hiya/ex/test-W200-G600-islands-summary
*** chip library size   9924.0 ***
control library size   9310.0
Warning: chrM reads do not exist in /mnt/scratch/projects/epic_bencmarks/data/sicer_results/hiya/ex/test-1-removed.bed
Total number of chip reads on islands is:  340
Total number of control reads on islands is:  2
endrebak commented 5 years ago

There is indeed a bug in SICER; when computing the final FDR scores they include reads beyond the chromosome borders in the count for the ChIP and Input-library.

dariober commented 5 years ago

Thanks for digging into the issue and apologies I kept quiet!

endrebak commented 5 years ago

No problem. I am releasing a new SICER.jl soon, if you would like to contribute/have ideas I am open for including co-authors. Dunno if you have anything you'd like to change about the original though.

dariober commented 5 years ago

SICER.jl

Cool! Is this an implementation in Julia? I had minimal exposure to Julia but I really liked it. I haven't worked with ChIP-Seq data recently but I'll keep an eye on your project.

endrebak commented 5 years ago

Yes, Julia 1.0. I had complaints about epic being a memory-hog and it was true.

Furthermore in SICER.jl I compute the recurrence in the exact same way as SICER.