biocore-ntnu / epic

(DEPRECATED) epic: diffuse domain ChIP-Seq caller based on SICER
http://bioepic.readthedocs.io
MIT License
31 stars 6 forks source link

-w parameter #63

Closed yeroslaviz closed 7 years ago

yeroslaviz commented 7 years ago

There is something weird about the results here. When I am reading the resulted bigwig file (i2bw file) into my R session, I am getting non-overlapping bins with one position skipped each time.

GRanges object with 5451 ranges and 1 metadata column:
         seqnames             ranges strand |              score
            <Rle>          <IRanges>  <Rle> |          <numeric>
     [1]       XV         [  1, 199]      * |  0.244747966527939
     [2]       XV         [201, 399]      * |  0.857724845409393
     [3]       XV         [401, 599]      * | -0.441242098808289
     [4]       XV         [601, 799]      * | -0.302922338247299
     [5]       XV         [801, 999]      * |  -1.30326700210571

It looks line each time one base is skipped from the counting (200,400,600,etc.) It is even worse when the -w parameter is set to 150.

GRanges object with 60747 ranges and 1 metadata column:
          seqnames           ranges strand |              score
             <Rle>        <IRanges>  <Rle> |          <numeric>
      [1]        I       [  1, 149]      * |  0.566456496715546
      [2]        I       [201, 349]      * |   0.72018426656723
      [3]        I       [401, 549]      * |  0.222972229123116
      [4]        I       [601, 749]      * | -0.661386787891388
      [5]        I       [801, 949]      * | -0.159120514988899

Here each time 51 bases are skipped.

Is this deliberate? I hope it is not to confusing, that I show the R output here, as this is how I can visualize my bigwig results.

endrebak commented 7 years ago

I have never looked closely at the bigwigs. It is not intentional; I'll try to figure out why this happens, but I do not have the time within the next two weeks. However, the results are still based on data that is overlapping, without any gaps.

Can you show me how you read the data (which libraries and example code) and I'll get around to debugging this sometime in the future.

Thanks for reporting 👍

yeroslaviz commented 7 years ago

reading the file with epic

epic --control sizKO.input.0h.1.reads.bedpe --treatment sizKO.ip.0h.1.reads.bedpe -cpu 10 --genome sacCer3 -egf 0.961651807729 -cs Sc.R64.1.genome -b bedFiles/sizKO.InputIP.0h.1.bed -sm matrixFiles/sizKO.InputIP.0h.1.matrix -bw bigwigFiles/sizKO.InputIP.0h.1 -cbw ChipBigwigFiles/sizKO.InputIP.0h.1 --outfile OutputFiles/sizKO.InputIP.0h.1.Output --log logFiles/sizKO.InputIP.0h.1.log -fdr 0.1 -w 150

reading it into R

library(rtracklayer)
ip.file <- import.bw(con = "../EpicResults/sizKO.ip.0h.1.reads_log2fc.bw")
yeroslaviz commented 7 years ago

... However, the results are still based on data that is overlapping, without any gaps.

I am not sure what this means regarding the results i have. If they are overlapping, the positions are completely wrong. if the first bin ends at the 149 and the second begins at 200, I have a gap of 50 bases, if the second bin begins at 150 (or 151 if the structure is consistent), the given positions are shifted 50 bases for each bin. This would cause a huge shift at the end of each chromosome. It still looks like it is based on a 200 bases windows, but calculate the coverage for only 150 bases within this 200 bases each time.

endrebak commented 7 years ago

Great catch, I did not see that part! For the -w 150 the results are probably very wrong!

endrebak commented 7 years ago

I'll try to find some time to start looking at this tomorrow. It is a serious error.

Endre

endrebak commented 7 years ago

I can reproduce the 150-error for paired-end data, but not single-end beds. This error also shows up in the matrix creation (--sm).

endrebak commented 7 years ago

I thanked you in the CHANGELOG. Should be fixed in 0.2.5. The problem was simply a hardcoded bin-size. Thanks again for reporting 👍