Ensembl / WiggleTools

Basic operations on the space of numerical functions defined on the genome using lazy evaluators for flexibility and efficiency
Apache License 2.0
142 stars 24 forks source link

gte not working and compression #61

Closed gevro closed 3 years ago

gevro commented 3 years ago

Hi, I tested 'gte' and it isn't working. Example below. I'm guessing lte might have the same issue.

Note also that gt/gte/lt/lte are still causing compression with write_bg. I think that is useful in some cases, but not desirable in other cases. So I think it would be good to have it as an option: either with or without compression. Maybe:

wiggletools write_bg - bin 20 scale 0.05 fillIn genome.bed blah.bw | head
chr1    0   20  0.005435
chr1    20  40  0.000000
chr1    40  60  0.000000
chr1    60  80  0.000000
chr1    80  100 0.000000
chr1    100 120 0.000000
chr1    120 140 0.000000
chr1    140 160 0.005435
chr1    160 180 0.000000
chr1    180 200 0.000000
$ wiggletools write_bg - gte 0.005435 bin 20 scale 0.05 fillIn genome.bed blah.bw | head
chr1    280 400 1.000000
chr1    420 1940    1.000000
chr1    1960    2020    1.000000
chr1    2040    2280    1.000000
chr1    2300    2400    1.000000
chr1    2420    2440    1.000000
chr1    2460    2480    1.000000
chr1    2500    2540    1.000000
chr1    2560    2600    1.000000
chr1    2640    2660    1.000000
$ wiggletools write_bg - gte 0.005434 bin 20 scale 0.05 fillIn genome.bed blah.bw | head
chr1    0   20  1.000000
chr1    140 160 1.000000
chr1    260 2280    1.000000
chr1    2300    2440    1.000000
chr1    2460    2540    1.000000
chr1    2560    2620    1.000000
chr1    2640    2880    1.000000
chr1    2900    2920    1.000000
chr1    2960    3000    1.000000
chr1    3040    3060    1.000000
dzerbino commented 3 years ago

Hello @gevro ,

Without the raw files, I was not able to reproduce your bug, however it is possible that the values at chr1:0-20 and chr1:140-160 were just below the cutoff (0.005435) and rounded up to 0.005435 when printing.

Also, I don't understand what you mean by compression. The gte operator is returning all the regions that have a score greater or equal to 0.005435 as advertised. If you also want the regions with 0, I would recommend re-ordering the command as such (note that the binning and filling in happens twice):

wiggletools write_bg - bin 20 fillIn genome.bed gte 0.005435 bin 20 fillIn genome.bed scale 0.05 blah.bw

HTH,

Daniel

gevro commented 3 years ago

Hi @dzerbino, Thanks. Here is what I mean by the gte bug and by compression: 1) gte:

In this example, there is a bin "chr1 0 20 0.005435" that should match the "gte 0.005435" filter:

wiggletools write_bg - bin 20 scale 0.05 fillIn genome.bed blah.bw | head
chr1    0   20  0.005435
chr1    20  40  0.000000
chr1    40  60  0.000000
chr1    60  80  0.000000
chr1    80  100 0.000000
chr1    100 120 0.000000
chr1    120 140 0.000000
chr1    140 160 0.005435
chr1    160 180 0.000000
chr1    180 200 0.000000

However, when adding that filter, that bin is missing:

$ wiggletools write_bg - gte 0.005435 bin 20 scale 0.05 fillIn genome.bed blah.bw | head
chr1    280 400 1.000000
chr1    420 1940    1.000000
chr1    1960    2020    1.000000
chr1    2040    2280    1.000000

Because the gte filter is applied after "bin 20 scale 0.05", it should be a simple filter that keeps any bin with that value or larger. There technically shouldn't be any issues of rounding, because the filter is applied after the bin 20 scale 0.05 calculations. I think more intuitive behavior would be to apply the filters after rounding? Just an idea. But perhaps the current behavior is more desirable for accuracy, but if so, documentation should specify that the threshold specified should be made slightly different (-.0001 or so, for gte, and conversely for lte) in order for it to work properly.

  1. Compressed vs uncompressed: In the below example, there is a bin "chr1 260 2280 1.0000". However, it would be very useful to have uncompressed output after gt/gte/lt/lte, where every bin that passes the filter is listed. For example: chr1 260 280 1.000 chr1 280 300 1.000 chr1 300 320 1.000 etc.
$ wiggletools write_bg - gte 0.005434 bin 20 scale 0.05 fillIn genome.bed blah.bw | head
chr1    0   20  1.000000
chr1    140 160 1.000000
chr1    260 2280    1.000000
chr1    2300    2440    1.000000
chr1    2460    2540    1.000000
dzerbino commented 3 years ago

Hello @gevro ,

  1. The values are rounded just before they are printed into output (with 6 digits past the decimal dot). Therefore it is possible that a block has a value of 0.0054348, which is below 0.005435, but is rounded up to 0.005435 when printed out.

  2. I'm not getting the same results as you...

    noah-login-02.ebi.ac.uk>  wiggletools write_bg - gte 0.005435 bin 20 scale 0.05 fillIn genome.bed blah.bw | head
    chr1    109340  110360  1.000000

I think this is related to Issue #62

Cheers,

Daniel

gevro commented 3 years ago

Hi Daniel, Regarding point #2, that is interesting. There is definitely some issue here. Because the data in that region looks like this:

$wiggletools write_bg - bin 20 scale 0.05 fillIn genome.bed blah.bw
chr1    109320  109340  0.000000
chr1    109340  109360  0.569123
chr1    109360  109380  0.900428
chr1    109380  109400  0.914686
chr1    109400  109420  0.916929
chr1    109420  109440  0.908535
chr1    109440  109460  0.931414
chr1    109460  109480  0.938542
chr1    109480  109500  0.935714
chr1    109500  109520  0.915357
chr1    109520  109540  0.867500
chr1    109540  109560  0.869167
chr1    109560  109580  0.827153
chr1    109580  109600  0.781641
chr1    109600  109620  0.795388
chr1    109620  109640  0.826795

But gte 0.005435 only gives one bin as the result:

$ wiggletools write_bg - gte 0.005435 bin 20 scale 0.05 fillIn genome.bed blah.bw
chr1    109340  110360  1.000000

...which is wrong, because there are other bins that are > 0.005435

So perhaps it is related to issue #62.

dzerbino commented 3 years ago

Hello @gevro ,

Actually, you are right, there was an implicit compression system which I had forgotten about, and which I now cleared up.

Could you please try again on master?

Cheers,

Daniel

gevro commented 3 years ago

Thanks. Is it possible to add write_bg with an option for either compressed or uncompressed? Maybe with a flag? Or write_bgu versus write_bgc ?

There are use cases for both options, so it would be very useful to have both.

dzerbino commented 3 years ago

I have just now implemented a new "compress" keyword on master (cf README).

Let me know,

Daniel

gevro commented 3 years ago

Thanks, works.