Ensembl / WiggleTools

Basic operations on the space of numerical functions defined on the genome using lazy evaluators for flexibility and efficiency
Apache License 2.0
143 stars 25 forks source link

bug in bin / question #64

Closed gevro closed 3 years ago

gevro commented 3 years ago

Hi, Does the bin command automatically fill in missing values with 0?

In other words, do I need to do: bin ## fillIn ### blah.bw in this order? Or just bin ### blah.bw ? Or fillIn ### bin ## fillIn ### blah.bw? Or fillIn ### bin ## blah.bw?

gevro commented 3 years ago

I think there is actually a bug.

In the below blah.bw, from position 35 to position 49, the value is 0.25:

$ wiggletools write_bg - blah.bw | head -n 10       
chr1    0   35  0.000000
chr1    35  141 0.250000
chr1    141 170 0.500000
chr1    170 183 0.250000
chr1    183 342 0.000000
chr1    342 343 0.145834
chr1    343 344 0.129032
chr1    344 345 0.135135
chr1    345 346 0.142857
chr1    346 347 0.145834

Here it is with fillIn:

$ wiggletools write_bg - fillIn genome.bed blah.bw | head -n 10
chr1    0   35  0.000000
chr1    35  141 0.250000
chr1    141 170 0.500000
chr1    170 183 0.250000
chr1    183 342 0.000000
chr1    342 343 0.145834
chr1    343 344 0.129032
chr1    344 345 0.135135
chr1    345 346 0.142857
chr1    346 347 0.145834

But when I do bin 50, it starts from chr1 position 49, and wrongly excludes position 35 - 49:

$ wiggletools write_bg - bin 50 blah.bw | head -n 10
chr1    49  99  12.500000
chr1    99  149 14.500000
chr1    149 199 13.750000
chr1    199 249 0.000000
chr1    249 299 0.000000
chr1    299 349 1.013238
chr1    349 399 10.488659
chr1    399 449 12.070403
chr1    449 499 10.081707
chr1    499 549 9.438597

And regardless of where I put fillIn, it doesn't help:

$ wiggletools write_bg - bin 50 fillIn genome.bed blah.bw | head -n 10
chr1    49  99  12.500000
chr1    99  149 14.500000
chr1    149 199 13.750000
chr1    199 249 0.000000
chr1    249 299 0.000000
chr1    299 349 1.013238
chr1    349 399 10.488659
chr1    399 449 12.070403
chr1    449 499 10.081707
chr1    499 549 9.438597
$ wiggletools write_bg - fillIn genome.bed bin 50 blah.bw | head -n 10
chr1    0   49  0.000000
chr1    49  99  12.500000
chr1    99  149 14.500000
chr1    149 199 13.750000
chr1    199 249 0.000000
chr1    249 299 0.000000
chr1    299 349 1.013238
chr1    349 399 10.488659
chr1    399 449 12.070403
chr1    449 499 10.081707
$ wiggletools write_bg - fillIn genome.bed bin 50 fillIn genome.bed blah.bw | head -n 10
chr1    0   49  0.000000
chr1    49  99  12.500000
chr1    99  149 14.500000
chr1    149 199 13.750000
chr1    199 249 0.000000
chr1    249 299 0.000000
chr1    299 349 1.013238
chr1    349 399 10.488659
chr1    399 449 12.070403
chr1    449 499 10.081707

Perhaps it is just an edge case bug for the first bin.

gevro commented 3 years ago

Another strange example from a different bigwig file:

$ wiggletools write_bg - blah2.bw | head -n 100
chr1    0   1   0.000000
chr1    1   2   0.000000
chr1    2   3   0.000000
chr1    3   4   0.000000
chr1    4   5   0.000000
chr1    5   6   0.000000
chr1    6   7   0.000000
chr1    7   8   0.000000
chr1    8   9   0.000000
chr1    9   10  0.000000
chr1    10  11  0.000000
chr1    11  12  0.000000
chr1    12  13  0.000000
chr1    13  14  0.000000
chr1    14  15  0.000000
chr1    15  16  0.000000
chr1    16  17  0.000000
chr1    17  18  0.000000
chr1    18  19  0.000000
chr1    19  20  0.000000
chr1    20  21  0.000000
chr1    21  22  0.000000
chr1    22  23  0.000000
chr1    23  24  0.000000
chr1    24  25  0.000000
chr1    25  26  0.000000
chr1    26  27  0.000000
chr1    27  28  0.000000
chr1    28  29  0.000000
chr1    29  30  0.000000
chr1    30  31  0.000000
chr1    31  32  0.000000
chr1    32  33  0.000000
chr1    33  34  0.000000
chr1    34  35  0.000000
chr1    35  36  0.500000
chr1    36  37  0.500000
chr1    37  38  0.500000
chr1    38  39  0.500000
chr1    39  40  0.500000
chr1    40  41  0.500000
chr1    41  42  0.500000
chr1    42  43  0.500000
chr1    43  44  0.500000
chr1    44  45  0.500000
chr1    45  46  0.500000
chr1    46  47  0.500000
chr1    47  48  0.500000
chr1    48  49  0.500000
chr1    49  50  0.500000
chr1    50  51  0.500000
chr1    51  52  0.500000
chr1    52  53  0.500000
chr1    53  54  0.500000
chr1    54  55  0.500000
chr1    55  56  0.500000
chr1    56  57  0.500000
chr1    57  58  0.500000
chr1    58  59  0.500000
chr1    59  60  0.500000
chr1    60  61  0.500000
chr1    61  62  0.500000
chr1    62  63  0.500000
chr1    63  64  0.500000
chr1    64  65  0.500000
chr1    65  66  0.500000
chr1    66  67  0.500000
chr1    67  68  0.500000
chr1    68  69  0.500000
chr1    69  70  0.500000
chr1    70  71  0.500000
chr1    71  72  0.500000
chr1    72  73  0.500000
chr1    73  74  0.500000
chr1    74  75  0.500000
chr1    75  76  0.500000
chr1    76  77  0.500000
chr1    77  78  0.500000
chr1    78  79  0.500000
chr1    79  80  0.500000
chr1    80  81  0.500000
chr1    81  82  0.500000
chr1    82  83  0.500000
chr1    83  84  0.500000
chr1    84  85  0.500000
chr1    85  86  0.500000
chr1    86  87  0.500000
chr1    87  88  0.500000
chr1    88  89  0.500000
chr1    89  90  0.500000
chr1    90  91  0.500000
chr1    91  92  0.500000
chr1    92  93  0.500000
chr1    93  94  0.500000
chr1    94  95  0.500000
chr1    95  96  0.500000
chr1    96  97  0.500000
chr1    97  98  0.500000
chr1    98  99  0.500000
chr1    99  100 0.500000
$ wiggletools bin 50 blah2.bw | head             
chr1    49  99  -20.500000
chr1    99  149 29.000000
chr1    149 199 27.500000
chr1    199 299 0.000000
chr1    299 349 2.026474
chr1    349 399 20.977322
chr1    399 449 24.140812
chr1    449 499 20.163421
chr1    499 549 18.877193
chr1    549 599 20.048252

Negative value in the first bin?

gevro commented 3 years ago

One more strange example in case this helps find the bug:

$ wiggletools write_bg - blah3.bw | head
chr1    76  114 0.108700
chr1    183 184 0.108700
chr1    184 204 0.326100
chr1    204 218 0.217400
chr1    255 256 0.217400
chr1    256 298 0.326100
$ wiggletools bin 20 blah3.bw | head
chr1    19  59  0.000000
chr1    59  79  0.326100
chr1    79  99  2.174000
chr1    99  119 1.630500
chr1    119 179 0.000000
chr1    179 199 5.000200
chr1    199 219 4.674100

The first bin width isn't even the correct size, and it's not clear why the bins start where they do. Seems arbitrary.

$ wiggletools write_bg - bin 20 blah3.bw | head
chr1    19  39  0.000000
chr1    39  59  0.000000
chr1    59  79  0.326100
chr1    79  99  2.174000
chr1    99  119 1.630500
chr1    119 139 0.000000
chr1    139 159 0.000000
chr1    159 179 0.000000
chr1    179 199 5.000200

Based on this, it seems that there might be two things going on: a bug with the first bin, and also the bin iterator seems to be using the compression module, but it shouldn't be.

gevro commented 3 years ago

And I think there is a 3rd bug in bin. After chr1, it really gets weird:

$ wiggletools bin 50 blah3.bw | grep -v "^chr1"       
chr2    248387999   248388049   -1433382744130566.750000
chr20   248388049   248388099   -1016193259775819.375000
chr21   248388099   248388149   -4707426071032891.000000
chr22   248388149   248388199   -4439755725746698.000000
chr3    248388199   248388249   -1204123472334506.250000
chr4    248388249   248388299   -1603404158277709.500000
chr5    248388299   248388349   -2109468357354108.250000
chr6    248388349   248388399   -880544406277438.125000
chr7    248388399   248388449   -3713302668858378.000000
chr8    248388449   248388499   -1212117984988667.750000
chr9    248388499   248388549   -4011027180602880.500000
chrM    248388549   248388599   -32047372042.806614
chrX    248388599   248388649   -657718690766102.875000
dzerbino commented 3 years ago

Hello @gevro ,

To answer your question, bin does actually fill in missing values. WiggleTools has an implicit notion of default value for iterators which is transmitted by composition (e.g. files have a default of 0, but say the offset operator would accordingly offset the default value).

The bugs you pointed out were largely due to some misguided alterations I made the past few days, which I reverted.

Please pull again from master.

Cheers,

Daniel

gevro commented 3 years ago

The main bug is fixed, but 'bin' is still not quite behaving properly. It looks like 'bin' output is still using a compression module. See below the first bin. Bin 0 to 160 is output as one bin, even though I specified bin 20. It should output 0 to 20, 20 to 40, etc. fillIn doesn't help.

$ wiggletools write_bg - blah3.bw | head
chr1    0   170 0.000000
chr1    170 183 0.250000
chr1    183 260 0.500000
chr1    260 266 1.000000
chr1    266 272 0.750000
chr1    272 304 0.833333
chr1    304 313 0.750000
chr1    313 315 0.687500
chr1    315 319 0.637500
chr1    319 323 0.687500
evrong01-i27-02 reapr$ wiggletools bin 20 blah3.bw | head
chr1    0   160 0.000000
chr1    160 180 2.500000
chr1    180 200 9.250000
chr1    200 260 10.000000
chr1    260 280 17.166664
chr1    280 300 16.666660
chr1    300 320 14.695832
chr1    320 340 14.314493
chr1    340 360 11.618706
chr1    360 380 9.329554
evrong01-i27-02 reapr$ wiggletools fillIn genome.bed bin 20 blah3.bw | head
chr1    0   160 0.000000
chr1    160 180 2.500000
chr1    180 200 9.250000
chr1    200 260 10.000000
chr1    260 280 17.166664
chr1    280 300 16.666660
chr1    300 320 14.695832
chr1    320 340 14.314493
chr1    340 360 11.618706
chr1    360 380 9.329554
evrong01-i27-02 reapr$ wiggletools bin 20 fillIn genome.bed blah3.bw | head
chr1    0   160 0.000000
chr1    160 180 2.500000
chr1    180 200 9.250000
chr1    200 260 10.000000
chr1    260 280 17.166664
chr1    280 300 16.666660
chr1    300 320 14.695832
chr1    320 340 14.314493
chr1    340 360 11.618706
chr1    360 380 9.329554

A work-around is to use write_bg, with or without fillIn, but regardless, bin should output equal sized-bins.

$ wiggletools write_bg - bin 20 fillIn genome.bed blah3.bw | head
chr1    0   20  0.000000
chr1    20  40  0.000000
chr1    40  60  0.000000
chr1    60  80  0.000000
chr1    80  100 0.000000
chr1    100 120 0.000000
chr1    120 140 0.000000
chr1    140 160 0.000000
chr1    160 180 2.500000
chr1    180 200 9.250000

$ wiggletools write_bg - bin 20 blah3.bw | head 
chr1    0   20  0.000000
chr1    20  40  0.000000
chr1    40  60  0.000000
chr1    60  80  0.000000
chr1    80  100 0.000000
chr1    100 120 0.000000
chr1    120 140 0.000000
chr1    140 160 0.000000
chr1    160 180 2.500000
chr1    180 200 9.250000
dzerbino commented 3 years ago

Hello @gevro ,

I've been thinking about this, and I'm not too sure what the right behaviour should be.

What you're observing is that by default wiggletools prints out results in wiggle format, which confusingly can be composed of bedgraph blocks (as in here). Now the wiggle file format is not meant to be easy to read but to be compact, so by default the output is compressed.

As you point out, with the write_bg command, the output is in proper bedgraph, much more legible and uncompressed (by default).

So the question is: if the command is wiggletools bin ... in what format should the output be?

For consistency, I would argue that it should stay as it is (compressed wig format), however, I appreciate that this is not super obvious to a casual user.

We could always encode an exception for a command that say ends by a bin command, but I do not know whether this would confuse people more than anything.

Let me know what you think,

Daniel

gevro commented 3 years ago

Hi Daniel, I was thinking the same thing. I agree, and probably should stay as it is.

One example though where I am not sure what to do is, for example, doing the histogram command after bin.

wiggletools histogram blah.txt 100 fillIn genome.bed bin 20 blah.bw -> Will histogram use compressed bins or uncompressed bins?

dzerbino commented 3 years ago

Hello @gevro ,

The compression only kicks in right before printing out wig output (onto stdout or a file), therefore in this case it would not kick in.

However, even if there were compression, histogram should not be affected by it, as the overall shape of the profile does not change.

You can test this by comparing:

wiggletools histogram blah.txt 100 fillIn genome.bed bin 20 blah.bw

and

wiggletools histogram blah.txt 100 compress fillIn genome.bed bin 20 blah.bw

Hope this helps,

Daniel

gevro commented 3 years ago

Great, thanks so much.