Ensembl / WiggleTools

Basic operations on the space of numerical functions defined on the genome using lazy evaluators for flexibility and efficiency
Apache License 2.0
143 stars 25 forks source link

seek output duplicate rows #88

Closed gevro closed 1 year ago

gevro commented 1 year ago

Hi, I'm seeing a weird bug.

When I add seek to this:

wiggletools write_bg - compress gte 0.2 test.bw

it looks like this:

wiggletools seek chr2 1 1000000 write_bg - compress gte 0.2 test.bw

The ouptut for some reason then duplicates some rows:

> less -N chr2.seekfirst.bg
      1 chr2    160     220     1.000000
      2 chr2    3460    3940    1.000000
      3 chr2    4020    61280   1.000000
      4 chr2    61440   119720  1.000000
      5 chr2    119820  120120  1.000000
      6 chr2    120140  120160  1.000000
      7 chr2    120220  120380  1.000000
      8 chr2    120400  120780  1.000000
      9 chr2    120960  151340  1.000000
     10 chr2    151380  151600  1.000000
     11 chr2    151620  151780  1.000000
...
239 chr2    2822040 2822440 1.000000
    240 chr2    2822460 2822500 1.000000
    241 chr2    2822580 2876040 1.000000
    242 chr2    160     220     1.000000
    243 chr2    3460    3940    1.000000
    244 chr2    4020    61280   1.000000
    245 chr2    61440   119720  1.000000
    246 chr2    119820  120120  1.000000
    247 chr2    120140  120160  1.000000
    248 chr2    120220  120380  1.000000
    249 chr2    120400  120780  1.000000
    250 chr2    120960  151340  1.000000
    251 chr2    151380  151600  1.000000
    252 chr2    151620  151780  1.000000

You can see it goes back to the chr2 160 220 position, which is the first data set.

So in other words, I'm getting sequential back-to-back copies of the output.

gevro commented 1 year ago

Also, if I do trim first (before write_bg), I'm getting a different result than if I do trim last:

> head chr2.trimfirst.bg
chr1    680 940 1.000000
chr1    3060    3220    1.000000
chr1    3260    3280    1.000000
chr1    3420    9580    1.000000
chr1    9700    10280   1.000000
chr1    10340   14540   1.000000
chr1    14640   24200   1.000000
chr1    24260   24280   1.000000
chr1    26460   27500   1.000000

> head chr2.trimaftergte.bg
chr2    160 220 1.000000
chr2    3460    3940    1.000000
chr2    4020    61280   1.000000
chr2    61440   119720  1.000000
chr2    119820  120120  1.000000
chr2    120140  120160  1.000000
chr2    120220  120380  1.000000
chr2    120400  120780  1.000000

It actually looks like when I do trim first, it isn't doing anything because it is outputting all genomic locations.

dzerbino commented 1 year ago

This is because write_bg - prints out to standard output regardless of its position in the command, but the left most command also prints out to stdout. If needed, you can squash the latter by putting do at the very beginning of the command.

gevro commented 1 year ago

Ok, so just to confirm - the correct way to do this is to put trim and seek after 'write_bg -' ? wiggletools write_bg - seek chr2 1 1000000 compress gte 0.2 test.bw wiggletools write_bg - trim trim.bed compress gte 0.2 test.bw

dzerbino commented 1 year ago

Yes, that's correct.