Ensembl / WiggleTools

Basic operations on the space of numerical functions defined on the genome using lazy evaluators for flexibility and efficiency
Apache License 2.0

Cannot stream standard input to apply or apply_paste #93

Open jluquette opened 4 months ago

jluquette commented 4 months ago

I am trying to use wiggletools to average a signal over a set of bins defined in a BED file by piping the output of a previous command into wiggletools' stdin. Based on the EBNF grammar, it looks like the following command should work, but instead it fails, claiming that it cannot open the "-" file:

bash$ cat my_signal.bg | wiggletools apply_paste out.txt meanI bins.bed -
Cannot open input file -

The command works as expected if the signal bedGraph is not streamed:

bash$ wiggletools apply_paste out.txt meanI bins.bed my_signal.bg
bash$ head out.txt
22  0   100 0.000000
22  100 200 0.000000
22  200 300 0.000000
22  300 400 0.000000
22  400 500 0.000000
22  500 600 0.000000
22  600 700 0.000000
22  700 800 0.000000
22  800 900 0.000000
22  900 1000    0.000000

The situation is the same using apply directly (both the error above and the correct output when the file is directly specified). There are other tools for the use case of averaging over a BED, but I'd like to be able to build more complicated computations with wiggletools.

Why isn't "-" recognized as an iterator of type in_filename? Perhaps I'm missing something obvious - any feedback would be greatly appreciated.

dzerbino commented 4 months ago

Hello @jluquette ,

Without going into implementation details, this is a curious side effect of the apply_paste function, which triggers an unexpected exception when the last parameter is "-" i.e. standard input.

Sorry for the inconvenience,

Daniel

jluquette commented 4 months ago

Thanks @dzerbino for the very quick response.

Is there a workaround? I've tried replacing the final - with an iterator that returns the original stream unmodified (e.g., scale 1 -), but that didn't work either. The apply function also has the same behavior, perhaps due to the same implementation quirk, so that won't solve my issue either unfortunately.

dzerbino commented 4 months ago

Hello again,

I've given it some thought, and although the code can always be improved, this would ultimately run into a design contradiction.

Fundamentally, the apply and apply_paste operators apply a statistical function (in this case meanI) to an input dataset (e.g. standard input) along regions of interest (bins.bed). The regions of interest can overlap or be unsorted, meaning that WiggleTools needs a way to move arbitrarily backwards and forwards through the input dataset. This is quite easy when the input dataset is a file or a file-based iterator, but standard input is a stream, which cannot be rewound. The obvious workaround would be to buffer the entirety of standard input, but this would create an open-ended memory liability that would break WiggleTools' memory-minimal design pattern.

In conclusion, the best workaround (which you found out already) is to save the input dataset to a file, then process that with WiggleTools, essentially using the file system as a buffer. I appreciate that there are circumstances where you are disk-space limited and this can be tricky, but by the same token placing the burden on memory (generally speaking a more limited resource) does not seem to me a sustainable solution. If you are constrained by write permissions, a possibility (on Linux) would be to write into the /tmp directory, which is set aside specifically for that kind of purpose.
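
Using the file system as a buffer can be sketched as a small shell wrapper. This is only an illustration and not part of WiggleTools: the function name buffer_stdin_then_run and the file name stdin.bg are made up for the example, and it assumes that the streamed dataset is the last argument of the command and that wiggletools recognises a bedGraph by its .bg suffix.

```shell
# Hypothetical helper (not part of WiggleTools): save standard input to a
# temporary file, run the given command with that file appended as its last
# argument, then clean up. The body runs in a subshell, so its variables and
# its `exit` do not leak into the calling shell.
buffer_stdin_then_run() (
    # mktemp -d honours $TMPDIR and otherwise falls back to /tmp.
    dir=$(mktemp -d) || exit 1

    # Assumption: the .bg suffix is what marks the file as a bedGraph.
    cat > "$dir/stdin.bg"

    # Run the command on the buffered copy and remember its exit status.
    "$@" "$dir/stdin.bg"
    rc=$?

    rm -rf "$dir"
    exit "$rc"
)
```

With this in place, the failing pipeline from the first comment would become `cat my_signal.bg | buffer_stdin_then_run wiggletools apply_paste out.txt meanI bins.bed`. The temporary copy needs as much disk space as the whole stream, which is the trade-off described above.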

Hope this helps,

Daniel

jluquette commented 4 months ago

Thanks for the explanation - that makes plenty of sense.

Perhaps it'd be worth mentioning in the documentation and/or pointing out the difference in the EBNF grammar? By the way, is there a more up-to-date documentation source than the GitHub README? Some things (like the cat reducer) aren't mentioned in there.

Really enjoying the tool - thanks for the great work.

dzerbino commented 4 months ago

The GitHub README is the only documentation. Indeed, there has been some drift in the documentation, and I should review it.