eBay / tsv-utils

eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.
https://ebay.github.io/tsv-utils/
Boost Software License 1.0
1.42k stars 80 forks source link

Column-wise maths (sum-product?) #68

Open Llammissar opened 7 years ago

Llammissar commented 7 years ago

Hey, been a while! Just hit another thing that'd be useful for my work, so I figured I'd put it here.

Consider the following file sample.tsv:

script  weight  actions
foo.js  100 6
bar.js  600 6
baz.js  1200    5
qux.js  500 2

If we want to know the total number of actions for each script, each second column value needs to be multiplied by the third for each line. Right now, I can get this far:

tsv-select --fields 2,3 sample.tsv | \
    keep-header -- /bin/sh -c \
        '(sed "s/\t/ \* /" | xargs printf "%d\n" $(bc $1))'

Which will give me:

weight
600
3600
6000
1000

Hm. Not quite. Better still, the rows match, but we can't join this with the current tools. We can sort of fake it with some awful circumlocutions involving number-lines:

tsv-select --fields 2,3 sample.tsv | \
    keep-header -- /bin/sh -c \
        '(sed "s/\t/ \* /" | xargs printf "%d\n" $(bc $1))' | \
    number-lines --H | \
    tsv-join -H  -k 1 -a 2,3,4 -f <(number-lines -H sample.tsv) | \
    tsv-select -H -f 3,4,5,2

gives me:

script  weight  actions weight
foo.js  100 6   600
bar.js  600 6   3600
baz.js  1200    5   6000
qux.js  500 2   1000

Well, it's better. The new column is named wrong, though. Dang.

More broadly, that's only multiplication. What I'd like to see is something like... this isn't really tsv-summarize, which is columnar reductions. Maybe it's an advanced selection? tsv-select -H -f1,2,3 --multiply 2:3:total_actions Or maybe it's a new tool; tsv-eval or something. Hm. :/

jondegenhardt commented 7 years ago

The specific example you listed can be handled pretty easily using awk. More generally though, none of the tsv-utilities tools provides a way to generate new values with mathematical expressions, or in the case of tsv-filter, the ability to filter rows based on a mathematical formula. The general reason I haven't introduced such facilities is that I haven't found a way to do this without essentially recreating significant parts of awk. awk is a great tool, there's no reason to rebuild it. I have a couple ideas on this front, but nothing I'm likely to get to soon.

Llammissar commented 7 years ago

Embarrassingly, I've barely used awk for anything, and simply didn't think to reach for it. I spent a few minutes playing and the result was substantially better, so I can see your point. Didn't really help the column naming, but that's a smaller issue.

In the long term, I think there are things that would be nice to have in this general category, but agree that it's not an especially high priority.

jondegenhardt commented 7 years ago

One of the reasons I wrote these tools is that awk, powerful as it is, becomes error prone to write as soon as things get at all complicated. And, it is possible to get a large portion of the power by providing a set of much simpler and less error prone primitives. An important part of the intent of these tools.

But, for general mathematical manipulation, it seems hard to create a language that is significantly simpler than awk, or at least that's my feeling for now.