eBay / tsv-utils

eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.
https://ebay.github.io/tsv-utils/
Boost Software License 1.0
1.43k stars 80 forks source link

Multi-file Join #134

Open Llammissar opened 6 years ago

Llammissar commented 6 years ago

Hey there, been a while. :)

I just tired to use tsv-join on a bunch of files and was very surprised when it applied the filter file to each of them in turn. Is there an approach that lets me nicely pull them all together that I've missed?

Why?: One of the really nice things about tsv-join is, since you're working with a key, ordering doesn't matter and holes in the data can be mitigated. But it's clumsy to have to repeatedly join a whole pile of intermediate things to get an end result.

To clarify: with the following data files...

left

label   v1
foo     1
bar     11
baz     111

centre

label   v2
foo     2
bar     22
baz     222

right

label   v3
foo     3
bar     33
baz     333

...I'd like to see something like this:

$ tsv-join -f left -k 1 left centre right
label   v1      v2      v3
foo     1       2       3
bar     11      22      33
baz     111     222     333

Instead, that just prints each file in turn. (N.B. I reused the filter file as a data file because I want to preserve the column ordering.)

Trying this instead...

$ tsv-join -f right -k 1 -a 2 left centre
label   v1      v3
foo     1       3
bar     11      33
baz     111     333
label   v2      v3
foo     2       3
bar     22      33
baz     222     333

...closer. Sort of. (I guess?)

This is the best I've got right now and I think it should be reliable but I haven't tested it on something substantial:

$ tsv-select -f1 left > labels
$ for I in left centre right; do tsv-join -f left -k1 $I | tsv-select -f2 > intermediate-$I; done
$ paste labels intermediate-{left,centre,right}
label   v1      v2      v3
foo     1       2       3
bar     11      22      33
baz     111     222     333
jondegenhardt commented 6 years ago

Hey, good to see you back! So, there is a way to do this:

$ tsv-join -f right -k 1 -a 2 centre | tsv-join -f - -k 1 left -a 2-3
label   v1  v2  v3
foo 1   2   3
bar 11  22  33
baz 111 222 333

I agree though, the current tsv-join APIs are clunky for this case. And it's a common enough case. Generally I'm pretty happy with the APIs of the TSV Utilities tools, but this one could be better. It's on my list to create an alternate API or perhaps a another tool that would make this simpler. Perhaps a tsv-join-files tool, or something like that. I'm not sure when I'll get to it though.

jondegenhardt commented 6 years ago

Here's another way in the current tool. This will probably feel more natural than the first example I gave:

$ tsv-join left.tsv -k 1 -f centre.tsv -a 2 | tsv-join -f right.tsv -k 1 -a 2
label   v1  v2  v3
foo 1   2   3
bar 11  22  33
baz 111 222 333

Both forms are doing similar things: join two files in the first tsv-join call, passing the output to another tsv-join call. Every subsequent tsv-join adds one more file. The second version should be better because the files are added in the order you want them listed. And, the same column numbers are appended each time. In the first form the column numbers have to be adjusted each call.

The results of the two styles are the same if there are no duplicate keys. This is normally the case when doing this type of multi-file join operation.

The time duplicate keys matters is when you are using one file as a filter on another the other. In that case, the "filter file" (-f|--filter-file) is being used as a filter on all the other files. The other files may have duplicate keys, the join doesn't care. (It's streaming join, so you can have an infinite length input.)

Llammissar commented 6 years ago

Ah, thanks for the tip. So generalising it looks something like this:

function rabid-join {
        for I in $*; do
                if [[ -z $SEED ]]; then
                        SEED=$I;
                        continue;
                elif [[ -z $FORE ]]; then
                        FORE="tsv-join $SEED -k 1 -f $I -a 2";
                        printf "$FORE"
                        continue
                else
                        printf " | tsv-join -f $I -k 1 -a 2"
                fi
        done
}

eval $(rabid-join left centre right more-right further-right absurdly-right)

Only problem I have is I can't seem to convince this to deal with the situation where a later file has keys the first one does not.