BurntSushi / xsv

A fast CSV command line toolkit written in Rust.
The Unlicense
10.38k stars 323 forks source link

[feature] "smart" join with column merge #113

Open eddy-geek opened 6 years ago

eddy-geek commented 6 years ago

Current behaviour:

❯ echo "1,2\na,aa\nc,cc" > X.csv
❯ echo "1,3\na,aaa\nb,bbb" > Y.csv
❯ xsv join --full 1 X.csv 1 Y.csv | xsv table
1   2   1   3
a   aa  a   aaa
c   cc
        b   bbb

Wanted behaviour

❯ xsv join --merge --full 1 X.csv 1 Y.csv | xsv table
1   2      3
a   aa     aaa
c   cc
b          bbb

Why?

BurntSushi commented 6 years ago

This does seem like something that I might be willing to support. Do you have an implementation path in mind? If it's complex (which includes having a complex UX), then I'm less willing to maintain it. If there's something simple we can do, then I think I'd be happy!

eddy-geek commented 6 years ago

You're fast! I'll see if I can come up with something... from a quick look at https://github.com/BurntSushi/xsv/blob/master/src/cmd/join.rs#L222 it seems we need to basically implement the equivalent of "coalesce" on top of the current process, since it applies to all kind of joins maybe it should be implemented as a separate command? Maybe not worth exposing it, though.

jstallkamp commented 4 years ago

Any news on this? I just stumbled across the need for this feature as well....

eddy-geek commented 4 years ago

My PR https://github.com/BurntSushi/xsv/pull/114 was never reviewed. I've switched to python+pandas for all my csv needs now, but if @BurntSushi ever reviews the PR I'm still interested i getting it merged.

Otherwise, feel free to build that PR branch, it was working for me.

peterjc commented 3 years ago

The current join behaviour with two copies of the merge column is one of the use cases for #283 about a command to drop duplicated columns.

robjtede commented 2 years ago

I think it's similar to the dedupe idea above, but SQL has a function called coalesce which simply takes the first non-null value out of a list of columns. It could be useful in other areas as a subcommand for xsv which would solve this column merge case, too, given all the duplicated columns have the same content, using the first would be correct.

eg: xsv join ... | xsv coalesce 1,3,5 [--rest] or something.

Edit: came back and noticed that the coalesce comparison has been made already 😶

ghuls commented 2 years ago

@eddy-geek If you want fast CSV parsing and manipulation of columnnar data, take a look at polars: https://github.com/pola-rs/polars/

CSV parsing in polars uses csv core and once the CSV file is parsed it is converted to arrow columnnar data for the actual dataprocessing.

Some external benchmark between various open source database like tools: https://h2oai.github.io/db-benchmark/

The example above can be be implemented like this:

In [55]: import polars as pl

In [56]: ! printf "1,2\na,aa\nc,cc\n" > X.csv

In [57]: ! printf "1,3\na,aaa\nb,bbb\n" > Y.csv

In [58]: df_x = pl.read_csv("X.csv", has_headers=True, sep=",")

In [59]: df_y = pl.read_csv("Y.csv", has_headers=True, sep=",")

In [60]: df_x
Out[60]: 
shape: (2, 2)
┌─────┬─────┐
│ 1   ┆ 2   │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪═════╡
│ a   ┆ aa  │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ c   ┆ cc  │
└─────┴─────┘

In [61]: df_y
Out[61]: 
shape: (2, 2)
┌─────┬─────┐
│ 1   ┆ 3   │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪═════╡
│ a   ┆ aaa │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ b   ┆ bbb │
└─────┴─────┘

In [62]: df_x.join(df_y, on="1", how="outer").select(pl.col(["1", "2", "3"]))
Out[62]: 
shape: (3, 3)
┌─────┬──────┬──────┐
│ 1   ┆ 2    ┆ 3    │
│ --- ┆ ---  ┆ ---  │
│ str ┆ str  ┆ str  │
╞═════╪══════╪══════╡
│ a   ┆ aa   ┆ aaa  │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b   ┆ null ┆ bbb  │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ c   ┆ cc   ┆ null │
└─────┴──────┴──────┘

In [63]: print(df_x.join(df_y, on="1", how="outer").select(pl.col(["1", "2", "3"])).to_csv())
1,2,3
a,aa,aaa
b,,bbb
c,cc,