Open eddy-geek opened 6 years ago
This does seem like something that I might be willing to support. Do you have an implementation path in mind? If it's complex (which includes having a complex UX), then I'm less willing to maintain it. If there's something simple we can do, then I think I'd be happy!
You're fast! I'll see if I can come up with something... from a quick look at https://github.com/BurntSushi/xsv/blob/master/src/cmd/join.rs#L222 it seems we need to basically implement the equivalent of "coalesce" on top of the current process, since it applies to all kind of joins maybe it should be implemented as a separate command? Maybe not worth exposing it, though.
Any news on this? I just stumbled across the need for this feature as well....
My PR https://github.com/BurntSushi/xsv/pull/114 was never reviewed. I've switched to python+pandas for all my csv needs now, but if @BurntSushi ever reviews the PR I'm still interested i getting it merged.
Otherwise, feel free to build that PR branch, it was working for me.
The current join behaviour with two copies of the merge column is one of the use cases for #283 about a command to drop duplicated columns.
I think it's similar to the dedupe idea above, but SQL has a function called coalesce
which simply takes the first non-null value out of a list of columns. It could be useful in other areas as a subcommand for xsv which would solve this column merge case, too, given all the duplicated columns have the same content, using the first would be correct.
eg: xsv join ... | xsv coalesce 1,3,5 [--rest]
or something.
Edit: came back and noticed that the coalesce comparison has been made already 😶
@eddy-geek If you want fast CSV parsing and manipulation of columnnar data, take a look at polars: https://github.com/pola-rs/polars/
CSV parsing in polars uses csv core and once the CSV file is parsed it is converted to arrow columnnar data for the actual dataprocessing.
Some external benchmark between various open source database like tools: https://h2oai.github.io/db-benchmark/
The example above can be be implemented like this:
In [55]: import polars as pl
In [56]: ! printf "1,2\na,aa\nc,cc\n" > X.csv
In [57]: ! printf "1,3\na,aaa\nb,bbb\n" > Y.csv
In [58]: df_x = pl.read_csv("X.csv", has_headers=True, sep=",")
In [59]: df_y = pl.read_csv("Y.csv", has_headers=True, sep=",")
In [60]: df_x
Out[60]:
shape: (2, 2)
┌─────┬─────┐
│ 1 ┆ 2 │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪═════╡
│ a ┆ aa │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ c ┆ cc │
└─────┴─────┘
In [61]: df_y
Out[61]:
shape: (2, 2)
┌─────┬─────┐
│ 1 ┆ 3 │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪═════╡
│ a ┆ aaa │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ bbb │
└─────┴─────┘
In [62]: df_x.join(df_y, on="1", how="outer").select(pl.col(["1", "2", "3"]))
Out[62]:
shape: (3, 3)
┌─────┬──────┬──────┐
│ 1 ┆ 2 ┆ 3 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════╪══════╪══════╡
│ a ┆ aa ┆ aaa │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b ┆ null ┆ bbb │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ c ┆ cc ┆ null │
└─────┴──────┴──────┘
In [63]: print(df_x.join(df_y, on="1", how="outer").select(pl.col(["1", "2", "3"])).to_csv())
1,2,3
a,aa,aaa
b,,bbb
c,cc,
Current behaviour:
Wanted behaviour
Why?
xsv select
maybe)Z.csv
, still on column 1, I can't with the current implementationNotice how I got 2 "b" lines now. which the user clearly wanted joined. With the
--merged
mode I could build the expected output:Is this something that you would consider? Or does it deviate from the philosophy of the tool? Do you see a better solution to the multi-join problem?