Open kjedamzik opened 6 years ago
See also #82.
I would also vouch for this. I find it better suited to unix workflows and it kind of mimics the `sort -u` gimmick. I am not fond of #82's `uniq` command.
I would also filter duplicate rows on strict equality of the column selection. This means that in some cases one line would be arbitrarily chosen over another, but we can't be too clever about it anyway, unless we add a flag forcing equality to be checked on the whole line or on another selection of fields.
I can probably open a PR about this if required.
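For what it's worth, the column-keyed dedup described above (first occurrence wins, later duplicates dropped arbitrarily) could be sketched like this in Python; the data and the choice of key column are invented purely for illustration:

```python
import csv
import io

def dedup_rows(rows, key_indices):
    """Yield only the first row seen for each distinct tuple of key-column values."""
    seen = set()
    for row in rows:
        key = tuple(row[i] for i in key_indices)
        if key not in seen:
            seen.add(key)
            yield row

data = "id,name,city\n1,Ann,Paris\n2,Ann,Lyon\n3,Bob,Paris\n"
reader = csv.reader(io.StringIO(data))
header = next(reader)
# Dedup on the "name" column only (index 1): the second "Ann" row is
# dropped, arbitrarily keeping the first occurrence.
unique = list(dedup_rows(reader, key_indices=[1]))
print(unique)
```

Extending the key to all columns gives whole-line dedup, which is what the proposed flag would toggle between.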
I opened #238 regarding this issue.
I would prefer a built-in tool within xsv. Otherwise, if we're restricted to adopting other tools, there are too many choices, as numerous tools can run SQL syntax on CSV to do the same. It could be something like https://github.com/harelba/q/

An excerpt to illustrate, e.g.:
```sh
wget --content-disposition "https://data.education.gouv.fr/explore/dataset/fr-en-carte-scolaire-colleges-publics/download/?format=csv&timezone=Europe/Berlin&lang=fr&use_labels_for_header=true&csv_separator=%3B"

# from input file
q -H "SELECT DISTINCT code_region,libelle_region,code_academie,libelle_academie,code_departement,libelle_departement,code_insee,libelle_commune,Code_RNE FROM fr-en-carte-scolaire-colleges-publics.csv ORDER BY code_departement,code_insee" -d ';'

# if using stdin as input
cat fr-en-carte-scolaire-colleges-publics.csv | q -H "SELECT DISTINCT code_region,libelle_region,code_academie,libelle_academie,code_departement,libelle_departement,code_insee,libelle_commune,Code_RNE FROM - ORDER BY code_departement,code_insee" -d ';'
```
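If pulling in q is undesirable, the same `SELECT DISTINCT` idea can be reproduced with Python's standard-library `sqlite3`. The tiny two-column table below is invented just to show the shape, not taken from the real dataset:

```python
import csv
import io
import sqlite3

# An invented two-column extract with one duplicate row.
data = "code_region;libelle_region\n11;Ile-de-France\n11;Ile-de-France\n84;Auvergne-Rhone-Alpes\n"
rows = list(csv.reader(io.StringIO(data), delimiter=";"))
header, body = rows[0], rows[1:]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (code_region TEXT, libelle_region TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", body)
distinct = con.execute(
    "SELECT DISTINCT code_region, libelle_region FROM t ORDER BY code_region"
).fetchall()
print(distinct)
```

This is essentially what q does under the hood: load the CSV into SQLite and run the query against it.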
You can use the `-O` option on the `q` command line if you want headers in the output.
Well, what I do (also just stumbled upon this):
```sh
xsv fmt file.csv | xsv sort | uniq -u
```
...which also keeps the header line intact, since it only occurs once.
@Radiergummi Nice – that will work for many use-cases.
Just note that it doesn't work when the individual records contain newlines – e.g. user-generated content like posts in a StackOverflow or Reddit data dump (those are generally unique, but you get the point).
@malthejorgensen Wouldn't those line breaks be escaped in the output from `xsv sort`?
They are escaped by putting `"` around the value, so the raw newlines are still present in the CSV, meaning that `uniq` will not detect duplicates that contain newlines:
```
# sample.csv
ID,value,date
comment_1,"Yesterday,
I went for a long walk",2022-07-01
comment_1,"Yesterday,
I went for a long walk",2022-07-01
comment_2,"Today,
I stayed inside",2022-07-02
```

```sh
> xsv fmt sample.csv | xsv sort | uniq -u
# Outputs `sample.csv` verbatim
```
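For comparison, a CSV-aware dedup (a minimal Python sketch, not part of xsv) parses full records before comparing them, so the embedded newlines in `sample.csv` that defeat line-based `uniq` are handled correctly:

```python
import csv
import io

# The sample.csv above, with real newlines inside the quoted fields.
sample = (
    "ID,value,date\n"
    'comment_1,"Yesterday,\nI went for a long walk",2022-07-01\n'
    'comment_1,"Yesterday,\nI went for a long walk",2022-07-01\n'
    'comment_2,"Today,\nI stayed inside",2022-07-02\n'
)

reader = csv.reader(io.StringIO(sample))
header = next(reader)
seen = set()
unique = []
for row in reader:  # each `row` is a whole record, embedded newlines and all
    key = tuple(row)
    if key not in seen:
        seen.add(key)
        unique.append(row)
print(len(unique))
```

Because the reader yields one logical record per iteration regardless of line breaks inside quoted fields, the duplicate `comment_1` record is detected and dropped.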
It would be nice to have a `--unique` flag for `xsv sort`, for example: