Open kjedamzik opened 6 years ago
See also #82.
I would also vouch for this. I find it better suited to unix workflows and it kind of mimics the `sort -u` gimmick. I am not fond of #82's `uniq` command.
I would also filter duplicate rows on strict equality of the column selection. This means that in some cases one line would be arbitrarily chosen over another, but we can't be too clever about it anyway, unless we add a flag forcing equality to be checked on the whole line or on another selection of fields.
I can probably open a PR about this if required.
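For what it's worth, the column-keyed dedup described above (first occurrence wins, later duplicates dropped arbitrarily) could be sketched like this in Python; the data and the choice of key column are invented purely for illustration:

```python
import csv
import io

def dedup_rows(rows, key_indices):
    """Yield only the first row seen for each distinct tuple of key-column values."""
    seen = set()
    for row in rows:
        key = tuple(row[i] for i in key_indices)
        if key not in seen:
            seen.add(key)
            yield row

data = "id,name,city\n1,Ann,Paris\n2,Ann,Lyon\n3,Bob,Paris\n"
reader = csv.reader(io.StringIO(data))
header = next(reader)
# Dedup on the "name" column only (index 1): the second "Ann" row is
# dropped, arbitrarily keeping the first occurrence.
unique = list(dedup_rows(reader, key_indices=[1]))
print(unique)
```

Extending the key to all columns gives whole-line dedup, which is what the proposed flag would toggle between.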
I opened #238 regarding this issue.
I would prefer a built-in tool within xsv. Otherwise, if we're restricted to adopting other tools, there are too many choices, as numerous tools can run SQL syntax on CSV to do the same. It could be something like https://github.com/harelba/q/

An excerpt to illustrate, e.g.:
```sh
wget --content-disposition "https://data.education.gouv.fr/explore/dataset/fr-en-carte-scolaire-colleges-publics/download/?format=csv&timezone=Europe/Berlin&lang=fr&use_labels_for_header=true&csv_separator=%3B"

# from input file
q -H "SELECT DISTINCT code_region,libelle_region,code_academie,libelle_academie,code_departement,libelle_departement,code_insee,libelle_commune,Code_RNE FROM fr-en-carte-scolaire-colleges-publics.csv ORDER BY code_departement,code_insee" -d ';'

# if using stdin as input
cat fr-en-carte-scolaire-colleges-publics.csv | q -H "SELECT DISTINCT code_region,libelle_region,code_academie,libelle_academie,code_departement,libelle_departement,code_insee,libelle_commune,Code_RNE FROM - ORDER BY code_departement,code_insee" -d ';'
```
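If pulling in q is undesirable, the same `SELECT DISTINCT` idea can be reproduced with Python's standard-library `sqlite3`. The tiny two-column table below is invented just to show the shape, not taken from the real dataset:

```python
import csv
import io
import sqlite3

# An invented two-column extract with one duplicate row.
data = "code_region;libelle_region\n11;Ile-de-France\n11;Ile-de-France\n84;Auvergne-Rhone-Alpes\n"
rows = list(csv.reader(io.StringIO(data), delimiter=";"))
header, body = rows[0], rows[1:]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (code_region TEXT, libelle_region TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", body)
distinct = con.execute(
    "SELECT DISTINCT code_region, libelle_region FROM t ORDER BY code_region"
).fetchall()
print(distinct)
```

This is essentially what q does under the hood: load the CSV into SQLite and run the query against it.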
You can use the `-O` option on the `q` command line if you want headers in the output.
Well, what I do (also just stumbled upon this):
```sh
xsv fmt file.csv | xsv sort | uniq -u
```
...which also keeps the header line intact, since it only occurs once.
@Radiergummi Nice – that will work for many use-cases.
Just note that it doesn't work when the individual records contain newlines – e.g. user-generated content like posts in a StackOverflow or Reddit data dump (those are generally unique, but you get the point).
@malthejorgensen Wouldn't those line breaks be escaped in the output from `xsv sort`?
They are escaped by putting `"` around the value, so the raw newlines are still present in the CSV, meaning that `uniq` will not detect duplicates that contain newlines:
```
# sample.csv
ID,value,date
comment_1,"Yesterday,
I went for a long walk",2022-07-01
comment_1,"Yesterday,
I went for a long walk",2022-07-01
comment_2,"Today,
I stayed inside",2022-07-02
```

```sh
> xsv fmt sample.csv | xsv sort | uniq -u
# Outputs `sample.csv` verbatim
```
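For comparison, a CSV-aware dedup (a minimal Python sketch, not part of xsv) parses full records before comparing them, so the embedded newlines in `sample.csv` that defeat line-based `uniq` are handled correctly:

```python
import csv
import io

# The sample.csv above, with real newlines inside the quoted fields.
sample = (
    "ID,value,date\n"
    'comment_1,"Yesterday,\nI went for a long walk",2022-07-01\n'
    'comment_1,"Yesterday,\nI went for a long walk",2022-07-01\n'
    'comment_2,"Today,\nI stayed inside",2022-07-02\n'
)

reader = csv.reader(io.StringIO(sample))
header = next(reader)
seen = set()
unique = []
for row in reader:  # each `row` is a whole record, embedded newlines and all
    key = tuple(row)
    if key not in seen:
        seen.add(key)
        unique.append(row)
print(len(unique))
```

Because the reader yields one logical record per iteration regardless of line breaks inside quoted fields, the duplicate `comment_1` record is detected and dropped.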
It would be nice to have a `--unique` flag for `xsv sort`, for example: