aotimme / gocsv

Command-line CSV processing utility.
MIT License
198 stars 21 forks source link
csv csv-processing

GoCSV

Command line CSV processing tool inspired by csvkit. But faster and less memory intensive. And written in Go.

To install on Apple OS X, open a Terminal window and run

/bin/bash <(curl -s https://raw.githubusercontent.com/aotimme/gocsv/master/scripts/install-latest-darwin-amd64.sh)

For other platforms, see the Installation section.

Table of Contents

Introduction

The tool is built for pipelining, so most commands accept a CSV from standard input and output to standard output.

Subcommands:

To view the usage of gocsv at the command line, use the help subcommand (i.e. gocsv help). This will also print out the version of the gocsv binary as well as the hash of the git commit of this repository on which the binary was built. To view only the version and git hash, use the version subcommand (i.e. gocsv version).

Subcommands

add

Aliases: template, tmpl

Add a column to a CSV.

Usage:

gocsv add [--prepend] [--name NAME] [--template TEMPLATE] FILE

Arguments:

Note that the --template argument for this subcommand is a string providing a template for the new column. Templates are parsed using the html/template package provided by Go and can reference any column by the name of the column, along with a special variable index that represents the row number (starting at 1).

For example, if your CSV has a column named Name, you can do

gocsv add -t "Hello, {{.Name}}! You are number {{.index}} in line."

For multi-word columns there is a slightly different syntax. Say you have a column called Full Name. Then the following template would work:

gocsv add -t 'Hello {{index . "Full Name"}}! You are number {{.index} in line.'

GoCSV has been loaded with utility functions from Sprig. This will help you to perform wide range of text manipulation on your template on top of built-in Go template functionalities.

Here is an example of how to add a new column of extracting hashtags using RegEx (RE2), sort, remove duplicates and then join with comma seperated from an existing column:

gocsv add -t '{{ regexFindAll "#[\\w\\-]+" .Comments -1 | sortAlpha | uniq | join ", " }}' --name 'Hashtags'

For further reference on the options available for text manipulation see Sprig documentation.

TIP: You will have to take note that Regular Expressions need to be escaped.

autoincrement

Alias: autoinc

Append (or prepend) a column of incrementing integers to each row. This can be helpful to be able to map back to the original row after a number of transformations.

Usage:

gocsv autoincrement [--prepend] [--name NAME] [--seed SEED] FILE

Arguments:

behead

Remove the header from a CSV

Usage:

gocsv behead [-n N] FILE

Arguments:

cap

Add a header row to a CSV

Usage:

gocsv cap --names NAMES [--truncate-names] [--default-name DEFAULT_NAME] FILE

Arguments:

The subcommand will error if:

For example:

echo Jamie,52,Purple | gocsv cap --names 'Name,Age,Favorite color'
Name,Age,Favorite color
Jamie,52,Purple
echo Jamie,52,Purple | gocsv cap --names 'Name' --default-name 'Col'
Name,Col,Col 1
Jamie,52,Purple
echo Jamie,52,Purple | gocsv cap --default-name 'Col'
Col,Col 1,Col 2
Jamie,52,Purple

clean

Clean a CSV of common formatting issues. Currently this consists of making sure all rows are the same length (padding short rows and trimming long ones) and removing empty rows at the end.

Note that this subcommand, along with other subcommands, will include a newline at the end of the last line of the outputted CSV. This is because gocsv assumes that every row in a CSV (or other-delimited text file) will end in a new line.

Usage:

gocsv clean [--verbose] [--no-trim] [--strip-bom] [--excel] [--numbers] FILE

Arguments:

Note that only one of --add-bom or --strip-bom can be specified.

delimiter

Alias: delim

Change the delimiter being used for a CSV.

Usage:

gocsv delim [--input INPUT_DELIMITER] [--output OUTPUT_DELIMITER] FILE

Arguments:

describe

Get basic information about a CSV. This will output the number of rows and columns in the CSV, the column headers in the CSV, and the inferred type of each column.

Usage

gocsv describe FILE

dimensions

Alias: dims

Get the dimensions of a CSV.

Usage

gocsv dimensions [--csv] FILE

Arguments:

filter

Filter a CSV by rows whose columns match some criterion.

Usage:

gocsv filter [--columns COLUMNS] [--equals STR] [--regex REGEX] [--gt N] [--gte N] [--lt N] [--lte N] [--exclude] FILE

Arguments:

Note that one of --regex, --equals (-eq), --gt , --gte, --lt, or --lte must be specified.

head

Extract the first N rows from a CSV.

Usage:

gocsv head [-n N] FILE

Arguments:

headers

View the headers of a CSV along with the index of each header.

Usage:

gocsv headers [--csv] FILE

Arguments:

join

Join two CSVs using an inner (default), left, right, or outer join.

Usage:

gocsv join --columns COLUMNS [--left] [--right] [--outer] LEFT_FILE RIGHT_FILE

Arguments:

Note that by default it will perform an inner join. It will exit if you specify multiple types of join.

ncol

Get the number of columns in a CSV.

Usage:

gocsv ncol FILE

nrow

Get the number of rows in a CSV.

Usage:

gocsv nrow FILE

rename

Rename the headers of a CSV.

Usage:

gocsv rename --columns COLUMNS --names NAMES FILE

Arguments:

replace

Replace values in cells by regular expression.

Usage:

gocsv replace [--columns COLUMNS] --regex REGEX --repl REPLACEMENT FILE

Arguments:

Note that if you have a capture group in the --regex argument you can reference that in the replacement argument using "\$1" for the first capture group, "\$2" for the second capture group, etc.

sample

Sample rows from a CSV

Usage

gocsv sample -n NUM_ROWS [--replace] [--seed SEED] FILE

Arguments:

select

Select (or exclude) columns from a CSV

Usage:

gocsv select --columns COLUMNS [--exclude] FILE

Arguments:

sort

Sort a CSV by multiple columns, with or without type inference. The currently supported types are float, int, date, and string.

Usage:

gocsv sort --columns COLUMNS [--stable] [--reverse] [--no-inference] FILE

Arguments:

When --stable and --reverse are both specified, the original order of equal rows is preserved (and not reversed).

split

Split a CSV into multiple files.

Usage:

gocsv split --max-rows N [--filename-base FILENAME] FILE

Arguments:

sql

Run SQL queries on CSVs.

Usage:

gocsv sql --query QUERY FILE [FILES]

Arguments:

When passing in files, you may read from standard input by specifying the filename -.

Table names are derived from the CSV filenames by taking the base filename without the file extension. For example, test-files/stats.csv is referenced as a table with the name stats. The table from standard input - should be referenced as the table stdin.

This subcommand uses SQLite3 under the hood. It attempts to infer column types for defining the SQL tables, but all the rules of dynamic typing and type affinity in SQLite3 still pertain.

See Datatypes In SQLite Version 3 for more information.

Also note that this subcommand makes no attempts to prevent SQL injection (either via the input CSVs or via the query).

stack

Stack multiple CSVs to create a larger CSV. Optionally include an indication of which file a row came from in the final CSV.

Usage:

gocsv stack [--filenames] [--groups GROUPS] [--group-name GROUP_NAME] FILE [FILES]

Arguments:

Note that --groups and --filenames are mutually exclusive.

Specifying a file by name - will read a CSV from standard input.

stats

Get some basic statistics on a CSV.

Usage:

gocsv stats FILE

tail

Extract the last N rows from a CSV.

Usage:

gocsv tail [-n N] FILE

Arguments:

transpose

Transpose a CSV.

Usage:

gocsv tranpose FILE

tsv

Transform a CSV into a TSV. It is shortand for gocsv delim -o "\t" FILE. This can very useful if you want to pipe the result to pbcopy (OS X) in order to paste it into a spreadsheet tool.

Usage:

gocsv tsv FILE

unique

Alias: uniq

Extract unique rows based upon certain columns.

Usage:

gocsv unique [--columns COLUMNS] [--sorted] [--count] FILE

Arguments

view

Display a CSV in a pretty tabular format.

Usage:

gocsv view [-n N] [--max-width N] FILE

Arguments:

If the length of a cell exceeds --max-width it will be truncated with an ellipsis.

xlsx

Convert sheets of a XLSX file to CSV.

Usage:

gocsv xlsx [--list-sheets] [--dirname DIRNAME] [--sheet SHEET] FILE

Arguments:

By default the xlsx subcommand will convert all the sheets in FILE to CSVs to a directory with the same name as FILE.

zip

Zip multiple CSVs into one CSV.

Usage:

gocsv zip FILE [FILES]

Specifying a file by name - will read a CSV from standard input.

Specifying Columns

When specifying columns on the command line (i.e. with the --columns or -c flags), you can specify either the indices or the names of the columns. The tool will always try to interpret columns first by index and then by name.

Specifying Columns by Index

The tool uses 1-based indexing (as in the output of the headers subcommand).

The tool also allows for specification of ranges with indices (e.g. 2-4) including reverse ranges (e.g. 4-2). It also allows for open-ended ranges on indexes (e.g. 2- or -4). In the former case (a-) it will include all columns from a on. In the latter case (-b) it will include all columns before b and b itself.

Specifying Columns by Name

When specifying the name of a column, it will match all columns that are exact case-sensitive matches.

When referencing a column name that has whitespace, either escape the whitespace with \ or use quotes (") around the column name.

For example, if you have a column named Hello World,

gocsv select -c "Hello World" test.csv

or

gocsv select -c Hello\ World test.csv

When referencing multiple columns, specify column names as a comma-delimited list with no spaces between the columns. If any of the column names have whitespace, enclose the entire list in a single set of quotes.

gocsv select -c "Hello World,Foo Bar" test.csv

Regular Expression Syntax

A few of the subcommands allow the ability to pass in regular expressions via a --regex flag (e.g. filter and replace).

Because the regular expressions passed in to the --regex flag are parsed by the underlying regexp Go package, see the regexp/syntax documentation for more details on the syntax. It is based on the syntax accepted by RE2.

Pipelining

Because all of the subcommands support receiving a CSV from standard input, you can easily pipeline:

cat test-files/left-table.csv \
  | gocsv join --left --columns LID,RID test-files/right-table.csv \
  | gocsv filter --columns XYZ --regex "[ev]e-\d$" \
  | gocsv select --columns LID,XYZ \
  | gocsv sort --columns LID,XYZ

Pipelining Support

Subcommand Input Output
add
autoincrement
behead
clean
delimiter
describe N/A
dimensions *
filter
head
headers *
join
ncol N/A
nrow N/A
rename
replace
sample
select
sort
split N/A
sql
stack
stats N/A
tail
transpose
tsv
unique
view N/A
xlsx N/A

* dimensions and headers write to CSV format when using the --csv argument.

stack and sql read from standard input when specifying the filename as -.

xlsx sends output to standard out when using the --sheet flag.

Changing the Default Delimiter

While gocsv generally assumes standard CSVs (per RFC 4180), you can specify a default delimiter other than , using the GOCSV_DELIMITER environment variable. The delimiter must evaluate to exactly 1 "rune". If it does not, gocsv will error.

For example, to use semicolon-delimited files:

export GOCSV_DELIMITER=";"
gocsv select -c 1 semicolon-delimited.scsv

Or, to use tab-delimited files (TSVs):

export GOCSV_DELIMITER="\t"
gocsv select -c 1 tab-delimited.tsv

Or, for more exotic delimiters you can use hexadecimal or unicode (e.g. \x01 or \u0001 for the SOH delimiter):

export GOCSV_DELIMITER="\x01"
gocsv select -c 1 soh-delimited.tsv

Examples

Copy Values
gocsv tsv test-files/left-table.csv | pbcopy
Reorder Columns
gocsv select --columns 2,1 test-files/left-table.csv
Duplicate Columns
gocsv select --columns 1,1,2,2 test-files/left-table.csv
VLOOKUP aka Join
gocsv join --left --columns LID,RID test-files/left-table.csv test-files/right-table.csv
Distinct Column Values
gocsv select --columns LID test-files/left-table.csv | gocsv behead | sort | uniq
Count of Distinct Column Values
gocsv select --columns LID test-files/left-table.csv | gocsv behead | sort | uniq -c | sort -nr
Extract Rows Matching Regular Expression
gocsv filter --columns ABC --regex "-1$" test-files/left-table.csv
Extract Rows with Blank Values in Column
gocsv filter --columns Stringer --regex "^$" test-files/stats.csv

If you also want to match on cells that have only whitespace, you can use a regular expression like "^s*$".

Replace Content in Cells By Regular Expression
gocsv replace --columns ABC --regex "^(.*)-(\d)$" -i --repl "\$2-\$1" test-files/left-table.csv
Sort by Multiple Columns
gocsv sort --columns LID,ABC --reverse test-files/left-table.csv
Combine Multiple CSVs
gocsv stack --groups "Primer Archivo,Segundo Archivo,Tercer Archivo" --group-name "Orden de Archivo" test-files/stack-1.csv test-files/stack-2.csv test-files/stack-3.csv

To do the same via pipelining through standard input,

cat test-files/stack-1.csv | gocsv stack --groups "Primer Archivo,Segundo Archivo,Tercer Archivo" --group-name "Orden de Archivo" - test-files/stack-2.csv test-files/stack-3.csv
Create a Column from a Template
cat test-files/stats.csv | gocsv add -t "Row {{.index}}: {{if eq .Boolean \"T\"}}{{.Floater}}{{else}}{{.Integer}}{{end}}" -name "Integer or Floater"

Debugging

To enable debugging mode when running a gocsv command, specify the --debug command line argument to any subcommand (other than gocsv help and gocsv version). Any errors will then also print out a stack trace.

Installation

For the latest pre-built binaries, cross-compiled using xgo, see the Latest Release page.

Apple OS X

Simple Version

Open a Terminal window and paste the following command:

/bin/bash <(curl -s https://raw.githubusercontent.com/aotimme/gocsv/master/scripts/install-latest-darwin-amd64.sh)

This will install gocsv at /usr/local/bin/gocsv.

Detailed Version

To install the pre-built binary for Apple OS X, download the gocsv-darwin-amd64.zip file. It should download into your ~/Downloads directory. To install it, open a Terminal window and do the following:

cd ~/Downloads
unzip gocsv-darwin-amd64.zip
mv gocsv-darwin-amd64/gocsv /usr/local/bin
rmdir gocsv-darwin-amd64

To verify that it has installed, open a new Terminal window and run

gocsv help

You should see the gocsv help message.

Linux

Installing the pre-built binary for Linux is very similar to installing the binary for Apple OS X. First, download gocsv-linux-amd64.zip. Assuming this downloads to your ~/Downloads directory, open a Terminal window and run the following commands:

cd ~/Downloads
unzip gocsv-linux-amd64.zip
mv gocsv-linux-amd64/gocsv /usr/local/bin
rmdir gocsv-linux-amd64

To verify that it has installed, open a new Terminal window and run

gocsv help

You should see the gocsv help message.

Windows

Download gocsv-windows-amd64.zip. Unzip the file and you should see a file gocsv.exe. Put that executable in the appropriate location and it should work.