Data-Liberation-Front / csvlint.rb

The gem behind http://csvlint.io
MIT License

CSV Lint

A Ruby gem for validating the syntax and contents of CSV files. You can use it within your own Ruby code or as a standalone command-line application.

Summary of features

Development

Ruby version 3.3

Tests

The codebase includes both rspec and cucumber tests, which can be run together using:

$ rake

or separately:

$ rake spec
$ rake features

When the cucumber tests are first run, a script will create tests based on the latest version of the CSV on the Web test suite, including creating a local cache of the test files. This requires an internet connection and some patience. Following that download, the tests will run locally; there's also a batch script:

$ bin/run-csvw-tests

which will run the tests from the command line.

If you need to refresh the CSV on the Web tests:

$ rm bin/run-csvw-tests
$ rm features/csvw_validation_tests.feature
$ rm -r features/fixtures/csvw

and then either run the cucumber tests again or run:

$ ruby features/support/load_tests.rb

Installation

Add this line to your application's Gemfile:

gem 'csvlint'

And then execute:

$ bundle

Or install it yourself as:

$ gem install csvlint

Usage

You can use this gem within your own Ruby code or as a standalone command-line application.

On the command line

After installing the gem, you can validate a CSV on the command line like so:

csvlint myfile.csv

You may need to add the gem executable directory to your PATH, for example by adding '/usr/local/lib/ruby/gems/2.6.0/bin' (or the equivalent for your Ruby version) to the PATH entry in your .bash_profile.

You will then see the validation result, together with any warnings or errors e.g.

myfile.csv is INVALID
1. blank_rows. Row: 3
1. title_row.
2. inconsistent_values. Column: 14

You can also optionally pass a schema file like so:

csvlint myfile.csv --schema=schema.json

Via pre-commit

Add the following to your .pre-commit-config.yaml file:

repos: # `pre-commit autoupdate` to get latest available tags

  - repo: https://github.com/Data-Liberation-Front/csvlint.rb
    rev: v1.2.0
    hooks:
      - id: csvlint

Run pre-commit install to enable it on your repository.

To force a manual run of pre-commit, use the command:

pre-commit run -a

In your own Ruby code

Currently the gem supports validating a CSV retrieved from a URL, a File, or an IO-style object (e.g. StringIO):

require 'csvlint'

validator = Csvlint::Validator.new( "http://example.org/data.csv" )
validator = Csvlint::Validator.new( File.new("/path/to/my/data.csv" ))
validator = Csvlint::Validator.new( StringIO.new( my_data_in_a_string ) )

When validating from a URL, the range of errors and warnings is wider, as the library also checks the HTTP headers against best practices.

#invoke the validation
validator.validate

#check validation status
validator.valid?

#access array of errors, each is a Csvlint::ErrorMessage object
validator.errors

#access array of warnings
validator.warnings

#access array of information messages
validator.info_messages

#get some information about the CSV file that was validated
validator.encoding
validator.content_type
validator.extension
validator.row_count

#retrieve HTTP headers from request
validator.headers
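
The accessors above can be combined into a short report. The following is only a sketch: it assumes that each message object responds to type, row and column (suggested by the command-line output shown earlier, but not confirmed in this document), and it uses hypothetical stand-in structs (Message, FakeValidator) so that it runs without the gem or a network connection. With csvlint installed, a real Csvlint::Validator could be passed to report instead.

```ruby
# Hypothetical stand-ins so the sketch runs without csvlint installed.
# They only mimic the methods documented above.
Message = Struct.new(:type, :row, :column)
FakeValidator = Struct.new(:errors, :warnings, :info_messages) do
  def valid?
    errors.empty?
  end
end

# Render one message as "type. Row: n. Column: m", skipping missing parts.
# Assumption: messages respond to #type, #row and #column.
def describe(message)
  parts = [message.type.to_s]
  parts << "Row: #{message.row}" if message.row
  parts << "Column: #{message.column}" if message.column
  parts.join(". ")
end

# Build a list of report lines from any object that offers
# valid?, errors, warnings and info_messages.
def report(name, validator)
  lines = ["#{name} is #{validator.valid? ? 'VALID' : 'INVALID'}"]
  validator.errors.each { |m| lines << "error: #{describe(m)}" }
  validator.warnings.each { |m| lines << "warning: #{describe(m)}" }
  validator.info_messages.each { |m| lines << "info: #{describe(m)}" }
  lines
end

validator = FakeValidator.new(
  [Message.new(:blank_rows, 3, nil)],
  [Message.new(:title_row, nil, nil), Message.new(:inconsistent_values, nil, 14)],
  []
)
puts report("myfile.csv", validator)
```

Because report relies only on duck typing, the same helper can be reused for URL, File and StringIO sources.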

Controlling CSV Parsing

The validator supports configuration of the CSV Dialect used in a data file. This is specified by passing a dialect hash to the constructor:

dialect = {
    "header" => true,
    "delimiter" => ","
}
validator = Csvlint::Validator.new( "http://example.org/data.csv", dialect )

The options should be a Hash that conforms to the CSV Dialect JSON structure.

While these options configure the parser to correctly process the file, the validator will still raise errors or warnings for CSV structure that it considers to be invalid, e.g. a missing header or different delimiters.
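
To make the two dialect keys above concrete, here is a minimal sketch that reads a dialect from JSON and applies it with Ruby's standard CSV library. It is not how csvlint processes dialects internally; the csv_options helper and the fallback values it uses are assumptions made for illustration.

```ruby
require "csv"
require "json"

# A dialect as it might be stored in a JSON file.
dialect = JSON.parse('{"header": true, "delimiter": ";"}')

# Hypothetical helper: translate the "header" and "delimiter" keys
# into the equivalent options of Ruby's standard CSV parser.
def csv_options(dialect)
  {
    headers: dialect.fetch("header", true),   # treat the first row as a header
    col_sep: dialect.fetch("delimiter", ",")  # field separator
  }
end

data = "id;price\n1;9.99\n2;4.50\n"
rows = CSV.parse(data, **csv_options(dialect)).map(&:to_h)
rows.each { |row| puts row.inspect }
```

With "header" set to true, the first line supplies the keys of each row; with a different "delimiter", the same data would be read as a single column.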

Note that the parser will also check for a header parameter on the Content-Type header returned when fetching a remote CSV file. As specified in RFC 4180, the value of this parameter can be either present or absent, e.g.:

Content-Type: text/csv; header=present
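
As an illustration of the header parameter, the sketch below extracts it from a Content-Type value. The header_parameter function is hypothetical and is not part of csvlint's API; it only shows the kind of parsing involved.

```ruby
# Return "present" or "absent" when the Content-Type value carries a
# recognised header parameter, and nil otherwise.
def header_parameter(content_type)
  # Everything after the first ";" is a list of name=value parameters.
  params = content_type.to_s.split(";").drop(1)
  params.each do |param|
    name, value = param.strip.split("=", 2)
    next unless name && name.casecmp?("header")

    value = value.to_s.delete('"').downcase
    return value if %w[present absent].include?(value)
  end
  nil
end

puts header_parameter("text/csv; header=present").inspect
puts header_parameter("text/csv; charset=utf-8; header=absent").inspect
puts header_parameter("text/csv").inspect
```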

Error Reporting

The validator provides feedback on a validation result using instances of Csvlint::ErrorMessage. Messages are divided into errors, warnings and information messages. A validation attempt is successful if there are no errors.

Messages provide context including:

Errors

The following types of error can be reported:

Warnings

The following types of warning can be reported:

Information Messages

There are also information messages available:

Schema Validation

The library supports validating data against a schema. A schema configuration can be provided as a Hash or parsed from JSON. The structure currently follows JSON Table Schema with some extensions, and there is rudimentary support for CSV on the Web Metadata.

An example JSON Table Schema schema file is:

{
    "fields": [
        {
            "name": "id",
            "constraints": {
                "required": true,
                "type": "http://www.w3.org/TR/xmlschema-2/#integer"
            }
        },
        {
            "name": "price",
            "constraints": {
                "required": true,
                "minLength": 1
            }
        },
        {
            "name": "postcode",
            "constraints": {
                "required": true,
                "pattern": "[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-Z]{2}"
            }
        }
    ]
}

An equivalent CSV on the Web Metadata file is:

{
    "@context": "http://www.w3.org/ns/csvw",
    "url": "http://example.com/example1.csv",
    "tableSchema": {
        "columns": [
            {
                "name": "id",
                "required": true,
                "datatype": { "base": "integer" }
            },
            {
                "name": "price",
                "required": true,
                "datatype": { "base": "string", "minLength": 1 }
            },
            {
                "name": "postcode",
                "required": true
            }
        ]
    }
}

Parsing and validating with a schema (of either kind):

schema = Csvlint::Schema.load_from_json(uri)
validator = Csvlint::Validator.new( "http://example.org/data.csv", nil, schema )
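
To show what the required, minLength and pattern constraints in the JSON Table Schema example mean for a single row, here is a plain-Ruby sketch. It is not csvlint's implementation: check_row and the problem names it returns are illustrative assumptions, and the type constraint is left out for brevity.

```ruby
require "json"

# The same constraints as the JSON Table Schema example, minus "type".
schema = JSON.parse(<<~SCHEMA)
  {
    "fields": [
      { "name": "id", "constraints": { "required": true } },
      { "name": "price", "constraints": { "required": true, "minLength": 1 } },
      { "name": "postcode", "constraints": {
          "required": true,
          "pattern": "[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-Z]{2}" } }
    ]
  }
SCHEMA

# Hypothetical helper: return [field, problem] pairs for one row,
# where the row is a Hash keyed by column name.
def check_row(schema, row)
  problems = []
  schema["fields"].each do |field|
    name = field["name"]
    value = row[name].to_s
    rules = field.fetch("constraints", {})

    if value.empty?
      problems << [name, :missing_value] if rules["required"]
      next # nothing else to check on an empty value
    end

    if rules["minLength"] && value.length < rules["minLength"]
      problems << [name, :min_length]
    end

    # Anchor the pattern so the whole value has to match.
    if rules["pattern"] && !Regexp.new("\\A(?:#{rules['pattern']})\\z").match?(value)
      problems << [name, :pattern]
    end
  end
  problems
end

good = { "id" => "1", "price" => "9.99", "postcode" => "SW1A 1AA" }
bad  = { "id" => "",  "price" => "5",    "postcode" => "12345" }
puts check_row(schema, good).inspect
puts check_row(schema, bad).inspect
```

The first row satisfies every constraint; the second is reported for an empty required id and a postcode that does not match the pattern.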

CSV on the Web Validation Support

This gem passes all the validation tests in the official CSV on the Web test suite (though there might still be errors or parts of the CSV on the Web standard that aren't tested by that test suite).

JSON Table Schema Support

Supported constraints:

Supported data types (this is still a work in progress):

Use of an unknown data type will result in the column failing to validate.

Schema validation provides some additional types of error and warning messages:

Other validation options

You can also provide an optional options hash as the fourth argument to Validator#new. Supported options are:

options = {
  limit_lines: 100
}
validator = Csvlint::Validator.new( "http://example.org/data.csv", nil, nil, options )
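
The lambda option is called with the validator as each line is validated. For example, to print the current line number for every line:

options = {
  lambda: ->(validator) { puts validator.current_line }
}
validator = Csvlint::Validator.new( "http://example.org/data.csv", nil, nil, options )
# Output:
# 1
# 2
# 3
# 4
# .....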