embulk / embulk-filter-column

A filter plugin for Embulk to filter out columns
Apache License 2.0

Column filter plugin for Embulk

Configuration

Example - columns

Say input.csv is as follows:

time,id,key,score
2015-07-13,0,Vqjht6YE,1370
2015-07-13,1,VmjbjAA0,3962
2015-07-13,2,C40P5H1W,7323

filters:
  - type: column
    columns:
      - {name: time, default: "2015-07-13", format: "%Y-%m-%d"}
      - {name: id}
      - {name: key, default: "foo"}

This configuration reduces the columns to only time, id, and key:

time,id,key
2015-07-13,0,Vqjht6YE
2015-07-13,1,VmjbjAA0
2015-07-13,2,C40P5H1W

Note that column types are automatically retrieved from input data (inputSchema).
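To make the columns semantics concrete, here is a minimal Python sketch that models the behavior on the sample rows above. It is an illustration only, not the plugin's implementation: select_columns is a hypothetical helper, values stay plain strings, and the sketch assumes that default replaces a value only when the input value is missing (None).

```python
import csv
import io

INPUT_CSV = """time,id,key,score
2015-07-13,0,Vqjht6YE,1370
2015-07-13,1,VmjbjAA0,3962
2015-07-13,2,C40P5H1W,7323
"""


def select_columns(rows, specs):
    """Keep only the columns named in specs, in the order given.

    Assumption of this sketch: a spec's "default" is used only when
    the input value is None.
    """
    result = []
    for row in rows:
        kept = {}
        for spec in specs:
            value = row.get(spec["name"])
            if value is None and "default" in spec:
                value = spec["default"]
            kept[spec["name"]] = value
        result.append(kept)
    return result


rows = list(csv.DictReader(io.StringIO(INPUT_CSV)))
specs = [
    {"name": "time", "default": "2015-07-13"},
    {"name": "id"},
    {"name": "key", "default": "foo"},
]
selected = select_columns(rows, specs)
for row in selected:
    print(",".join(row.values()))
```

Because none of the sample values are missing, the defaults are never used here and the score column is simply dropped.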

Example - add_columns

Say input.csv is as follows:

time,id,key,score
2015-07-13,0,Vqjht6YE,1370
2015-07-13,1,VmjbjAA0,3962
2015-07-13,2,C40P5H1W,7323

filters:
  - type: column
    add_columns:
      - {name: d, type: timestamp, default: "2015-07-13", format: "%Y-%m-%d"}
      - {name: copy_id, src: id}

This configuration adds a d column and a copy_id column, which is a copy of the id column:

time,id,key,score,d,copy_id
2015-07-13,0,Vqjht6YE,1370,2015-07-13,0
2015-07-13,1,VmjbjAA0,3962,2015-07-13,1
2015-07-13,2,C40P5H1W,7323,2015-07-13,2
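The add_columns behavior can be modeled the same way. The Python sketch below is a hypothetical illustration, not the plugin's code: add_columns here is an assumed helper name, timestamps are kept as plain strings rather than parsed with format, and a column without src is assumed to take its default as a constant value.

```python
def add_columns(rows, specs):
    """Append columns to each row, keeping all input columns.

    A spec with "src" copies an existing column; otherwise its "default"
    is used as a constant value (an assumption of this sketch).
    """
    result = []
    for row in rows:
        extended = dict(row)  # copy so the input rows are not mutated
        for spec in specs:
            if "src" in spec:
                extended[spec["name"]] = row[spec["src"]]
            else:
                extended[spec["name"]] = spec.get("default")
        result.append(extended)
    return result


rows = [
    {"time": "2015-07-13", "id": 0, "key": "Vqjht6YE", "score": 1370},
    {"time": "2015-07-13", "id": 1, "key": "VmjbjAA0", "score": 3962},
    {"time": "2015-07-13", "id": 2, "key": "C40P5H1W", "score": 7323},
]
specs = [
    {"name": "d", "default": "2015-07-13"},
    {"name": "copy_id", "src": "id"},
]
added = add_columns(rows, specs)
for row in added:
    print(",".join(str(value) for value in row.values()))
```

New columns are appended after the existing ones, in the order they are listed in the configuration.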

Example - drop_columns

Say input.csv is as follows:

time,id,key,score
2015-07-13,0,Vqjht6YE,1370
2015-07-13,1,VmjbjAA0,3962
2015-07-13,2,C40P5H1W,7323

filters:
  - type: column
    drop_columns:
      - {name: time}
      - {name: id}

This configuration drops the time and id columns:

key,score
Vqjht6YE,1370
VmjbjAA0,3962
C40P5H1W,7323
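For completeness, drop_columns is the complement of columns: everything is kept except the named columns. A minimal Python sketch, with drop_columns as a hypothetical helper name and not the plugin's implementation:

```python
def drop_columns(rows, names):
    """Return rows without the columns listed in names."""
    dropped = set(names)
    return [
        {key: value for key, value in row.items() if key not in dropped}
        for row in rows
    ]


rows = [
    {"time": "2015-07-13", "id": 0, "key": "Vqjht6YE", "score": 1370},
    {"time": "2015-07-13", "id": 1, "key": "VmjbjAA0", "score": 3962},
    {"time": "2015-07-13", "id": 2, "key": "C40P5H1W", "score": 7323},
]
remaining = drop_columns(rows, ["time", "id"])
```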

JSONPath

For a type: json column, you can specify a JSONPath as the column name:

- {name: $.payload.key1}
- {name: "$.payload.array[0]"}
- {name: "$.payload.array[*]"}
- {name: "$['payload']['key1.key2']"}
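To show which value each of these paths addresses, here is a small Python sketch of a path resolver. It is a hypothetical illustration and not the JSONPath engine the plugin uses: resolve is an assumed helper that handles only dot children, quoted bracket children, and numeric array indexes, and it rejects everything else, including the [*] wildcard.

```python
import re

# One path step: .name, ['name'], or [index]
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\['([^']*)'\]|\[(\d+)\]")


def resolve(doc, path):
    """Walk doc along a simple JSONPath and return the addressed value."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    pos, current = 1, doc
    while pos < len(path):
        match = _STEP.match(path, pos)
        if match is None:
            raise ValueError(f"unsupported step: {path[pos:]!r}")
        dot_key, bracket_key, index = match.groups()
        if index is not None:
            current = current[int(index)]
        else:
            current = current[dot_key if dot_key is not None else bracket_key]
        pos = match.end()
    return current


doc = {"payload": {"key1": "a", "array": [10, 20], "key1.key2": "b"}}
print(resolve(doc, "$.payload.key1"))
print(resolve(doc, "$.payload.array[0]"))
print(resolve(doc, "$['payload']['key1.key2']"))
```

The bracket form is what makes a key that itself contains a dot, such as key1.key2, addressable.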

EXAMPLE:

The following JSONPath operators are not supported:

Note that type: timestamp is not available for add_columns or columns here, because Embulk's type: json cannot contain a timestamp column.

Also note that renaming or copying JSON paths with the src option is only partially supported so far. The parent JSON path must be the same, as in:

- {name: $.payload.foo.dest, src: $.payload.foo.src}

That is, the example below does not work yet, because the parent paths differ ($.payload.foo and $.payload.bar):

- {name: $.payload.foo.dest, src: $.payload.bar.src}
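The same-parent rule can be expressed as a simple check. The Python sketch below is a hypothetical illustration that handles dot notation only; parent_path and can_copy are assumed names, not part of the plugin.

```python
def parent_path(path):
    """Return everything before the last '.' of a dot-notation JSONPath."""
    parent, sep, _leaf = path.rpartition(".")
    if not sep:
        raise ValueError(f"no parent in path: {path!r}")
    return parent


def can_copy(name, src):
    """True when name and src share the same parent JSON path."""
    return parent_path(name) == parent_path(src)


print(can_copy("$.payload.foo.dest", "$.payload.foo.src"))
print(can_copy("$.payload.foo.dest", "$.payload.bar.src"))
```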

Development

Run example:

$ ./gradlew gem
$ embulk preview -I build/gemContents/lib example/example.yml

Run test:

$ ./gradlew test

Run test with coverage reports:

$ ./gradlew test jacocoTestReport

Then open build/reports/jacoco/test/html/index.html to view the report.

Run checkstyle:

$ ./gradlew check

Run only checkstyle:

$ ./gradlew checkstyleMain
$ ./gradlew checkstyleTest

For Maintainers

Release

Modify the version in build.gradle on a detached commit, then tag that commit with an annotation.

git checkout --detach master

(Edit: Remove "-SNAPSHOT" in "version" in build.gradle.)

git add build.gradle

git commit -m "Release vX.Y.Z"

git tag -a vX.Y.Z

(Edit: Write a tag annotation in the changelog format.)

See Keep a Changelog for the changelog format. We adopt part of it for the Git tag annotation, as shown below.

## [X.Y.Z] - YYYY-MM-DD

### Added
- Added a feature.

### Changed
- Changed something.

### Fixed
- Fixed a bug.

Then push the annotated tag. After approval, this triggers a release operation on GitHub Actions.

git push -u origin vX.Y.Z