chronicle-app/chronicle-etl

A CLI toolkit for extracting and working with your digital history

Are you trying to archive your digital history or incorporate it into your own projects? You’ve probably discovered how frustrating it is to get machine-readable access to your own data. While building a memex, I learned first-hand what great efforts must be made before you can begin using the data in interesting ways.

If you don’t want to spend all your time writing scrapers, reverse-engineering APIs, or parsing export data, this tool is for you! (If you do enjoy these things, please see the open issues.)

chronicle-etl is a CLI tool that gives you a unified interface to your personal data. It uses the ETL pattern to extract data from a source (e.g. your local browser history, a directory of images, goodreads.com reading history), transform it (into a given schema), and load it to a destination (e.g. a CSV file, JSON, external API).
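The extract, transform, load flow can be sketched in a few lines of Ruby. This illustrates the pattern only, not chronicle-etl's internals: `run_job`, the lambdas, and the output keys are all hypothetical names.

```ruby
require 'json'

# Hypothetical helper: run one job by wiring three callables together.
def run_job(extractor:, transformer:, loader:)
  raw_records = extractor.call                               # extract
  records = raw_records.map { |raw| transformer.call(raw) }  # transform
  loader.call(records)                                       # load
end

# Extract: read comma-separated text into an array of hashes
source = "title,url\nExample,https://example.com\nRuby,https://www.ruby-lang.org\n"
extract_rows = lambda do
  header, *rows = source.lines.map { |line| line.chomp.split(',') }
  rows.map { |row| header.zip(row).to_h }
end

# Transform: rename keys into a made-up target shape
to_bookmark = ->(row) { { 'type' => 'bookmark', 'name' => row['title'], 'href' => row['url'] } }

# Load: serialize records as line-separated JSON
to_json_lines = ->(records) { records.map { |record| JSON.generate(record) }.join("\n") }

puts run_job(extractor: extract_rows, transformer: to_bookmark, loader: to_json_lines)
```

Keeping the three stages independent is what lets any source be paired with any destination.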

What does `chronicle-etl` give you?

A CLI tool for working with personal data. You can monitor progress of exports, manipulate the output, set up recurring jobs, manage credentials, and more.
Plugins for many third-party sources (see list). The plugin system gives you access to data from dozens of third-party services through one common CLI.
A common, opinionated schema. You can normalize different datasets into a single schema so that, for example, all your iMessages and emails share one representation. (Don’t want to use this schema? chronicle-etl always allows you to fall back on working with the raw extraction data.)
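To make "normalizing into a single schema" concrete, here is a rough sketch. The field names on both sides (`sender`, `from`, `actor`, `body`, `occurred_at`) are invented for illustration and are not the Chronicle Schema, which this README does not document.

```ruby
# Hypothetical raw records, shaped differently per source
imessage = { 'sender' => '+15555550100', 'text' => 'Running late', 'date' => '2022-03-01T09:15:00Z' }
email    = { 'from' => 'ada@example.com', 'subject' => 'Agenda', 'sent_at' => '2022-03-01T10:00:00Z' }

# One small mapping function per source, both emitting the same keys
def normalize_imessage(raw)
  { 'provider' => 'imessage', 'actor' => raw['sender'], 'body' => raw['text'], 'occurred_at' => raw['date'] }
end

def normalize_email(raw)
  { 'provider' => 'email', 'actor' => raw['from'], 'body' => raw['subject'], 'occurred_at' => raw['sent_at'] }
end

records = [normalize_email(email), normalize_imessage(imessage)]

# Every record has the same keys, so downstream code can treat them uniformly
records.sort_by { |record| record['occurred_at'] }.each do |record|
  puts format('%-10s %-18s %s', record['provider'], record['actor'], record['body'])
end
```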

Chronicle-ETL in action

Longer screencast

Installation

Using homebrew:

$ brew install chronicle-app/etl/chronicle-etl

Using rubygems:

$ gem install chronicle-etl

Confirm it installed successfully:

$ chronicle-etl --version

Basic usage and running jobs

# Display help
$ chronicle-etl help

# Run a basic job
$ chronicle-etl --extractor NAME --transformer NAME --loader NAME

# Read data.csv and display it to stdout as a table
$ chronicle-etl --extractor csv --input data.csv --loader table

# Show available plugins and install one
$ chronicle-etl plugins:list
$ chronicle-etl plugins:install imessage

# Retrieve iMessage messages from the last 5 hours
$ chronicle-etl -e imessage --since 5h

# Get email senders from an .mbox email archive file
$ chronicle-etl --extractor email:mbox -i sample-email-archive.mbox -t email --fields actor.slug

# Save an access token as a secret and use it in a job
$ chronicle-etl secrets:set pinboard access_token username:foo123
$ chronicle-etl secrets:list # Verify that it's available
$ chronicle-etl -e pinboard --since 1mo # Used automatically based on plugin name
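The `--fields actor.slug` option in the mbox example selects a nested field. One way such dotted-path selection could work on a nested record is sketched below. `pick_fields` is a hypothetical helper and chronicle-etl's own implementation may differ.

```ruby
# Hypothetical helper: keep only the requested fields from a nested record.
# A field such as "actor.slug" is read as a path through nested hashes.
def pick_fields(record, fields)
  fields.to_h do |field|
    [field, record.dig(*field.split('.'))]
  end
end

record = {
  'verb' => 'messaged',
  'actor' => { 'slug' => 'ada@example.com', 'name' => 'Ada' }
}

p pick_fields(record, ['actor.slug'])
```

A path that does not exist in the record yields `nil` for that field rather than an error.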

Common options

Options:
  -e, [--extractor=NAME]                 # Extractor class. Default: stdin
      [--extractor-opts=key:value]       # Extractor options
  -t, [--transformer=NAME]               # Transformer class. Default: null
      [--transformer-opts=key:value]     # Transformer options
  -l, [--loader=NAME]                    # Loader class. Default: json
      [--loader-opts=key:value]          # Loader options
  -i, [--input=FILENAME]                 # Input filename or directory
      [--since=DATE]                     # Load records SINCE this date (or fuzzy time duration)
      [--until=DATE]                     # Load records UNTIL this date (or fuzzy time duration)
      [--limit=N]                        # Only extract the first LIMIT records
      [--schema=SCHEMA_NAME]             # Which Schema to transform
                                         # Possible values: chronicle, activitystream, schemaorg, chronobase
      [--format=SCHEMA_NAME]             # How to serialize results
                                         # Possible values: jsonapi, jsonld
  -o, [--output=OUTPUT]                  # Output filename
      [--fields=field1 field2 ...]       # Output only these fields
      [--header-row], [--no-header-row]  # Output the header row of tabular output

      [--log-level=LOG_LEVEL]            # Log level (debug, info, warn, error, fatal)
                                         # Default: info
  -v, [--verbose], [--no-verbose]        # Set log level to verbose
      [--silent], [--no-silent]          # Silence all output
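The `--since` and `--until` options accept either a date or a fuzzy duration such as `5h`, `10d`, or `1mo`. The sketch below shows one way such values could be resolved to a timestamp. It is an assumption-laden illustration: `parse_since` is hypothetical, only the suffixes that appear in this README are handled, and a month is approximated as 30 days.

```ruby
require 'time'

# Seconds per unit. Treating a month as 30 days is a simplifying assumption.
def unit_seconds
  { 'h' => 3600, 'd' => 86_400, 'mo' => 30 * 86_400 }
end

# Hypothetical helper: turn "5h" or "1mo" into a Time relative to `now`,
# and fall back to parsing the value as a date otherwise.
def parse_since(value, now: Time.now)
  match = value.match(/\A(\d+)(h|d|mo)\z/)
  return Time.parse(value) unless match

  now - (match[1].to_i * unit_seconds.fetch(match[2]))
end

now = Time.utc(2022, 3, 1, 12, 0, 0)
puts parse_since('5h', now: now).iso8601
puts parse_since('10d', now: now).iso8601
puts parse_since('2022-01-15', now: now).iso8601
```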

Saving a job

You can save details about a job to a local config file (saved by default in ~/.config/chronicle/etl/jobs/JOB_NAME.yml) to save yourself the trouble of specifying options each time.

# Save a job named 'sample' to ~/.config/chronicle/etl/jobs/sample.yml
$ chronicle-etl jobs:save sample --extractor pinboard --since 10d

# Run the job
$ chronicle-etl jobs:run sample

# Show details about the job
$ chronicle-etl jobs:show sample

# Edit a job definition with default editor ($EDITOR)
$ chronicle-etl jobs:edit sample

# Show all saved jobs
$ chronicle-etl jobs:list
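Conceptually, saving a job means persisting its options to a YAML file and reading them back later. The sketch below assumes the file is a flat YAML hash of options. The key names and layout are invented, and the real file format may differ. It writes to a temporary directory instead of ~/.config.

```ruby
require 'yaml'
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: persist a job's options as JOB_NAME.yml
def save_job(dir, name, options)
  FileUtils.mkdir_p(dir)
  path = File.join(dir, "#{name}.yml")
  File.write(path, options.to_yaml)
  path
end

# Hypothetical helper: read a saved job's options back
def load_job(dir, name)
  YAML.safe_load(File.read(File.join(dir, "#{name}.yml")))
end

Dir.mktmpdir do |config_root|
  jobs_dir = File.join(config_root, 'chronicle', 'etl', 'jobs')
  # The key names below are invented for illustration
  save_job(jobs_dir, 'sample', { 'extractor' => 'pinboard', 'since' => '10d' })
  puts File.read(File.join(jobs_dir, 'sample.yml'))
  p load_job(jobs_dir, 'sample')
end
```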

Connectors and plugins

Connectors let you work with different data formats or third-party sources.

Built-in Connectors

chronicle-etl comes with several built-in connectors for common formats and sources.

# List all available connectors
$ chronicle-etl connectors:list

Extractors

csv - Load records from CSV files or stdin
json - Load JSON (either line-separated objects or one object)
file - Load from a single file or directory (with a glob pattern)
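The json extractor accepts either line-separated objects or one object. A minimal sketch of handling both shapes follows. `parse_json_input` is a hypothetical helper, not the extractor's actual code, and passing a top-level array through unchanged is an added assumption.

```ruby
require 'json'

# Hypothetical helper: accept either one JSON document or line-separated JSON
# objects, and always return an array of records.
def parse_json_input(text)
  parsed = JSON.parse(text)
  parsed.is_a?(Array) ? parsed : [parsed]
rescue JSON::ParserError
  # Not a single valid document, so try one object per non-empty line
  text.each_line.map(&:strip).reject(&:empty?).map { |line| JSON.parse(line) }
end

p parse_json_input('{"title": "one"}')
p parse_json_input(%({"title": "one"}\n{"title": "two"}\n))
```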

Transformers

null - (default) Don’t do anything and pass on raw extraction data
sampler - Sample a percentage of records from the extraction
sort - Sort extracted results by key and direction
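As a rough illustration of what the sampler and sort transformers do, here are two hypothetical stand-ins. The function names and parameters are assumptions, not the built-in transformers' options.

```ruby
# Keep roughly `percent` percent of records. Pass a seeded Random for repeatable runs.
def sample_records(records, percent:, rng: Random.new)
  records.select { rng.rand < percent / 100.0 }
end

# Sort records by a key, ascending by default or descending on request.
def sort_records(records, key:, direction: :asc)
  sorted = records.sort_by { |record| record[key] }
  direction == :desc ? sorted.reverse : sorted
end

records = [{ 'id' => 3 }, { 'id' => 1 }, { 'id' => 2 }]
p sort_records(records, key: 'id', direction: :desc)
p sample_records(records, percent: 50, rng: Random.new(42)).size
```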

Loaders

json - (default) Load records serialized as JSON
table - Output an ascii table of records. Useful for exploring data.
csv - Load records to CSV
rest - Send JSON to a REST API

Chronicle Plugins for third-party services

Plugins provide access to data from third-party platforms, services, or formats. Plugins are packaged as separate gems and can be installed through the CLI (under the hood, this runs gem install chronicle-PLUGINNAME).

Plugin usage

# List available plugins
$ chronicle-etl plugins:list

# Install a plugin
$ chronicle-etl plugins:install NAME

# Use a plugin
$ chronicle-etl plugins:install imessage
$ chronicle-etl --extractor imessage --limit 10

# Uninstall a plugin
$ chronicle-etl plugins:uninstall NAME

Available plugins and connectors

The following is the list of officially supported plugins and their available connectors:

Plugin	Type	Identifier	Description
apple-podcasts	extractor	listens	listening history of podcast episodes
apple-podcasts	transformer	listen	a podcast episode listen to Chronicle Schema
email	extractor	imap	emails over an IMAP connection
email	extractor	mbox	emails from an .mbox file
email	transformer	email	email to Chronicle Schema
foursquare	extractor	checkins	Foursquare visits
foursquare	transformer	checkin	checkin to Chronicle Schema
github	extractor	activity	user activity stream
imessage	extractor	messages	iMessages from local macOS
imessage	transformer	message	iMessage to Chronicle Schema
pinboard	extractor	bookmarks	Pinboard.in bookmarks
pinboard	transformer	bookmark	bookmark to Chronicle Schema
safari	extractor	browser-history	browser history
safari	transformer	browser-history	browser history to Chronicle Schema
shell	extractor	history	shell command history (bash / zsh)
shell	transformer	command	command to Chronicle Schema
spotify	extractor	liked-tracks	liked tracks
spotify	extractor	saved-albums	saved albums
spotify	extractor	listens	recently listened tracks (last 50 tracks)
spotify	transformer	like	like to Chronicle Schema
spotify	transformer	listen	listen to Chronicle Schema
spotify	authorizer		OAuth authorizer
zulip	extractor	private-messages	private messages
zulip	transformer	message	message to Chronicle Schema

Coming soon

A few dozen importers exist in my Memex project and I'm porting them over to the Chronicle system. The Chronicle Plugin Tracker lets you keep track of what's available and what's coming soon.

If you don't see a plugin for a third-party provider or data source that you're interested in using with chronicle-etl, please open an issue. If you want to work together on a plugin, please get in touch!

In summary, the following are coming soon: anki, arc, bear, chrome, facebook, firefox, fitbit, foursquare, git, github, goodreads, google-calendar, images, instagram, lastfm, shazam, slack, strava, timing, things, twitter, whatsapp, youtube.

Writing your own plugin

Additional connectors are packaged as separate Ruby gems. You can view the iMessage plugin for an example.

If you want to load a custom connector without creating a gem, you can help by completing this issue.

If you want to work together on a connector, please get in touch!

Sample custom Extractor class

# TODO

Secrets Management

If your job needs secrets such as access tokens or passwords, chronicle-etl has a built-in secret management system.

Secrets are organized in namespaces. Typically, you use one namespace per plugin (pinboard secrets for the pinboard plugin). When you run a job that uses the pinboard plugin extractor, for example, the secrets from that namespace are automatically included in the extractor's options. To override which secrets get included, set secrets: ALT-NAMESPACE in the connector options.
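The namespace lookup described above can be sketched as a small merge step. This is a guess at the behaviour, not the actual implementation: `options_with_secrets` is hypothetical, and letting explicitly passed options win over stored secrets is an assumption.

```ruby
# Hypothetical in-memory view of stored secrets, keyed by namespace
secrets = {
  'pinboard'     => { 'access_token' => 'username:foo123' },
  'pinboard-alt' => { 'access_token' => 'different-username:foo123' }
}

# Hypothetical helper: pick a namespace (the plugin name unless the options
# carry a `secrets` override) and merge its secrets into the connector options.
def options_with_secrets(plugin, options, secrets)
  namespace = options.fetch('secrets', plugin)
  rest = options.reject { |key, _value| key == 'secrets' }
  secrets.fetch(namespace, {}).merge(rest)
end

# Default: the namespace matches the plugin name
p options_with_secrets('pinboard', { 'since' => '1mo' }, secrets)

# Override: use the 'pinboard-alt' namespace instead
p options_with_secrets('pinboard', { 'since' => '1mo', 'secrets' => 'pinboard-alt' }, secrets)
```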

Under the hood, secrets are stored in ~/.config/chronicle/etl/secrets/NAMESPACE.yml with 0600 permissions on each file.
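For a sense of what "a YAML file per namespace with 0600 permissions" looks like in practice, here is a sketch that writes to a temporary directory. `set_secret` is a hypothetical helper and the flat key/value layout inside the file is an assumption. The permission check applies to POSIX systems.

```ruby
require 'yaml'
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: store one secret in NAMESPACE.yml, readable by its owner only.
def set_secret(dir, namespace, key, value)
  FileUtils.mkdir_p(dir)
  path = File.join(dir, "#{namespace}.yml")
  current = File.exist?(path) ? (YAML.safe_load(File.read(path)) || {}) : {}
  # Pass the mode at creation time so a new file is never briefly world-readable
  File.write(path, current.merge(key => value).to_yaml, perm: 0o600)
  File.chmod(0o600, path) # also tighten a file that already existed
  path
end

Dir.mktmpdir do |root|
  path = set_secret(File.join(root, 'secrets'), 'pinboard', 'access_token', 'username:foo123')
  puts File.read(path)
  puts format('%o', File.stat(path).mode & 0o777)
end
```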

Using the secret manager

# Save a secret under the 'pinboard' namespace
$ chronicle-etl secrets:set pinboard access_token username:foo123

# Set a secret using stdin
$ echo -n "username:foo123" | chronicle-etl secrets:set pinboard access_token

# List available secrets
$ chronicle-etl secrets:list

# Use 'pinboard' secrets in the pinboard extractor's options (happens automatically)
$ chronicle-etl -e pinboard --since 1mo

# Use a custom secrets namespace
$ chronicle-etl secrets:set pinboard-alt access_token different-username:foo123
$ chronicle-etl -e pinboard --extractor-opts secrets:pinboard-alt --since 1mo

# Remove a secret
$ chronicle-etl secrets:unset pinboard access_token

Roadmap

Keep tackling new plugins. See: Chronicle Plugin Tracker
Add support for incremental extractions (#37)
Improve stdin extractor and shell command transformer so that users can easily integrate their own scripts/languages/tools into jobs (#5)
Add documentation for Chronicle Schema. It's found throughout this project but never explained.

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

Additional development commands

# run tests
bundle exec rake spec

# generate docs
bundle exec rake yard

# use Guard to run specs automatically
bundle exec guard

Get in touch

@hyfen on Twitter
@hyfen on GitHub
Email: andrew@hyfen.net

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/chronicle-app/chronicle-etl. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.

License

The gem is available as open source under the terms of the MIT License.

Code of Conduct

Everyone interacting in the Chronicle::ETL project’s codebases, issue trackers, chat rooms and mailing lists is expected to follow the code of conduct.