
JSON Vectorizer
===============

.. image:: https://readthedocs.org/projects/jsonvectorizer/badge/?version=latest
   :target: http://jsonvectorizer.readthedocs.io

.. image:: https://img.shields.io/badge/License-MIT-blue.svg
   :target: ./LICENSE

.. sphinx-start

Overview
--------

This package contains tools for extracting vector representations of JSON documents for various machine learning applications. The implementation follows a scikit-learn-like interface and extracts features in three stages:

1. Learning the schema of, and collecting samples from, a set of JSON documents.
2. Pruning rarely observed or unwanted fields.
3. Fitting a vectorizer to each primitive data type to build a feature extractor for whole documents.

Notes
-----

Installation
------------

Install using:

.. code-block:: sh

    python setup.py install

Usage
-----

The following example shows how one can build vectorizers for JSON documents using the aforementioned three-stage process. You can further customize and fine-tune the parameters in each stage for your specific data set.

First we instantiate a JsonVectorizer object, and learn the schema of, and collect samples from, a set of JSON documents stored in a file (with one record per line):

.. code-block:: python

    import json
    from jsonvectorizer import JsonVectorizer, vectorizers
    from jsonvectorizer.utils import fopen

    # Load data
    docs = []
    with fopen('samples.json.gz') as f:
        for line in f:
            doc = json.loads(line)
            docs.append(doc)

    # Learn the schema of sample documents
    vectorizer = JsonVectorizer()
    vectorizer.extend(docs)

We then prune fields that are present in fewer than 1% of all observed samples, as well as those whose names start with an underscore:

.. code-block:: python

    vectorizer.prune(patterns=['^_'], min_f=0.01)
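Prune patterns are regular expressions matched against field names. As a quick, self-contained sketch of what ``['^_']`` would drop (the field names below are hypothetical, for illustration only):

.. code-block:: python

    import re

    # Hypothetical field names, for illustration only
    fields = ['_id', '_rev', 'name', 'age', 'address']

    # Fields matching any prune pattern would be dropped
    patterns = ['^_']
    pruned = [f for f in fields if any(re.search(p, f) for p in patterns)]
    kept = [f for f in fields if f not in pruned]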

Finally, we create a list of vectorizers for individual data types, and use them to build a vectorizer for JSON documents:

.. code-block:: python

    # Report booleans as is
    bool_vectorizer = {
        'type': 'boolean',
        'vectorizer': vectorizers.BoolVectorizer
    }

    # For numbers, use one-hot encoding with 10 bins
    number_vectorizer = {
        'type': 'number',
        'vectorizer': vectorizers.NumberVectorizer,
        'kwargs': {'n_bins': 10},
    }

    # For strings, use tokenization, ignoring sparse (<1%) tokens
    string_vectorizer = {
        'type': 'string',
        'vectorizer': vectorizers.StringVectorizer,
        'kwargs': {'min_df': 0.01}
    }

    # Build JSON vectorizer (named to avoid shadowing the
    # imported vectorizers module)
    all_vectorizers = [
        bool_vectorizer,
        number_vectorizer,
        string_vectorizer
    ]
    vectorizer.fit(vectorizers=all_vectorizers)
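To build intuition for the number encoding above, here is a minimal standard-library sketch of one-hot binning. This illustrates the general technique only, not ``NumberVectorizer``'s exact implementation; in practice the bin edges would be learned from the data rather than fixed by hand:

.. code-block:: python

    from bisect import bisect_right

    def one_hot_bin(value, edges):
        # len(edges) + 1 bins; exactly one position is set
        vec = [0] * (len(edges) + 1)
        vec[bisect_right(edges, value)] = 1
        return vec

    # Three hypothetical edges -> four bins
    edges = [10, 20, 30]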

The generated features can be inspected via the ``feature_names_`` property:

.. code-block:: python

    for i, feature_name in enumerate(vectorizer.feature_names_):
        print('{}: {}'.format(i, feature_name))

The constructed vectorizer can then compute feature vectors from any set of JSON documents, generating SciPy List of Lists (LIL) sparse matrices:

.. code-block:: python

    # Convert to CSR format for efficient row slicing
    X = vectorizer.transform(docs).tocsr()
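The LIL-to-CSR conversion matters when you slice rows from the result. Here is a self-contained SciPy sketch, with a small hand-built matrix standing in for the output of ``vectorizer.transform()``:

.. code-block:: python

    from scipy import sparse

    # Toy 3x4 feature matrix standing in for transform() output
    X_lil = sparse.lil_matrix((3, 4))
    X_lil[0, 1] = 1.0
    X_lil[2, 3] = 2.0

    # CSR supports efficient row slicing, e.g. for mini-batching
    X = X_lil.tocsr()
    first_row = X[0].toarray()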

Note that vectorizer objects are picklable, meaning they can be stored on disk and loaded later in a separate session:

.. code-block:: python

    import pickle

    # Saving
    with open('vectorizer.pkl', 'wb') as f:
        pickle.dump(vectorizer, f)

    # Loading
    with open('vectorizer.pkl', 'rb') as f:
        vectorizer = pickle.load(f)

To-Do
-----
