Quantipy / quantipy

Python for people data
MIT License
66 stars 14 forks source link

Quantipy

Python for people data

Quantipy is an open-source data processing, analysis and reporting software project that builds on the excellent pandas and numpy libraries. Aimed at people data, Quantipy offers support for native handling of special data types like multiple choice variables, statistical analysis using case or observation weights, DataFrame metadata and pretty data exports.

Key features

Contributors

Python 3 compatability

Efforts are underway to port Quantipy to Python 3 in a seperate repository.

Docs

View the documentation at readthedocs.org

Required libraries before installation

We recommend installing Anaconda for Python 2.7 which will provide most of the required libraries and an easy means of keeping them up-to-date over time.

Developing

Windows

Dependencies numpy and scipy are handled by conda. Create a virtual environment:

conda create -n envqp python=2.7 numpy==1.11.3 scipy==0.18.1

Install in editable mode:

pip install -r requirements_dev.txt

Linux

Dependencies numpy and scipy are handled in the installation.

Create a virtual environment:

conda create -n envqp python=2.7

Install in editable mode:

pip install -r requirements_dev.txt

5-minutes to Quantipy

Get started

Start a new folder called 'Quantipy-5' and add a subfolder called 'data'.

You can find an example dataset in quantipy/tests:

Put these files into your 'data' folder.

Start with some import statements:

import pandas as pd
import quantipy as qp

from quantipy.core.tools.dp.prep import frange

# This is a handy bit of pandas code to let you display your dataframes
# without having them split to fit a vertical column.
pd.set_option('display.expand_frame_repr', False)

Load, inspect and edit your data

Load the input files in a qp.DataSet instance and inspect the metadata with methods like .variables(), .meta() or .crosstab():

# Define the paths of your input files
path_json = './data/Example Data (A).json'
path_csv = './data/Example Data (A).csv'

dataset = qp.DataSet('Example Data (A)')
dataset.read_quantipy(path_json, path_csv)

dataset.crosstab('q2', text=True)
Question                                                           q2. Which, if any, of these other sports have you ever participated in?
Values                                                                                                                                   @
Question                                           Values
q2. Which, if any, of these other sports have y... All                                                         2999.0
                                                   Sky diving                                                  1127.0
                                                   Base jumping                                                1366.0
                                                   Mountain biking                                             1721.0
                                                   Kite boarding                                                649.0
                                                   Snowboarding                                                 458.0
                                                   Parachuting                                                  428.0
                                                   Other                                                        492.0
                                                   None of these                                                 53.0

Variables can be created, recoded or edited with DataSet methods, e.g. derive():

mapper = [(1,  'Any sports', {'q2': frange('1-6, 97')}),
          (98, 'None of these', {'q2': 98})]

dataset.derive('q2_rc', 'single', dataset.text('q2'), mapper)
dataset.meta('q2_rc')
single                                              codes          texts missing
q2_rc: Which, if any, of these other sports hav...
1                                                       1     Any sports    None
2                                                      98  None of these    None

The DataSet case data component can be inspected with the []-indexer, as known from a pd.DataFrame:


dataset[['q2', 'q2_rc']].head(5)
        q2  q2_rc
0  1;2;3;5;    1.0
1      3;6;    1.0
2       NaN    NaN
3       NaN    NaN
4       NaN    NaN

Analyse and create aggregations batchwise

A qp.Batch as a subclass of qp.DataSet is a container for structuring data analysis and aggregation specifications:

batch = dataset.add_batch('batch1')
batch.add_x(['q2', 'q2b', 'q5'])
batch.add_y(['gender', 'q2_rc'])

The batch definitions are stored in dataset._meta['sets']['batches']['batch1']. A qp.Stack can be created and populated based on the available qp.Batch definitions stored in the qp.DataSet:

stack = dataset.populate()
stack.describe()
                data     filter     x       y  view  #
0   Example Data (A)  no_filter   q2b       @   NaN  1
1   Example Data (A)  no_filter   q2b   q2_rc   NaN  1
2   Example Data (A)  no_filter   q2b  gender   NaN  1
3   Example Data (A)  no_filter    q2       @   NaN  1
4   Example Data (A)  no_filter    q2   q2_rc   NaN  1
5   Example Data (A)  no_filter    q2  gender   NaN  1
6   Example Data (A)  no_filter    q5       @   NaN  1
7   Example Data (A)  no_filter  q5_3       @   NaN  1
8   Example Data (A)  no_filter  q5_3   q2_rc   NaN  1
9   Example Data (A)  no_filter  q5_3  gender   NaN  1
10  Example Data (A)  no_filter  q5_2       @   NaN  1
11  Example Data (A)  no_filter  q5_2   q2_rc   NaN  1
12  Example Data (A)  no_filter  q5_2  gender   NaN  1
13  Example Data (A)  no_filter  q5_1       @   NaN  1
14  Example Data (A)  no_filter  q5_1   q2_rc   NaN  1
15  Example Data (A)  no_filter  q5_1  gender   NaN  1
16  Example Data (A)  no_filter  q5_6       @   NaN  1
17  Example Data (A)  no_filter  q5_6   q2_rc   NaN  1
18  Example Data (A)  no_filter  q5_6  gender   NaN  1
19  Example Data (A)  no_filter  q5_5       @   NaN  1
20  Example Data (A)  no_filter  q5_5   q2_rc   NaN  1
21  Example Data (A)  no_filter  q5_5  gender   NaN  1
22  Example Data (A)  no_filter  q5_4       @   NaN  1
23  Example Data (A)  no_filter  q5_4   q2_rc   NaN  1
24  Example Data (A)  no_filter  q5_4  gender   NaN  1

Each of the defintions is a qp.Link. These can be e.g. analyzed in various ways, e.g. grouped categories can be calculated using the engine qp.Quantity:

link = stack[dataset.name]['no_filter']['q2']['q2_rc']
q = qp.Quantity(link)
q.group(frange('1-6, 97'), axis='x', expand='after')
q.count()
Question          q2_rc
Values              All       1    98
Question Values
q2       All     2999.0  2946.0  53.0
         net     2946.0  2946.0   0.0
         1       1127.0  1127.0   0.0
         2       1366.0  1366.0   0.0
         3       1721.0  1721.0   0.0
         4        649.0   649.0   0.0
         5        458.0   458.0   0.0
         6        428.0   428.0   0.0
         97       492.0   492.0   0.0

We can also simply add so called qp.Views to the whole of the qp.Stack:

stack.aggregate(['counts', 'c%'], False, verbose=False)
stack.add_stats('q2b', stats=['mean'], rescale={1: 100, 2:50, 3:0}, verbose=False)

stack.describe('view', 'x')
x                                q2  q2b   q5  q5_1  q5_2  q5_3  q5_4  q5_5  q5_6
view
x|d.mean|x[{100,50,0}]:|||stat  NaN  3.0  NaN   NaN   NaN   NaN   NaN   NaN   NaN
x|f|:|y||c%                     3.0  3.0  1.0   3.0   3.0   3.0   3.0   3.0   3.0
x|f|:|||counts                  3.0  3.0  1.0   3.0   3.0   3.0   3.0   3.0   3.0
link = stack[dataset.name]['no_filter']['q2b']['q2_rc']
link['x|d.mean|x[{100,50,0}]:|||stat']
Question             q2_rc
Values                  1          98
Question Values
q2b      mean    52.354167  43.421053

More examples

There is so much more you can do with Quantipy... why don't you explore the docs to find out!