Update API to focus on variable relationships

The most recent revision attempts to make variable relationships clear and obvious from the syntax. A nice consequence of this revision is that the conceptual differences between Tisane and existing software tools are more apparent.

Variables

An end-user expresses variables according to their data type. If the end-user later provides the data, the variable names should be the column names. For nominal or ordinal data, end-users also must specify the cardinality of variables if they do not intend to provide data. If end-users provide data, cardinality information is not required. In this case, Tisane will calculate and populate these fields internally.

Variables are observed values of a measure. Variables can be measures of interest, as in dependent and independent variables. Variables can also be id numbers that act as keys to a dataframe (e.g., participant id).

import tisane as ts

# Example 1: 
hw = ts.Numeric('Homework') # 'Homework' is the column name
race = ts.Nominal('Race', cardinality=5) # there are 5 groups/options for the variable race
math = ts.Numeric('MathAchievement') 
mean_ses = ts.Numeric('Mean_SES')
student = ts.Nominal('student id', cardinality=100) # IDs for the 100 students included in this study
school = ts.Nominal('school', cardinality=10) # IDs for the 10 schools, 10 students/school

# Example 2: 
leaf_length = ts.Numeric('length')
fertilizer = ts.Nominal('fertilizer condition', cardinality=2)
season = ts.Nominal('season', cardinality=4)
plant = ts.Nominal('plant id') 
bed = ts.Nominal('plant bed')
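When data is provided, the cardinality of a nominal or ordinal variable is simply the number of distinct values in its column. The sketch below illustrates that idea in plain Python; `infer_cardinality` and the sample rows are hypothetical and are not part of Tisane's API, which may compute this differently.

```python
# Hypothetical helper, not Tisane's implementation:
# count the distinct non-missing values in one column.
def infer_cardinality(rows, column):
    """Return the number of distinct non-missing values in `column`."""
    return len({row[column] for row in rows if row[column] is not None})


# Made-up data; the keys play the role of dataframe column names.
rows = [
    {"fertilizer condition": "treated", "season": "spring"},
    {"fertilizer condition": "control", "season": "summer"},
    {"fertilizer condition": "treated", "season": "fall"},
    {"fertilizer condition": "control", "season": "winter"},
]

fertilizer_cardinality = infer_cardinality(rows, "fertilizer condition")
season_cardinality = infer_cardinality(rows, "season")
```

With these rows, the counts match the cardinalities declared by hand in Example 2 (2 fertilizer conditions, 4 seasons).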

An end-user expresses relationships between variables that are related to domain theory (conceptual models) and data measurements.

Conceptual Relationships

There are two types of conceptual relationships: cause and associates_with

# Example 1
hw.cause(math) # Hours spent on homework causes math achievement. 
race.associates_with(math) # Math scores and race are associated with each other. 

# Example 2
fertilizer.cause(leaf_length) # Fertilizer causes leaf growth

Definitions:

cause: The LHS variable causes the RHS variable. The RHS variable cannot also cause the LHS variable.
associates_with: The LHS and RHS variables are associated/related in some way that is not causal.

Tisane provides an alias for each: causes is an alias for cause, and associate_with is an alias for associates_with.
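To make the two definitions concrete, here is a minimal sketch of a store that treats cause as directed and rejects the reverse edge, while treating associates_with as unordered. The class `ConceptualGraph` is hypothetical and only illustrates the semantics described above; it is not how Tisane stores relationships internally, and the symmetry of associations is an assumption.

```python
class ConceptualGraph:
    """Hypothetical stand-in for a store of conceptual relationships."""

    def __init__(self):
        self.causes = set()        # directed (cause, effect) pairs
        self.associations = set()  # unordered pairs, stored as frozensets

    def cause(self, lhs, rhs):
        # The RHS variable cannot also cause the LHS variable.
        if (rhs, lhs) in self.causes:
            raise ValueError(f"{rhs!r} already causes {lhs!r}")
        self.causes.add((lhs, rhs))

    def associates_with(self, lhs, rhs):
        # Assumption: association is symmetric, so order does not matter.
        self.associations.add(frozenset((lhs, rhs)))


graph = ConceptualGraph()
graph.cause("Homework", "MathAchievement")
graph.associates_with("Race", "MathAchievement")
graph.associates_with("MathAchievement", "Race")  # same association, stored once

try:
    graph.cause("MathAchievement", "Homework")  # reverse of an existing cause
    reverse_rejected = False
except ValueError:
    reverse_rejected = True
```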

Data measurement relationships

There are three types of data measurement relationships: (1) measurement attribution, (2) treatment for experiments, and (3) data hierarchies.

Measurement attribution

# Example 1: 
student.has(hw)
student.has(race)
student.has(math)
school.has(mean_ses)

# Example 2: 
plant.has(leaf_length)

Definition:

has distinguishes "levels" of observations by attributing variables to each level. In Example 1, there are two levels: student and school. Each student has a value for homework, race, and math. Each school has a value for mean_ses.

Idea: Create a separate Data type for "ID" and enforce that only variables of type "ID" can have other variables.
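One way the ID idea could look is sketched below: only the ID type defines has, so calling it on a measure fails immediately. All classes here are hypothetical, self-contained stand-ins written for illustration; they are not Tisane's existing types.

```python
class Variable:
    """Hypothetical base class for all variables."""

    def __init__(self, name):
        self.name = name


class ID(Variable):
    """Hypothetical type for unit identifiers such as student or school."""

    def __init__(self, name):
        super().__init__(name)
        self.measures = []

    def has(self, measure):
        # Attribute a measure to this level of observation.
        self.measures.append(measure)


class Numeric(Variable):
    """Hypothetical measure type; deliberately has no has() method."""


student = ID("student id")
hw = Numeric("Homework")
student.has(hw)

# Numeric defines no has(), so a measure cannot own other variables.
numeric_can_have = hasattr(hw, "has")
```

Restricting has to one type would turn a conceptual mistake (for example, hw.has(student)) into an AttributeError at the call site rather than a silent modeling error.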

Treatment

End-users can express experimental treatments/manipulations.

# Example 2: 
fertilizer.treats(bed)

Only Example 2 is an experiment. Each bed is treated with a fertilizer. In other words, fertilizer is a bed-level manipulation.

Definition:

treats expresses the explicit/intentional manipulation of variables in an experiment. X.treats(Y) is internally equivalent to Y.has(X), which means that each Y has an observation for X.

Idea: Check that the LHS variable of treats has a causal relationship (in the graph) with the DV? And keep treats and has different from one another.
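The relationship between treats and has can be sketched as a single edge store in which X.treats(Y) records the same (Y, X) edge as Y.has(X) but keeps a label marking it as a manipulation. The `edges` dictionary and both functions are hypothetical and exist only to illustrate the definition; Tisane's internal representation may differ.

```python
# Hypothetical edge store: (unit, variable) -> relationship label.
edges = {}


def has(unit, variable):
    """Record that each `unit` has an observation for `variable`."""
    edges.setdefault((unit, variable), "has")


def treats(treatment, unit):
    """Record the same edge as has(unit, treatment), labelled as a manipulation."""
    edges[(unit, treatment)] = "treats"


treats("fertilizer condition", "plant bed")  # fertilizer.treats(bed)
has("plant", "length")                       # plant.has(leaf_length)
```

Keeping the label is what would let later checks tell an intentional manipulation apart from a plain measurement attribution.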

Data hierarchies

Data can be clustered or nested. Tisane provides support for expressing two possible sources of clustering: (1) repeated measures and (2) nested relationships.

# Example 1 
student.nest_under(school) # Students belong to a school. Students within the same school may be more similar to each other than to students in other schools.

# Example 2 
plant.nest_under(bed) # Plants belong in plant beds. 
plant.repeats(measure=leaf_length, repetitions=season) # Repeatedly measure the same plant once per season

Definitions:

nest_under nests one variable under another.
repeats means the LHS variable provides multiple values of the measure. Each value is enumerated/indexed by the repetitions variable (e.g., season). If a plant provides multiple measures per season, another column for indexing each measure is required.
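The data layout these two relationships imply can be checked directly. The sketch below uses made-up long-format rows (one row per plant per season) and plain Python to verify that each plant sits in exactly one bed and that season uniquely indexes each plant's measure. The rows and variable names are assumptions for illustration; this is not a check that Tisane is known to perform.

```python
from collections import Counter, defaultdict

# Made-up rows in long format: one row per plant per season.
rows = [
    {"plant id": 1, "plant bed": "A", "season": "spring", "length": 2.1},
    {"plant id": 1, "plant bed": "A", "season": "summer", "length": 3.4},
    {"plant id": 2, "plant bed": "B", "season": "spring", "length": 1.9},
    {"plant id": 2, "plant bed": "B", "season": "summer", "length": 2.8},
]

# Nesting: every plant should appear in exactly one bed.
beds_per_plant = defaultdict(set)
for row in rows:
    beds_per_plant[row["plant id"]].add(row["plant bed"])
is_nested = all(len(beds) == 1 for beds in beds_per_plant.values())

# Repeats: season indexes the measure only if each (plant, season) pair is unique.
pair_counts = Counter((row["plant id"], row["season"]) for row in rows)
season_indexes_measure = all(n == 1 for n in pair_counts.values())
```

If a plant had two rows for the same season, `season_indexes_measure` would be False, which is the case where an additional indexing column is needed.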
