ballet / predict-life-outcomes

Collaborating to solve the Fragile Families Challenge using the Ballet framework

Join the chat at https://gitter.im/ballet-project/fragile-families

Fragile Families Collaboration

This is a collaborative predictive modeling project built on the ballet framework.

The Fragile Families Challenge (FFC) is a recent attempt to better connect the social science research community to new tools in data science and machine learning. The challenge aimed to spur the development of predictive models for life outcomes from data collected as part of the Fragile Families and Child Wellbeing Study (FFCWS), which collects detailed longitudinal records on a set of disadvantaged children and their families. Organizers released anonymized and merged data on a set of 4,242 families, with data collected from the birth of the child until age 9. Participants in the challenge were then tasked with predicting six life outcomes of the child or family at age 15: child grade point average, child grit, household eviction, household material hardship, primary caregiver layoff, and primary caregiver participation in job training. The FFC was run over a four-month period in 2017 and received 160 submissions from social scientists, machine learning practitioners, students, and others.

Participants in the FFC competed against each other to produce the best-performing models, at the expense of collaboration across teams. In this project, we ask: by collaborating rather than competing, can we develop more impactful solutions to the FFC?

Your task is to create and submit feature definitions to our shared project that help us in predicting these key life outcomes.

Join the collaboration

Are you interested in joining the collaboration?

  1. Apply for access to the dataset and then register yourself with us.
  2. Read/skim the Ballet Contributor Guide.
  3. Read/skim the Ballet Feature Engineering Guide.
  4. Learn more about the Fragile Families dataset.
    1. Read/skim the data documentation.
    2. Skim additional resources.
  5. Browse the currently accepted features in the contributed features directory (src/fragile_families/features/contrib).
  6. Launch an interactive Jupyter Lab session to hack on this repository.

Data access

The data underlying the Fragile Families Challenge, which we are using in this collaboration, is sensitive and requires registration to access.

If you are already authorized to access the data, you can look over Data Documentation below.

Apply for access and registration

You must apply to Princeton's Office of Population Research (OPR) for access to the Fragile Families Challenge dataset.

:envelope: Follow the instructions here to apply for access.

The Fragile Families Challenge dataset contains sensitive information. You should keep the dataset secure, protect the privacy of the individuals it describes, and abide by the data access agreement, which requires that you not share your copy of the dataset.

Once you have been granted access to the data from Princeton OPR (or if you already had access to the data from prior research), you must register with us to join the collaboration. (This is step 7 in the instructions above, so don't repeat it if you already filled out the form.)

:raised_hand: Register here!

Authentication

Your AWS access key ID/secret will be automatically detected from standard locations (such as environment variables or credentials files).

If you are working in a notebook without access to other methods of configuration (such as when using Assemblé), you can do the following in a code cell:

```python
import os
os.environ['AWS_ACCESS_KEY_ID'] = 'your access key id'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'your secret access key'
```

Alternatively, if you are working locally, you can create a new AWS profile in ~/.aws/credentials:

```ini
[bff]
aws_access_key_id = your-access-key-id
aws_secret_access_key = your-secret-access-key
```

Then you can use this profile when developing features for this project by exporting the environment variable AWS_PROFILE=bff (or by using the os.environ approach shown above).
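
For example, you can select the bff profile from within Python before any data loading happens (this mirrors the export above):

```python
import os

# Point AWS clients at the 'bff' profile defined in ~/.aws/credentials
os.environ['AWS_PROFILE'] = 'bff'
```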

Data documentation

The full challenge dataset contains a "background" table of 4,242 rows (one per child in the training set) and 12,942 columns.

Train split

The "train" split contains 2,121 rows (half of the background set) and 7 additional columns. Six of these are the outcome variables that we are trying to predict.

:bulb: For the purpose of validating feature contributions, we will focus on the materialHardship prediction problem. However, we want our feature definitions to be useful for all six prediction problems.

You can load the train split as follows:

```python
from ballet import b
X_df, y_df = b.api.load_data()
```
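
Since validation focuses on materialHardship (see above), the single target can then be selected from y_df with ordinary pandas indexing. A sketch with stand-in data, since the real data requires access credentials:

```python
import pandas as pd

# Stand-in for the real y_df: one column per outcome (toy values only)
y_df = pd.DataFrame({
    'gpa': [2.5, 3.0, 2.8],
    'materialHardship': [0.1, 0.4, 0.2],
})

# Select the single target used for feature validation
y = y_df['materialHardship']
```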

Leaderboard and test splits

The other half of the rows are reserved for the "leaderboard" and "test" splits. We will use the leaderboard split to validate feature contributions. We will not look at the test split until the end of the collaboration.

If you'd like, you can load the full "background" dataset, which includes the rows from the train, leaderboard, and test splits combined but excludes the target columns.

```python
from fragile_families.load_data import load_background
background_df = load_background()
```

Background variables

(This section is adapted from here)

To use the data, it may be useful to know something about what each variable (column) represents. (See also the full documentation.)

Waves and child ages

The background variables were collected in 5 waves.

Note that wave numbers are not the same as child ages. The variable names and survey documentation are organized by wave number.
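
For reference, the FFCWS waves correspond roughly to the following child ages (wave 1 is the baseline interview around the child's birth; confirm against the official documentation):

```python
# Approximate child age (in years) at each FFCWS survey wave
WAVE_TO_CHILD_AGE = {1: 0, 2: 1, 3: 3, 4: 5, 5: 9}
```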

Variable naming conventions

Predictor variables are identified by a prefix and a question number. The prefix identifies the survey in which a question was collected, which is useful because the documentation is organized by survey. For instance, the variable m1a4 refers to the mother interview in wave 1, question a4.

  1. The prefix c in front of any variable indicates a variable constructed from other responses. For instance, cm4b_age is constructed from the mother wave 4 interview and captures the child's (baby's) age.
  2. m1, m2, m3, m4, m5: Questions asked of the child's mother in wave 1 through wave 5.
  3. f1, f2, f3, f4, f5: Questions asked of the child's father in wave 1 through wave 5.
  4. hv3, hv4, hv5: Questions asked in the home visit in waves 3, 4, and 5.
  5. p5: Questions asked of the primary caregiver in wave 5.
  6. k5: Questions asked of the child (kid) in wave 5.
  7. ffcc: Questions asked in various child care provider surveys in wave 3.
  8. kind: Questions asked of the kindergarten teacher in wave 4.
  9. t5: Questions asked of the teacher in wave 5.
  10. n5: Questions asked of the non-parental caregiver in wave 5.
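
The conventions above are regular enough to parse mechanically. Here is a hypothetical helper (not part of the project) that splits a variable name into its components:

```python
import re

# Optional 'c' (constructed variable), then a survey prefix (longest
# alternatives first so 'kind' is not read as 'k'), an optional wave
# digit, and the question identifier.
VARIABLE_RE = re.compile(
    r'^(?P<constructed>c?)'
    r'(?P<survey>ffcc|kind|hv|m|f|p|k|t|n)'
    r'(?P<wave>[1-5]?)'
    r'(?P<question>.*)$'
)

def parse_variable(name):
    """Split a name like 'm1a4' into survey, wave, and question parts."""
    match = VARIABLE_RE.match(name)
    if match is None:
        raise ValueError(f'unrecognized variable name: {name}')
    parts = match.groupdict()
    parts['constructed'] = bool(parts['constructed'])
    parts['wave'] = int(parts['wave']) if parts['wave'] else None
    return parts
```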

Full codebook

We expose the full machine-readable codebook, which you can use during feature development.

```python
from fragile_families.load_data import load_codebook
codebook_df = load_codebook()
```

Metadata search

We also wrap the ffmetadata API for our own use in feature development. The metadata API returns more detailed metadata than is available in the codebook. See here for details on the filter operations and see here for an explanation of the resulting metadata.

```python
import fragile_families.analysis.metadata as metadata
metadata.info('m1a4')
metadata.search({'name': 'label', 'op': 'like', 'val': '%school%'})
# can use metadata.searchinfo to combine the two methods
```

Metadata search changes

The metadata search shows results from the most up-to-date metadata available. In some cases, this reflects changes since the 2017 challenge, so variables that appear in metadata search may not appear in the dataset, and vice versa. When a variable has been renamed, its old_name attribute is set to the previous name.

For example, kind_a2 was renamed to t4a2.

If the metadata.info method receives an error from the metadata API because a variable is missing, it will automatically retry by searching for a variable whose old_name matches and then getting info for that variable. You can disable this behavior with retry_with_old_name=False.
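
The retry behavior can be sketched as follows (an illustrative stub with a toy metadata store, not the actual implementation in fragile_families):

```python
# Toy metadata store: kind_a2 was renamed to t4a2, so only the new name
# appears as a key, with old_name recording the previous name.
METADATA = {
    't4a2': {'name': 't4a2', 'old_name': 'kind_a2'},
}

def info(name, retry_with_old_name=True):
    """Look up metadata, retrying via old_name when the name is missing."""
    if name in METADATA:
        return METADATA[name]
    if retry_with_old_name:
        for meta in METADATA.values():
            if meta.get('old_name') == name:
                return meta
    raise KeyError(name)
```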

Feature development

Feature development partitions

A feature development partition describes a set of inputs for a data scientist to focus on in engineering features in this project. For example, the set of all questions asked during Wave 1 of the survey is a partition.
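
Concretely, a partition can be thought of as a predicate over column names. A toy sketch (the column names here are made up; the real partitions are tracked in issues):

```python
# Toy column list and a Wave 1 partition: questions asked of the mother
# or father in wave 1 (m1/f1 prefixes)
columns = ['m1a4', 'f1b2', 'm2c3', 'hv3a1', 'cm4b_age']

def in_wave_1(name):
    """True if the variable comes from a Wave 1 survey."""
    return name.startswith(('m1', 'f1'))

wave_1_partition = [c for c in columns if in_wave_1(c)]
```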

If you'd like to focus your effort in feature development, check out the existing partitions, which are tracked in issues under the feature-partition label. Comment on the issue with the response "me" to "claim" it. It's okay for multiple people to claim one partition, but in that case, make sure you stay in touch directly or via the project chat, or follow each other's accepted (and rejected) feature contributions.

If you'd like to suggest a new partition, see #31.

Feature validation

In this project, feature contributions are validated to ensure that they positively contribute to our shared feature engineering pipeline. One part of this validation is "feature acceptance" validation: does the performance of our ML pipeline improve when the new feature is added? We run each feature through two feature accepters, the MutualInformationAccepter and the VarianceThresholdAccepter. Based on the parameters set in our ballet.yml configuration file, a feature definition is accepted only if it satisfies both accepters.
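
To make the two checks concrete, here is an illustrative sketch of variance and mutual-information acceptance for discrete feature values. This is not ballet's implementation, and the thresholds are made up:

```python
import math
from collections import Counter

def variance(values):
    """Population variance of a sequence of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def accept(feature_values, target, var_threshold=0.05, mi_threshold=0.01):
    """Accept a feature only if it passes both checks."""
    return (variance(feature_values) > var_threshold
            and mutual_information(feature_values, target) > mi_threshold)
```

A constant feature fails the variance check, and a feature carrying no information about the target fails the mutual-information check.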

Discussion and help

Want to chat about the project, compare ideas, or debug features with other collaborators? Join one of our chat rooms.

If you think a question might have been answered before, check out the Ballet FAQ.

If you think you found a bug with Ballet, please open an issue and mention that you are working on the predict-life-outcomes project.

Additional resources