Image-based Profiling Recipe

Description

The profiling-recipe is a collection of scripts that primarily use pycytominer functions to process single-cell morphological profiles. The three scripts (profiling_pipeline.py, profile.py, and utils.py) live in the profiles directory of the repository; profiling_pipeline.py is the entry point that runs the workflow described below.

Getting Started

Requirements

Anaconda

We use Anaconda as our package manager. Install Miniconda following the instructions here.

AWS CLI

We use AWS S3 storage tracked by DVC for large file management. Install AWS CLI following the instructions here. After installation, configure the AWS CLI with

aws configure

It will prompt you for:
AWS Access Key ID: YOUR-KEY
AWS Secret Access Key: YOUR-SECRET-KEY
Default region name: e.g. us-east-1
Default output format: json

Note that the AWS profile you enter must have permission to upload to the bucket that you specify later.

If you do not want to store/version large files on AWS, you can skip AWS CLI installation.
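
To verify that the configured credentials can reach your storage, a simple listing with the standard AWS CLI works; the bucket name below is a placeholder:

# Replace <bucket> with the S3 bucket you will point DVC at later
aws s3 ls s3://<bucket>/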

System requirements

If the profiling pipeline is used for aggregating the single-cell profiles, we recommend running the pipeline on a system with memory at least twice the size of the .sqlite files. If the pipeline will be used only for running steps that are downstream of aggregation, then it can be run on a local machine.
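
As a rough check, compare the size of the .sqlite files with the available memory; this sketch assumes the project layout created in the next sections:

# Largest .sqlite files in the backend folder (layout described below)
du -h ~/work/projects/${PROJECT_NAME}/workspace/backend/*/*/*.sqlite | sort -h | tail
# Available memory (Linux)
free -h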

Creating the folder structure

The pipeline requires a particular folder structure which can be created as follows

PROJECT_NAME="INSERT-PROJECT-NAME"
mkdir -p ~/work/projects/${PROJECT_NAME}/workspace/{backend,software}

Downloading the data

Download the .sqlite file (which contains the single-cell profiles) and the .csv file (which contains the well-level aggregated profiles) for all the plates in each batch to the backend folder. These two files are created by running the commands in chapter 5.3 of the profiling handbook. After downloading the files, the folder structure should look as follows

backend
├── batch1
│   ├── plate1
│   │   ├── plate1.csv
│   │   └── plate1.sqlite
│   └── plate2
│       ├── plate2.csv
│       └── plate2.sqlite
└── batch2
    ├── plate1
    │   ├── plate1.csv
    │   └── plate1.sqlite
    └── plate2
        ├── plate2.csv
        └── plate2.sqlite

Note: Aggregation may not have been performed already. In that case, only the .sqlite file will be available for download.
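
If the backend files were uploaded to S3 by the profiling handbook workflow, one way to fetch a batch is with aws s3 sync; the bucket and prefix below are assumptions and should be adjusted to wherever your backend files actually live:

BATCH="INSERT-BATCH-NAME"
cd ~/work/projects/${PROJECT_NAME}/workspace
# Assumed S3 location; change the bucket and prefix to match your project
aws s3 sync s3://<bucket>/projects/${PROJECT_NAME}/workspace/backend/${BATCH} backend/${BATCH}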

Welding to the data repository (Recommended)

Welding is a process by which the profiling-recipe repository is added to the data repository as a submodule, so that the files in the data repository and the scripts in the profiling-recipe that generated those files are versioned together. Instructions for welding are provided here. We highly recommend welding the profiling-recipe to the data repository. After welding, clone the data repository to the software folder using the following commands

cd ~/work/projects/${PROJECT_NAME}/workspace/software
git clone <location of the data repository>.git
cd <name of the data repository>
git submodule update --init --recursive

After cloning, the folder structure should look as follows

software
└── data_repo
    ├── LICENSE
    ├── README.md
    └── profiling-recipe
        ├── LICENSE
        ├── README.md
        ├── config_template.yml
        ├── environment.yml
        ├── profiles
        │   ├── profile.py
        │   ├── profiling_pipeline.py
        │   └── utils.py
        └── scripts
            ├── create_dirs.sh
            └── csv2gz.py
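
To confirm that the profiling-recipe submodule was initialized correctly, the standard git submodule status command can be used:

# A leading '-' in the output means the submodule has not been initialized yet
git submodule status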

Running the recipe without welding

It is possible to run the pipeline without welding the profiling-recipe repository to the data repository. In that case, clone the profiling-recipe repository as a subdirectory of a data directory, so that the folder structure looks as follows.

software
└── data_directory
    ├── LICENSE
    ├── README.md
    └── profiling-recipe
        ├── LICENSE
        ├── README.md
        ├── config_template.yml
        ├── environment.yml
        ├── profiles
        │   ├── profile.py
        │   ├── profiling_pipeline.py
        │   └── utils.py
        └── scripts
            ├── create_dirs.sh
            └── csv2gz.py

Downloading load_data_csv

Generating the summary statistics table requires load_data_csv files. These files should be downloaded to the load_data_csv directory. Make sure the files are gzipped and that the folder structure looks as follows.

load_data_csv/
├── batch1
│   ├── plate1
│   │   ├── load_data.csv.gz
│   │   └── load_data_with_illum.csv.gz
│   └── plate2
│       ├── load_data.csv.gz
│       └── load_data_with_illum.csv.gz
└── batch2
    ├── plate1
    │   ├── load_data.csv.gz
    │   └── load_data_with_illum.csv.gz
    └── plate2
        ├── load_data.csv.gz
        └── load_data_with_illum.csv.gz
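
As a sketch, assuming the load_data_csv files also live under the project's S3 prefix, they can be downloaded and gzipped like this (the S3 path is an assumption):

BATCH="INSERT-BATCH-NAME"
aws s3 sync s3://<bucket>/projects/${PROJECT_NAME}/workspace/load_data_csv/${BATCH} load_data_csv/${BATCH}
# Compress any CSVs that are not gzipped yet
find load_data_csv/${BATCH} -type f -name "*.csv" -exec gzip {} \;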

Running the pipeline

The pipeline should be run from the data repository or data directory.

DATA="INSERT-NAME-OF-DATA-REPO-OR-DIR"
cd ~/work/projects/${PROJECT_NAME}/workspace/software/${DATA}/

Setting up the conda environment

Copy the environment.yml file

The environment.yml file contains the list of conda packages that are necessary for running the pipeline.

cp profiling-recipe/environment.yml .

Create the environment

conda env create --force --file environment.yml

Activate the conda environment

conda activate profiling

Setting up DVC

Initialize DVC for this project and set it to store large files in S3. Skip this step if not using DVC. If you would like to use DVC for a remote storage location that is not S3, find instructions here. If you have multiple AWS profiles on your machine and do not want to use the default one for DVC, you can specify which profile to use by running dvc remote modify S3storage profile PROFILE_NAME at any point between adding the remote and performing the final DVC push.

# Navigate
cd ~/work/projects/${PROJECT_NAME}/workspace/software/<data_repo>
# Initialize DVC
dvc init
# Set up remote storage
dvc remote add -d S3storage s3://<bucket>/projects/${PROJECT_NAME}/workspace/software/<data_repo>_DVC
# Commit new files to git
git add .dvc/.gitignore .dvc/config
git commit -m "Setup DVC"
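
As mentioned above, if you do not want DVC to use your default AWS profile, tell it which profile to use for the S3storage remote (PROFILE_NAME is a placeholder):

dvc remote modify S3storage profile PROFILE_NAME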

Create the directories

The directories that will contain the output of the pipeline are created as follows

profiling-recipe/scripts/create_dirs.sh
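
The exact directories are created by the script, but based on the paths used later in this README they should include at least the following (this list is an inference from the rest of the document, not the script's definitive output):

# Approximate equivalent of what create_dirs.sh sets up, inferred from the paths used below
mkdir -p config_files load_data_csv metadata/external_metadata metadata/platemaps profiles gct quality_control/summary quality_control/heatmap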

Metadata, platemap and barcode_platemap files

The pipeline requires two files to run: barcode_platemap.csv, which maps each plate to its plate map, and platemap.txt, the plate map itself, which describes the contents of each well. An optional external_metadata.tsv can also be provided for additional annotation.

The following is an example of the barcode_platemap.csv file

Assay_Plate_Barcode,Plate_Map_Name
plate1,platemap
plate2,platemap

Here is an example plate map file and here is an example external metadata file.

These files should be added to the appropriate folder so that the folder structure looks as below

metadata
├── external_metadata
│   └── external_metadata.tsv
└── platemaps
    ├── batch1
    │   ├── barcode_platemap.csv
    │   └── platemap
    │       └── platemap.txt
    └── batch2
        ├── barcode_platemap.csv
        └── platemap
            └── platemap.txt

Copy the config file

CONFIG_FILE="INSERT-CONFIG-FILE-NAME"
cd ~/work/projects/${PROJECT_NAME}/workspace/software/${DATA}/
cp profiling-recipe/config_template.yml config_files/${CONFIG_FILE}.yml

The config file contains all the parameters that various pycytominer functions, called by the profiling pipeline, require. To run the profiling pipeline with different parameters, multiple config files can be created. Each parameter in the config file is described below. All the necessary changes to the config file must be made before the pipeline can be run.

Copy aggregated profile

If the first step of the profiling pipeline, aggregate, has already been performed (that is, the backend folder contains a .csv file in addition to the .sqlite file), then the .csv file has to be copied to the data repository or data directory. If not, skip to Running the profiling pipeline.

Run the following commands for each batch separately. These commands create a folder for each batch, compress the .csv files, and then copy them to the data repository or data directory.

BATCH="INSERT-BATCH-NAME"
mkdir -p profiles/${BATCH}
find ../../backend/${BATCH}/ -type f -name "*.csv" -exec profiling-recipe/scripts/csv2gz.py {} \;
rsync -arzv --include="*/" --include="*.gz" --exclude "*" ../../backend/${BATCH}/ profiles/${BATCH}/

Running the profiling pipeline

After making the necessary changes to the config.yml file, run the profiling pipeline as follows

python profiling-recipe/profiles/profiling_pipeline.py  --config config_files/${CONFIG_FILE}.yml

If there are multiple config files, each one of them can be run one after the other using the above command.
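
For example, a minimal sketch for running several config files back to back (the config file names here are placeholders):

for CONFIG_FILE in config_batch1 config_batch2; do
    python profiling-recipe/profiles/profiling_pipeline.py --config config_files/${CONFIG_FILE}.yml
done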

Note: Each step in the profiling pipeline uses the output of the previous step as its input. Therefore, make sure that all the necessary input files have been generated before running a given step. It is possible to run only some of the steps by keeping only those steps in the config file.

Push the profiles to GitHub

If using a data repository, push the newly created profiles to DVC and the .dvc files and other files to GitHub as follows

dvc add profiles/${BATCH} --recursive
dvc push
git add profiles/${BATCH}/*/*.dvc profiles/${BATCH}/*/*.gitignore
git commit -m 'add profiles'
git add *
git commit -m 'add files made in profiling'
git push
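
Before pushing, it can help to confirm what DVC and git are about to track; these are standard status commands, not specific to this recipe:

# Data tracked by DVC that has not been pushed to the remote yet
dvc status --cloud
# Files staged or modified in git
git status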

If not using DVC but using a data repository, push all new files to GitHub as follows

git add *
git commit -m 'add profiles'
git push

Files generated

Running the profiling workflow with all the steps included generates the following files

Filename | Description | Location
<PLATE>.csv.gz | Aggregated well-level profiles | profiles/BATCH/PLATE
<PLATE>_augmented.csv.gz | Metadata-annotated profiles | profiles/BATCH/PLATE
<PLATE>_normalized.csv.gz | Profiles normalized to the whole plate | profiles/BATCH/PLATE
<PLATE>_normalized_negcon.csv.gz | Profiles normalized to the negative control | profiles/BATCH/PLATE
<PLATE>_normalized_feature_select_<LEVEL>.csv.gz | Whole-plate normalized profiles that are feature selected at the plate, batch, or all-plates level | profiles/BATCH/PLATE
<PLATE>_normalized_feature_select_negcon_<LEVEL>.csv.gz | Negative-control normalized profiles that are feature selected at the plate, batch, or all-plates level | profiles/BATCH/PLATE
<BATCH>_normalized_feature_select_<LEVEL>.csv.gz | Batch-level stacked whole-plate normalized profiles that are feature selected at the batch or all-plates level | gct/BATCH
<BATCH>_normalized_feature_select_<LEVEL>.gct | .gct file created from the <BATCH>_normalized_feature_select_<LEVEL>.csv.gz file | gct/BATCH
<BATCH>_normalized_feature_select_negcon_<LEVEL>.csv.gz | Batch-level stacked negative-control normalized profiles that are feature selected at the batch or all-plates level | gct/BATCH
<BATCH>_normalized_feature_select_negcon_<LEVEL>.gct | .gct file created from the <BATCH>_normalized_feature_select_negcon_<LEVEL>.csv.gz file | gct/BATCH
summary.tsv | Summary statistics | quality_control/summary
<PLATE>_cell_count.png | Plate cell count | quality_control/heatmap/BATCH/PLATE
<PLATE>_correlation.png | Pairwise correlation between all the wells on a plate | quality_control/heatmap/BATCH/PLATE
<PLATE>_position_effect.png | Percent matching between each well and other wells in the same row and column | quality_control/heatmap/BATCH/PLATE

Config file

Pipeline parameters

These are the parameters that all pipelines will require

Name of the pipeline

The name of the pipeline helps distinguish the different config files. It is not used by the pipeline itself.

pipeline: <PIPELINE NAME>

Output directory

Name of the directory where the profiles will be stored. It is profiles by default.

output_dir: profiles

Well name column

Name of the well name column in the aggregated profiles. It is Metadata_well_position by default.

platemap_well_column: Metadata_well_position

Compartment names

By default, CellProfiler features are extracted from three compartments: cells, cytoplasm, and nuclei. These compartments are listed in the config file as follows

compartments:
  - cells
  - cytoplasm
  - nuclei

If other 'non-canonical' compartments are present in the dataset, then those are added to the above list as follows

compartments:
  - cells
  - cytoplasm
  - nuclei
  - newcompartment

Note: if the name of the non-canonical compartment is newcompartment, then the features from that compartment should begin with Newcompartment (only the first character capitalized); for example, a feature might be named Newcompartment_AreaShape_Area. The pipeline will fail if camel case or any other format is used for feature names.

Other pipeline options

options:
  compression: gzip
  float_format: "%.5g"
  samples: all

aggregate parameters

These are parameters that are processed by the pipeline_aggregate() function that interacts with pycytominer.cyto_utils.cells.SingleCells() and aggregates single-cell profiles to create well-level profiles.

aggregate:
  perform: true
  plate_column: Metadata_Plate
  well_column: Metadata_Well
  method: median
  fields: all

fields: all aggregates across all fields of view. To aggregate only specific fields of view, replace all with a list of field numbers within the aggregate block, for example

fields:
  - 1
  - 4
  - 9

Additionally, to add whole-image features to the profiles, list the feature categories under the parameter image_feature_categories. For example

image_feature_categories:
  - Count
  - Intensity

annotate parameters

These are parameters that are processed by the pipeline_annotate() function that interacts with pycytominer.annotate() and annotates the well-level profiles with metadata.

annotate:
  perform: true
  well_column: Metadata_Well
  external:
    perform: true
    file: <metadata file name>
    merge_column: <Column to merge on>

normalize parameters

These are parameters that are processed by the pipeline_normalize() function that interacts with pycytominer.normalize() and normalizes all the wells to the whole plate.

normalize:
  perform: true
  method: mad_robustize
  features: infer
  mad_robustize_fudge_factor: 0
  image_features: true

normalize_negcon parameters

These are parameters that are processed by the pipeline_normalize() function that interacts with pycytominer.normalize() and normalizes all the wells to the negative control.

normalize_negcon:
  perform: true
  method: mad_robustize
  features: infer
  mad_robustize_fudge_factor: 0
  image_features: true

feature_select parameters

These are parameters that are processed by the pipeline_feature_select() function that interacts with pycytominer.feature_select() and selects features in the whole-plate normalized profiles.

feature_select:
  perform: true
  features: infer
  level: batch
  gct: false
  image_features: true
  operations:
    - variance_threshold
    - correlation_threshold
    - drop_na_columns
    - blocklist

feature_select_negcon parameters

These are parameters that are processed by the pipeline_feature_select() function that interacts with pycytominer.feature_select() and selects features in the profiles normalized to the negative control.

feature_select_negcon:
  perform: true
  features: infer
  level: batch
  gct: false
  image_features: true
  operations:
    - variance_threshold
    - correlation_threshold
    - drop_na_columns
    - blocklist

quality_control parameters

These parameters specify the type of quality control metrics and figures to generate. summary generates a table with summary statistics while heatmap generates three heatmaps, each showing a different quality control metric.

quality_control:
  perform: true
  summary:
    perform: true
    row: Metadata_Row
    column: Metadata_Col
  heatmap:
    perform: true

There are two different types of QC metrics that can be generated. summary creates summary.tsv with summary statistics of each plate from each batch. heatmap generates three heatmaps: cell count vs. platemap, correlation between the wells, and position effect vs. platemap.

summary:
  perform: true
  row: Metadata_Row
  column: Metadata_Col
heatmap:
  perform: true

batch and plates parameters

These parameters specify the name of the batch and the plates to process.

batch: <BATCH NAME>
plates:
  - name: <PLATE NAME>
    process: true

Rerunning the pipeline

To rerun the pipeline on a previously created data repository, make sure the profiling-recipe submodule and the DVC-tracked files are up to date:

# Fetch the profiling-recipe submodule at the committed version
git submodule update --init --recursive
# Download the DVC-tracked profiles from remote storage
dvc pull
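
Putting the earlier steps together, a rerun on a fresh machine might look roughly like this, reusing the paths and names from above and assuming environment.yml was committed to the data repository during the first run:

cd ~/work/projects/${PROJECT_NAME}/workspace/software
git clone <location of the data repository>.git
cd <name of the data repository>
git submodule update --init --recursive
conda env create --force --file environment.yml
conda activate profiling
dvc pull
python profiling-recipe/profiles/profiling_pipeline.py --config config_files/${CONFIG_FILE}.yml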

Using DVC

Additional information about using DVC that you may find useful:
When handling large files or folders, do NOT add them to GitHub with git add. Instead, add them to DVC with dvc add. This hashes the file/folder into the local DVC cache, creates a small .dvc pointer file in the GitHub repo (which is tracked instead of the file/folder itself), and updates .gitignore so that git does not track the large file/folder. Then run dvc push to upload the data to S3.

# Add a file or folder to DVC
dvc add LARGEFILE.csv
dvc push

Then add the newly created .dvc file and the updated .gitignore to GitHub and commit.

git add LARGEFILE.csv.dvc
git add .gitignore
git commit -m "add largefile"

Download data stored by DVC in S3:

# Download ALL data stored by DVC in S3
# Only do this if you really, truly want ALL the data
dvc pull

# Download a specific file stored by DVC in S3
dvc get https://github.com/ORGANIZATION/DATA-REPO.git relative/path/to/LARGEFILE.csv

DVC converts file names into content hashes in S3. To see the hash for a given DVC-tracked file (so that you can find it directly on S3), add the --show-url flag to the get command:

dvc get --show-url https://github.com/ORGANIZATION/DATA-REPO.git relative/path/to/LARGEFILE.csv