AlexsLemonade / scpca-nf

scpca-nf is the Nextflow workflow for processing Single-cell Pediatric Cancer Atlas Portal data
BSD 3-Clause "New" or "Revised" License

Add Cell level metadata to AnnData objects #366

Closed allyhawkins closed 1 year ago

allyhawkins commented 1 year ago

To make our AnnData output as compliant as possible with CZI, we need to update the existing cell metadata present in the AnnData objects to include the necessary entries. The full documentation can be found here: https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md#obs-cell-metadata

In brief we will need to add:

We also need to include cell type ontology, but that is a separate addition we are tracking.

One thing to think about as we do this is whether we want to add these items to the SCE objects before conversion or to the AnnData objects after conversion. I do think that when we generate merged SCE objects, we may want some of these things (like disease, tissue, sex) to be in our SCE objects. So maybe instead of including a process to do that in AnnData, we just apply it to the SCE objects.

allyhawkins commented 1 year ago

Now that we are incorporating the sample level metadata as part of the unfiltered SCE objects, this data will already be present in the AnnData.uns slot. We will need to take the list of metadata present and use it to populate the cell metadata columns that are required by CZI.

allyhawkins commented 1 year ago

Before I start to implement these changes, I wanted to outline my ideas for implementation. Right now, each SCE object stores sample_metadata as a single-row data frame. The contents of the sample_metadata need to be stored as columns in the cell-level metadata in the AnnData object. Additionally, we need to pull out the 10X kit and suspension type from the library_metadata to add to the cell-level metadata.

I think that means we need to add two steps to the script that converts SCE objects to AnnData. So the new sce_to_anndata.R would have the following outline:

  1. Grab library metadata for 10X kit and suspension type and add to colData (new function in scpcaTools or we could just keep this within the script run in the workflow)
    1. I think we only want the CZI required columns to be added here, so assay and suspension_type
  2. Add sample metadata to colData (new function in scpcaTools)
    1. This would take all columns of the sample_metadata and add them as new columns in colData
  3. Convert SCE to AnnData (already exists)
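
To make the first two steps of the outline concrete, here is a rough sketch of the logic only. It is not the scpcaTools or sce_to_anndata.R implementation: it uses plain Python dicts as stand-ins for colData, library_metadata, and sample_metadata, and the function names, field names, and example values are all hypothetical. The one behavior it tries to show is that step 1 copies only the selected library fields onto every cell, while step 2 copies every sample field.

```python
# Hypothetical stand-ins: each cell's metadata is a dict (one "row" of colData),
# and library/sample metadata are single-record dicts.

def add_library_metadata(cell_metadata, library_metadata,
                         keys=("assay", "suspension_type")):
    """Step 1: copy only the selected library-level fields onto every cell."""
    missing = [k for k in keys if k not in library_metadata]
    if missing:
        raise KeyError(f"library metadata is missing: {missing}")
    selected = {k: library_metadata[k] for k in keys}
    return [{**cell, **selected} for cell in cell_metadata]


def add_sample_metadata(cell_metadata, sample_metadata):
    """Step 2: copy every sample-level field onto every cell."""
    existing = set().union(*(cell.keys() for cell in cell_metadata))
    clashes = existing & set(sample_metadata)
    if clashes:
        # Refuse to silently overwrite cell-level columns.
        raise ValueError(f"columns already present: {sorted(clashes)}")
    return [{**cell, **sample_metadata} for cell in cell_metadata]


# Illustrative values only; real field names and values may differ.
cells = [{"barcode": "AAAC"}, {"barcode": "TTTG"}]
library = {"assay": "10Xv3", "suspension_type": "cell", "library_id": "L1"}
sample = {"sex": "F", "tissue": "brain", "disease": "glioma"}

annotated = add_sample_metadata(add_library_metadata(cells, library), sample)
```

Every cell in `annotated` ends up with the same assay, suspension_type, and sample columns, and library fields outside the selected keys (here `library_id`) are left out. Step 3, the SCE to AnnData conversion, is unchanged and not shown.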

We also need to consider how we want the CITE-seq or cell hashing datasets to look. Do we want to add the same contents to the cell-level metadata for those objects as well? I am thinking we want the contents of those files to mirror the RNA files, so we will need to take the same steps for both RNA and feature objects.

Tagging @jashapiro for any feedback on this approach.

allyhawkins commented 1 year ago

All of the items in the checklist have been added so I'm going to close this.