cellannotation / cas-tools

Cell Annotation Schema Tools

Add support for breaking up anndata files into separate files #109

Closed ubyndr closed 1 month ago

ubyndr commented 1 month ago

MVP Specification for Splitting AnnData Based on Cell IDs from a CAS JSON File

Objective

Create a function that splits an AnnData object into multiple subsets based on the cell IDs listed in a CAS JSON file. Each subset should preserve the structure of the original data, including all relevant annotations.

Requirements

  1. Input Data:

    • An AnnData object containing single-cell RNA sequencing data.
    • A CAS JSON file whose annotations field lists the cell IDs for each subset.
  2. Output:

    • Multiple AnnData objects, each corresponding to a set of cell IDs specified in the CAS JSON file.

JSON File Structure

The CAS JSON file should follow this structure:

{
  "author_name": "John Doe",
  "annotations": [
    {
      "labelset": "CrossArea_cluster",
      "cell_label": "VC_1",
      "cell_set_accession": "CrossArea_cluster:5b3c85bd",
      "cell_fullname": "VC_1",
      "parent_cell_set_name": "VC",
      "parent_cell_set_accession": "CrossArea_subclass:add985a2",
      "user_annotations": [
        {
          "labelset": "history",
          "cell_label": "None"
        }
      ],
      "cell_ids": [
        "CTAGACAAGTTCATCG",
        "GCTTGGGCAACGCCCA",
        "TTCCGGTGTAGCGAGT",
        "TTGGGCGCAGCAGTCC",
        "ATCCCTGAGCAACAGC",
        "CTTCGGTGTCCCGGTA",
        "TAAGCCACACAGCATT"
      ]
    }
  ]
}
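As a rough illustration of how this structure could be consumed, the sketch below maps each annotation's cell_label to its cell_ids using only the standard library. The helper name cell_ids_by_label is hypothetical, and treating cell_ids as optional (skipping annotations without it) is an assumption, not something this spec states.

```python
import json
from typing import Dict, List


def cell_ids_by_label(cas: dict) -> Dict[str, List[str]]:
    """Map each annotation's cell_label to its list of cell_ids.

    Hypothetical helper; annotations without cell_ids are skipped (assumption).
    """
    groups: Dict[str, List[str]] = {}
    for annotation in cas.get("annotations", []):
        cell_ids = annotation.get("cell_ids")
        if cell_ids:
            groups[annotation["cell_label"]] = list(cell_ids)
    return groups


# Abbreviated version of the example document above.
example = json.loads("""
{
  "author_name": "John Doe",
  "annotations": [
    {
      "labelset": "CrossArea_cluster",
      "cell_label": "VC_1",
      "cell_ids": ["CTAGACAAGTTCATCG", "GCTTGGGCAACGCCCA"]
    }
  ]
}
""")

groups = cell_ids_by_label(example)
print(groups)
```

If the same cell_label can occur in more than one labelset, keying on cell_set_accession instead may be safer; that choice is left open here.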

Function Specification

Steps

  1. Load the AnnData object.
  2. Load and parse the JSON file.
  3. For each annotation specified in the JSON file:
    • Extract the cell_label and cell_ids.
    • Filter the AnnData object to include only the cells listed in cell_ids.
    • Create a new AnnData object for the subset.
  4. Write the resulting AnnData objects to separate files.
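The steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation in cas-tools: the function names, the one-file-per-cell_label naming (<cell_label>.h5ad), and the decision to silently ignore cell IDs missing from the AnnData object are all hypothetical choices. It relies on AnnData's indexing by observation names, .copy(), and write_h5ad().

```python
import json
from pathlib import Path
from typing import Dict


def split_anndata_by_cas(adata, cas: dict, out_dir) -> Dict[str, Path]:
    """Write one .h5ad file per annotation; return {cell_label: path}.

    Hypothetical helper. Assumes cell_label values are unique and
    safe to use as file names.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    present = set(adata.obs_names)
    written: Dict[str, Path] = {}

    # Step 3: one subset per annotation in the CAS file.
    for annotation in cas.get("annotations", []):
        label = annotation["cell_label"]
        # Keep only IDs that exist in the data (assumption: ignore the rest).
        ids = [c for c in annotation.get("cell_ids", []) if c in present]
        if not ids:
            continue
        # Index by observation names, then copy so the subset is a
        # standalone object rather than a view of the original.
        subset = adata[ids].copy()
        # Step 4: write the subset to its own file.
        path = out_dir / f"{label}.h5ad"
        subset.write_h5ad(path)
        written[label] = path
    return written


def split_h5ad_file(h5ad_path, cas_path, out_dir) -> Dict[str, Path]:
    """Steps 1 and 2: load both inputs, then delegate to the splitter."""
    import anndata as ad  # imported lazily; only needed when reading from disk

    adata = ad.read_h5ad(h5ad_path)
    with open(cas_path, encoding="utf-8") as handle:
        cas = json.load(handle)
    return split_anndata_by_cas(adata, cas, out_dir)
```

Subsetting by observation name keeps the matching rows of obs and the full var table together with the expression matrix, which is what the "retain all relevant annotations" requirement asks for. A call might look like split_h5ad_file("data.h5ad", "cas.json", "splits/"), where all three paths are placeholders.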

ubyndr commented 1 month ago

cc @dosumis @hkir-dev