chanzuckerberg / single-cell-curation

Code and documentation for the curation of cellxgene datasets
MIT License

Add suspension_type #227

Closed jahilton closed 2 years ago

jahilton commented 2 years ago

Researchers need to be aware of which data are single-cell vs single-nucleus so that they can take special care to ensure proper integration. Most contributors don't provide it automatically because it usually isn't a variable in their study (they either have all sc or all sn). However, when a study does include sn + sc data, the difference is a topic of significant discussion: https://doi.org/10.1038/s41598-020-58327-6 https://doi.org/10.1371/journal.pone.0209648 https://doi.org/10.1002/hep4.1854

This was also supported during a recent data integration workshop. Data integrators were given datasets in schema 2.0.0 and one of the first tasks they did was to curate each dataset as single-cell or single-nucleus.

Proposal

Lattice submits data as obs.suspension_type with values either cell or nucleus.

Spatial data will need to be considered for when multiple cells are likely captured (spot is common terminology for Visium - ref, while bead is common for Slide-seq - ref).

ambrosejcarr commented 2 years ago

@jahilton could you think of something more descriptive than suspension_type that we might call the field? For example:

biological_input_type = {cell, nucleus, spot, bead}

jahilton commented 2 years ago

@jychien made a great point that spot vs bead isn't a useful distinction, so it's acceptable to have something that describes cell/nucleus but is na for spatial data. (It also saves us from having to keep up with the terminology for each new spatial assay.)

In relation to the matrix, these are orienting consumers to the observations, so observation_type or unit_of_observation. From more of the experimental viewpoint, isolated_anatomical_unit.

brianraymor commented 2 years ago

@jahilton to add more details about assay dependencies. @BAevermann to review for spatial cases and add details about prior art.

jahilton commented 2 years ago

assay-observation_type Dependencies

BAevermann commented 2 years ago

Jason's mapping above looks fairly comprehensive. Being a bit more general, I would add:

smFISH [EFO_0009918] + its descendants should be na
spatial transcriptomics by high-throughput sequencing [EFO_0030005] + its descendants should be na

brianraymor commented 2 years ago

@ambrosejcarr @jahilton - not sure that there's much value in reusing NCIT, but documenting it:

Single Cell Specimen
Single Nucleus Specimen

Also see Biospecimen

brianraymor commented 2 years ago

Guidelines for reporting single-cell RNA-Seq experiments

Single Cell Isolation
Single cell entity: The type of single cell entity derived from isolation protocol e.g. "whole cell", "nucleus", "cell-cell multimer", "spatially encoded cell barcoding".

brianraymor commented 2 years ago

I'm not finding a source that is definitive for mapping assay-observation_type dependencies for the validator. We can certainly provide examples, but validation would be brittle.

@jahilton - just blue-skying, what would you think of:

We define a file format that maps EFO assay to observation_type values. Lattice maintains the file. The validator reads the file and:

  1. Succeeds and sets the value in the dataset for the curator if possible. cite-seq would be "cell".
  2. Fails if the assay is not defined in the file. Lattice updates.
  3. Fails if no value has been set by the curator and the value could be "cell" or "nucleus" because the assay allows either.
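
A minimal sketch of the three outcomes listed above. The JSON layout, the function name `resolve_observation_type`, and the sample entries are assumptions made for illustration only; no file format has been agreed on in this thread. The extra check on a curator-supplied value is also an assumption.

```python
import json

# Hypothetical mapping file content (assumption: JSON, assay label -> allowed values).
MAPPING_JSON = """
{
  "CITE-seq": ["cell"],
  "DroNc-seq": ["nucleus"],
  "sci-RNA-seq": ["cell", "nucleus"]
}
"""


def resolve_observation_type(assay, curated_value, mapping):
    """Return the value to store, or raise ValueError when validation fails."""
    if assay not in mapping:
        # Outcome 2: the assay is not defined in the file.
        raise ValueError(f"assay {assay!r} is not defined in the mapping file")
    allowed = mapping[assay]
    if curated_value is None:
        if len(allowed) == 1:
            # Outcome 1: only one value is possible, so set it for the curator.
            return allowed[0]
        # Outcome 3: the assay allows either value and the curator set none.
        raise ValueError(f"assay {assay!r} allows {allowed}; the curator must choose")
    if curated_value not in allowed:
        # Assumption: a curator-supplied value must still agree with the file.
        raise ValueError(f"{curated_value!r} is not allowed for assay {assay!r}")
    return curated_value


mapping = json.loads(MAPPING_JSON)
print(resolve_observation_type("CITE-seq", None, mapping))  # cell
```

Keeping the mapping in data rather than in validator code is what would let Lattice update it without a validator release.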
jahilton commented 2 years ago

That doesn't feel right. We'd be updating the validation rules without revalidating existing datasets. I also understand not wanting to hard-code these in the validator given that they'll almost certainly be insufficient even for current assays/terms, but definitely as new assays/terms are developed.

So it's almost definitely going to be a mapping that we maintain. Just need to weigh whether that mapping should be consulted by the validator or if we should include this check in our curation 'on the side'. Currently leaning towards the latter.

brianraymor commented 2 years ago

> That doesn't feel right. We'd be updating the validation rules without revalidating existing datasets.

Not exactly since the validation rules would indicate that the validator uses the information in the referenced file.

Under what conditions (besides a bug in the file like "cite-seq -> nucleus") would re-validation be required from your perspective?

jahilton commented 2 years ago

> Under what conditions (besides a bug in the file like "cite-seq -> nucleus") would re-validation be required from your perspective?

I guess it would be the case where we tighten a given mapping. We think an assay could be cell or nucleus, but later dive deeper and discover that it's only for one.

brianraymor commented 2 years ago

Even assuming no automation, curators would still need to identify and revise all datasets that specified the assay with the tightened mapping.

With automation:

  1. Lattice updates the mapping from "any" to "cell".
  2. Lattice identifies all impacted datasets.
  3. Lattice downloads and strips observation_type.
  4. Lattice re-validates and resubmits.
  5. Bob's your uncle.

Or we script the change in the portal and force an overwrite on the list of datasets that meet the criteria.
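
The numbered steps above could be sketched roughly as follows. Datasets are represented here as plain dicts with hypothetical `id` and `assay` keys, and the tightening of CEL-seq2 is an invented example; real datasets and portal calls would look different.

```python
def tightened_assays(old, new):
    """Assays whose allowed values in `new` are a strict subset of those in `old`."""
    return {a for a in old if a in new and set(new[a]) < set(old[a])}


def strip_impacted(datasets, assays):
    """Remove observation_type from impacted datasets and return their IDs."""
    impacted = []
    for ds in datasets:
        if ds["assay"] in assays:
            # Stripping the field lets re-validation set or demand it again.
            ds.pop("observation_type", None)
            impacted.append(ds["id"])
    return impacted


# Hypothetical before/after mappings: CEL-seq2 tightened from "any" to "cell".
old = {"CEL-seq2": ["cell", "nucleus"], "Drop-seq": ["cell"]}
new = {"CEL-seq2": ["cell"], "Drop-seq": ["cell"]}
datasets = [
    {"id": "ds1", "assay": "CEL-seq2", "observation_type": "cell"},
    {"id": "ds2", "assay": "Drop-seq", "observation_type": "cell"},
]
impacted = strip_impacted(datasets, tightened_assays(old, new))
print(impacted)  # ['ds1']
```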

brianraymor commented 2 years ago

EFO defines a single cell isolation protocol but offers no equivalent for single cell nucleus isolation protocol.

jychien commented 2 years ago

Some thoughts on naming of field:

As for validation, I am usually in favor of automating as much as we can. But, as I am going through adding new assays to EFO, the ontology is still a work in progress and takes time to get those changes in. What would be a nice middle ground is to automate using the current list of dependencies, and only update the list of dependencies during migration. That way, the datasets would all be uniformly validated with the newly pinned EFO. I mean, even with automated validation, there is no way to be completely up to date with assays, so curators will need to keep an eye out. Luckily, cell vs nuclei is one of those fields that should be easily recognizable if incorrect by data contributors, so, not toooo worried about getting it wrong.

brianraymor commented 2 years ago

Clarifying - would your suggestion be that the field must only be present and named more appropriately for applicable assays - avoiding the "na" that was introduced by this comment?

jychien commented 2 years ago

My first bullet point would be the reason behind why I think observation_type or unit_of_observation could lead to confusion for capturing cell vs nuclei metadata. For ease of data integration, I think the field should be present in all datasets. After thinking about it some more, I am thinking the options are:

I am leaning towards the concept of biological_input_type

brianraymor commented 2 years ago

@ambrosejcarr @jychien - what would be the difference between a Biospecimen and a biological input type?

There is also a Tissue Section available in the reference that I shared above. (CCF has a tissue section).

jychien commented 2 years ago

Good question. A Biospecimen would be describing the type of tissue taken as sample collection. Biospecimen would be more related to https://github.com/chanzuckerberg/single-cell-curation/issues/240. biological_input_type, as I had interpreted it, is describing the entity that went into library construction and subsequent sequencing. Maybe there is something less ambiguous for field name? library_construction_input_type? Seems really long and wordy, though. For values, may be more accurate as cell suspension, nuclei suspension, or tissue section.

brianraymor commented 2 years ago

Reads:

Human biospecimens are biological materials that are obtained from living or deceased human subjects. Biospecimens are commonly also referred to as biological specimens, biological samples, biosamples or samples. All of these terms are used interchangeably.

OK then.

Would a Single Nucleus Specimen, defined as "A biospecimen that contains the contents of a single nucleus.", be the result of some dissociation (or isolation) protocol, and would it subsequently be the input for the library construction? Otherwise, I'm trying to understand the context for the definition of the term. It seems quite close to the entity definition in the guidelines above.

Also, does the DCP schema model this as cell_suspension + disassociation_protocol?

jahilton commented 2 years ago

I don't believe the DCP models the cell/nucleus suspension at all. Even single-nucleus suspensions are captured as 'cell_suspension' objects.

For this cellxgene field, I start with what Users want - they just want to know cell or nucleus. Some protocols don't fit into those 2 terms, and users don't need any additional information for those, so add na and there's your enum - cell, nucleus, or na. A property with a tissue section value doesn't really make sense because it isn't useful information. Also it isn't a parallel term as the section is the whole dataset, while the cell/nucleus is each observation in the dataset.

So with an enum of cell, nucleus, na, the property names with "observation" or "input_type" don't really make sense because then na is...well, it's not applicable because the spatial assays do have observations and inputs. The suspension/dissociation/isolation terms are more of what we're capturing. I like "suspension" more because it's focused on the entity, rather than the process/action/protocol. suspension_type wasn't descriptive enough, but I think no matter what we call it, it won't be descriptive enough for someone to understand what's captured just from the property name. On the flip side, everyone will immediately understand what's being captured when they see the enum no matter what we call it (even bobs_your_uncle).

jychien commented 2 years ago

Looks like DCP schema has "single cell" vs "single nucleus" information in their library protocol schema.

brianraymor commented 2 years ago

Thanks for the pointer @jychien. It's also exposed in their filter under the same name:

[Image: Screen Shot 2022-07-12 at 1 25 34 PM]

brianraymor commented 2 years ago

@hthomas-czi and I agree that it makes sense to adopt the Lattice model with the addition of "na" rather than creating another variation:

    "suspension_type": {
        "title": "Suspension type",
        "description": "The type of suspension: cell or nucleus.",
        "type": "string",
        "enum": [
            "cell",
            "nucleus"
        ]
    },
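
As a rough illustration of the "addition of na", the property above could be extended and checked like this. The extended enum and the helper `invalid_values` are assumptions for illustration, not the final schema text.

```python
# Assumption: the Lattice property shown above, extended with "na" as proposed.
SUSPENSION_TYPE = {
    "title": "Suspension type",
    "description": "The type of suspension: cell or nucleus.",
    "type": "string",
    "enum": ["cell", "nucleus", "na"],
}


def invalid_values(values, prop=SUSPENSION_TYPE):
    """Return the distinct values, sorted, that are not in the enum."""
    return sorted({v for v in values if v not in prop["enum"]})


print(invalid_values(["cell", "nucleus", "na", "spot"]))  # ['spot']
```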
brianraymor commented 2 years ago

I've created a table of our existing assays and started to assign values to assess dependencies. @jahilton @jychien - could you review for accuracy and also help me complete or extend the table?

Values:

  1. cell
  2. nucleus
  3. cell or nucleus
  4. na

| Assay | Value(s) | Notes |
| --- | --- | --- |
| 10x 3' transcription profiling | cell or nucleus | Will be addressed as 10x transcription profiling and its children |
| 10x 3' v1 | cell or nucleus | Will be addressed as 10x transcription profiling and its children |
| 10x 3' v2 | cell or nucleus | Will be addressed as 10x transcription profiling and its children |
| 10x 3' v3 | cell or nucleus | Will be addressed as 10x transcription profiling and its children |
| 10x 5' transcription profiling | cell or nucleus | Will be addressed as 10x transcription profiling and its children |
| 10x 5' v1 | cell or nucleus | Will be addressed as 10x transcription profiling and its children |
| 10x 5' v2 | cell or nucleus | Will be addressed as 10x transcription profiling and its children |
| 10x scATAC-seq | nucleus | child of scATAC-seq |
| 10x technology | cell or nucleus or na? | |
| CEL-seq2 | cell or nucleus? | |
| DroNc-seq | nucleus | |
| Drop-seq | cell | |
| MERFISH | na | Will be addressed as smFISH and its children |
| Patch-seq | cell | |
| Seq-Well | cell | |
| Slide-seq | na | Will be addressed as spatial transcriptomics by high-throughput sequencing and its children |
| Smart-seq | cell or nucleus | Will be addressed as Smart-like and its children |
| Smart-seq2 | cell or nucleus | Will be addressed as Smart-like and its children |
| Visium Spatial Gene Expression | na | Will be addressed as spatial transcriptomics by high-throughput sequencing and its children |
| microwell-seq | cell | |
| scATAC-seq | nucleus | |
| sci-RNA-seq | cell or nucleus | |
| snmC-seq | nucleus | |

Others from the comments above:

| Assay | Value(s) | Notes |
| --- | --- | --- |
| CITE-seq | cell | and all its children |
| sci-Plex | nucleus | |
| snmC-seq2 | nucleus | |

jahilton commented 2 years ago

Did a pass. 10x scATAC-seq and scATAC-seq could be addressed by ATAC-seq [EFO:0007045] and its children.

Searched for CEL-seq2: it certainly can be "cell", but I found no definitive information about single-nucleus potential, so I'm guessing it's "cell or nucleus".

@jychien can you review?

jychien commented 2 years ago

Agree with @jahilton that Cel-seq2 could potentially be adapted for nuclei. The journal mentions that CEL-Seq2 is compatible with different platforms, keeping it open to the potential of implementing CEL-Seq2 with nuclei. The rest looks good to me.

And FYI, as for extending the validation table, I am working on a few datasets with assays yet to be in efo. Phenocycler/CODEX has been added, but will not appear until next efo release (https://github.com/EBISPOT/efo/issues/1630). In the interim, I am using 'protein assay'. Unclear as to when this collection will be published. We can just have validation skip these terms.

brianraymor commented 2 years ago

@jychien - It looks like EFO releases on a monthly cadence. Will this appear in the July release? We'll probably want to wait until the last moment possible to update the pinned ontologies.

brianraymor commented 2 years ago

@jychien @jahilton - would it be reasonable to enforce "cell" for CEL-Seq2 for 3.0.0 and revisit when needed?

jahilton commented 2 years ago

I don't see any benefit to that approach. At worst, it confuses someone with single-nucleus CEL-Seq2 data. Any reason to not allow cell or nucleus?

brianraymor commented 2 years ago

I was responding to Jenny's "could potentially":

> Agree with @jahilton that Cel-seq2 could potentially be adapted for nuclei. The journal mentions that CEL-Seq2 is compatible with different platforms, keeping it open to the potential of implementing CEL-Seq2 with nuclei.

which sounded like "not yet but maybe some day".

And if "some day" arrived, then we could update the schema to celebrate.

brianraymor commented 2 years ago

@jahilton @jychien

It appears that 10x technology can be set to any value - cell, nucleus, or na, since it includes ATAC and Visium Spatial?

I guess that also begs the question - under what circumstances is this assay being used rather than a more accurate term?

jahilton commented 2 years ago

For CEL-seq2, if we start with "cell" based on current best knowledge, what will it take to allow submission if someone hands us single-nucleus CEL-seq2 data?


> It appears that 10x technology can be set to any value - cell, nucleus, or na, since it includes ATAC and Visium Spatial?

Order of logic matters here. If you first pull out the descendants of ATAC (as nucleus) and descendants of spatial (as na), then the remaining 10x technology descendants would be cell or nucleus
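
A small sketch of why the order matters, using first-match-wins rules. The ancestor sets are a toy stand-in for the EFO hierarchy and are assumptions for illustration; a real implementation would read ancestors from the pinned ontology.

```python
# Toy ancestor sets (assumption: illustrative only, not read from EFO).
ANCESTORS = {
    "10x scATAC-seq": {"10x scATAC-seq", "scATAC-seq", "ATAC-seq", "10x technology"},
    "Visium Spatial Gene Expression": {
        "Visium Spatial Gene Expression",
        "spatial transcriptomics by high-throughput sequencing",
        "10x technology",
    },
    "10x 3' v3": {"10x 3' v3", "10x 3' transcription profiling", "10x technology"},
}

# Rules are checked in order; the first rule whose term is an ancestor wins.
RULES = [
    ("ATAC-seq", ["nucleus"]),
    ("spatial transcriptomics by high-throughput sequencing", ["na"]),
    ("10x technology", ["cell", "nucleus"]),
]


def allowed_values(assay):
    """Return the allowed values for an assay, or None if no rule matches."""
    ancestors = ANCESTORS.get(assay, {assay})
    for term, values in RULES:
        if term in ancestors:
            return values
    return None


print(allowed_values("10x scATAC-seq"))  # ['nucleus']
print(allowed_values("10x 3' v3"))  # ['cell', 'nucleus']
```

If the 10x technology rule were listed first, both the ATAC and the Visium terms would wrongly resolve to cell or nucleus.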


> under what circumstances is this assay being used rather than a more accurate term?

Current cases are 10x multiome (the assay isn't in the pinned ontology, so they will be updated) and RNA-seq data (via integration datasets) that are known to be 10x but where 3' vs 5' and the kit version are unknown. I imagine we'd push back on or reject these in current times. We should review those, as I am guessing we can narrow them to at least 10x transcription profiling.

brianraymor commented 2 years ago

Here's the first draft for the schema section.

In theory, the cellxgene-schema CLI could automate annotation for cases which accept a single value such as "cell"; however, this would be unlike the other fields where all annotation is assigned to either the curator or the portal. Foolish consistency?

suspension_type

Key suspension_type
Annotator Curator
Value categorical with str categories. This MUST be "cell", "nucleus", or "na".

This MUST be the correct type for the corresponding assay.

| For Assay | MUST Use |
| --- | --- |
| 10x transcription profiling [EFO:0030080] and its children | "cell" or "nucleus" |
| ATAC-seq [EFO:0007045] and its children | "nucleus" |
| CEL-seq2 [EFO:0010010] | "cell" or "nucleus" |
| CITE-seq [EFO:0009294] and its children | "cell" |
| DroNc-seq [EFO:0008720] | "nucleus" |
| Drop-seq [EFO:0008722] | "cell" |
| microwell-seq [EFO:0030002] | "cell" |
| Patch-seq [EFO:0008853] | "cell" |
| sci-RNA-seq [EFO:0010550] | "cell" or "nucleus" |
| sci-Plex [EFO:0030026] | "nucleus" |
| Seq-Well [EFO:0008919] | "cell" |
| Smart-like [EFO:0010184] and its children | "cell" or "nucleus" |
| smFISH [EFO:0009918] and its children | "na" |
| snmC-seq [EFO:0008939] | "nucleus" |
| snmC-seq2 [EFO:0030027] | "nucleus" |
| spatial proteomics [EFO:0700000] and its children | "na" |
| spatial transcriptomics by high-throughput sequencing [EFO:0030005] and its children | "na" |


brianraymor commented 2 years ago

We should also add the new assays that we have been waiting for and which will appear in the update to the pinned EFO ontology.

brianraymor commented 2 years ago

To help with evolution of the validation, the validator should warn when it does not find a match in the table. Then we can create a tracking issue to add the assay to the table in the next version.
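
A minimal sketch of the warn-on-miss behavior, assuming a lookup keyed by EFO term ID. The function name `check_suspension_type` is hypothetical, only a few exact-match rows from the draft table are included, and the "and its children" rules are left out for brevity.

```python
import warnings

# Assumption: a few exact-match rows copied from the draft table above.
SUSPENSION_TABLE = {
    "EFO:0008720": ["nucleus"],  # DroNc-seq
    "EFO:0008722": ["cell"],  # Drop-seq
    "EFO:0010010": ["cell", "nucleus"],  # CEL-seq2
}


def check_suspension_type(assay_term_id, suspension_type):
    """Return True if valid or unchecked; warn when the assay has no table row."""
    allowed = SUSPENSION_TABLE.get(assay_term_id)
    if allowed is None:
        # No rule yet: warn instead of failing so the gap can be tracked.
        warnings.warn(
            f"no suspension_type rule for assay {assay_term_id}; value not checked"
        )
        return True
    return suspension_type in allowed


print(check_suspension_type("EFO:0008722", "cell"))  # True
print(check_suspension_type("EFO:0008720", "cell"))  # False
```

Warning rather than failing keeps new assays submittable while the table catches up in the next version.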

jahilton commented 2 years ago

New assays that we'll add with the ontology bump (via #205)

jahilton commented 2 years ago

The CLI idea seems like more work than what it's worth

If we did go that route, then cases like CEL-seq2 should certainly start with the lenient (cell or nucleus) option. Otherwise, we leave it open to a curator not catching when someone does adapt it for nucleus.

brianraymor commented 2 years ago

> The CLI idea seems like more work than what it's worth

Noted.

> add a rule for EFO:0800000 'spatial proteomics' (incoming term for the next release) & its descendants --> na

Updated the table above. Fingers crossed that EFO releases before we update the pinned ontologies.