Closed jahilton closed 2 years ago
@jahilton could you think of something more descriptive than suspension_type that we might call the field? For example:
biological_input_type = {cell, nucleus, spot, bead}
@jychien made a great point that spot vs bead isn't a useful distinction, so it's acceptable to have something that describes cell/nucleus but is na for spatial data. (It also saves us from having to keep up with the terminology of each new spatial assay.)
In relation to the matrix, these are orienting consumers to the observations, so observation_type or unit_of_observation.
From more of the experimental viewpoint, isolated_anatomical_unit.
@jahilton to add more details about assay dependencies. @BAevermann to review for spatial cases and add details about prior art.
assay-observation_type Dependencies
[EFO:0007045] should be nucleus
[EFO:0009920] should be na
[EFO:0030062] should be na
[EFO:0010961] should be na
[EFO:0009294] + its descendants should be cell
[EFO:0008939] should be nucleus
[EFO:0030027] should be nucleus
[EFO:0008720] should be nucleus
[EFO:0030026] should be nucleus
Jason's mapping above looks fairly comprehensive. Being a bit more general, I would add:
smFISH [EFO_0009918] + its descendants should be na
spatial transcriptomics by high-throughput sequencing [EFO_0030005] + its descendants should be na
@ambrosejcarr @jahilton - not sure that there's much value in reusing NCIT, but documenting it:
Single Cell Specimen
Single Nucleus Specimen
Also see Biospecimen
Guidelines for reporting single-cell RNA-Seq experiments
Single Cell Isolation | |
---|---|
Single cell entity | The type of single cell entity derived from isolation protocol e.g. "whole cell", "nucleus", "cell-cell multimer", "spatially encoded cell barcoding". |
I'm not finding a source that is definitive for mapping assay-observation_type dependencies for the validator. We can certainly provide examples, but validation would be brittle.
@jahilton - just blue-skying, what would you think of:
We define a file format that maps EFO assay to observation_type values. Lattice maintains the file. The validator reads the file and:
That doesn't feel right. We'd be updating the validation rules without revalidating existing datasets. I also understand not wanting to hard-code these in the validator, given that they'll almost certainly be insufficient even for current assays/terms, and definitely will be as new assays/terms are developed.
So it's almost definitely going to be a mapping that we maintain. Just need to weigh whether that mapping should be consulted by the validator or if we should include this check in our curation 'on the side'. Currently leaning towards the latter.
> That doesn't feel right. We'd be updating the validation rules without revalidating existing datasets.
Not exactly since the validation rules would indicate that the validator uses the information in the referenced file.
Under what conditions (besides a bug in the file like "cite-seq -> nucleus") would re-validation be required from your perspective?
> Under what conditions (besides a bug in the file like "cite-seq -> nucleus") would re-validation be required from your perspective?
I guess it would be the case where we tighten a given mapping. We think an assay could be cell or nucleus, but later dive deeper and discover that it's only for one.
Even assuming no automation, curators would still need to identify and revise all datasets that specified the assay with the tightened mapping.
With automation:
observation_type
Or we script the change in the portal and force an overwrite on the list of datasets that meet the criteria.
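As a rough illustration of the "identify the datasets affected by a tightened mapping" step discussed above, here is a minimal sketch. The record keys (`id`, `assay`, `suspension_type`), the sample data, and the function name are all assumptions made for the example; none of this is a portal or validator API.

```python
# Hypothetical dataset records; the keys "id", "assay", and "suspension_type"
# are assumptions for illustration only.
DATASETS = [
    {"id": "ds-1", "assay": "CEL-seq2", "suspension_type": "cell"},
    {"id": "ds-2", "assay": "CEL-seq2", "suspension_type": "nucleus"},
    {"id": "ds-3", "assay": "Drop-seq", "suspension_type": "cell"},
]


def datasets_needing_revision(datasets, assay, tightened_allowed):
    """Return ids of datasets whose value is no longer allowed for `assay`."""
    return [
        d["id"]
        for d in datasets
        if d["assay"] == assay and d["suspension_type"] not in tightened_allowed
    ]


# Suppose the mapping for CEL-seq2 were tightened from {cell, nucleus} to {cell}.
affected = datasets_needing_revision(DATASETS, "CEL-seq2", {"cell"})
print(affected)  # ['ds-2']
```

Whether the resulting list is handed to curators or fed to a scripted overwrite is the open question in this thread; the sketch only covers finding the candidates.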
EFO defines a single cell isolation protocol but offers no equivalent for single cell nucleus isolation protocol.
Some thoughts on naming of field:
- observation_type or unit_of_observation: I think most users would not expect "na" as an option. Every matrix object should have some sort of unit or type to describe its observation. So, unless we want to use a generic spatial value (such as "spatial unit"), these field names might not be great.
- suspension_type: can try isolation_type or dissociation_unit (although, technically, blood does not require dissociation). I really can't think of anything better than these, unfortunately.

As for validation, I am usually in favor of automating as much as we can. But, as I am going through adding new assays to EFO, the ontology is still a work in progress and it takes time to get those changes in. A nice middle ground would be to automate using the current list of dependencies, and only update the list of dependencies during migration. That way, the datasets would all be uniformly validated with the newly pinned EFO. I mean, even with automated validation, there is no way to be completely up to date with assays, so curators will need to keep an eye out. Luckily, cell vs nuclei is one of those fields that data contributors should easily recognize if incorrect, so, not toooo worried about getting it wrong.
Clarifying - would your suggestion be that the field must only be present and named more appropriately for applicable assays - avoiding the "na" that was introduced by this comment?
My first bullet point would be the reason behind why I think observation_type or unit_of_observation could lead to confusion for capturing cell vs nuclei metadata. For ease of data integration, I think the field should be present in all datasets. After thinking about it some more, I am thinking the options are:
- observation_type or unit_of_observation as the field name, and have the values as "cell", "nucleus", or "spatial unit". The "na" was only applicable when I had thought the field was capturing the type of suspension in the assay. I still think it is too difficult to capture any specific unit for spatial data.
- suspension_type and have the values as "cell", "nucleus" or "na"
- biological_input_type and have values as "cell", "nucleus" or "tissue section"

I am leaning towards the concept of biological_input_type
@ambrosejcarr @jychien - what would be the difference between a Biospecimen and a biological input type?
There is also a Tissue Section available in the reference that I shared above. (CCF has a tissue section).
Good question. A Biospecimen would be describing the type of tissue taken as sample collection. Biospecimen would be more related to https://github.com/chanzuckerberg/single-cell-curation/issues/240. biological_input_type, as I had interpreted it, is describing the entity that went into library construction and subsequent sequencing. Maybe there is something less ambiguous for the field name? library_construction_input_type? Seems really long and wordy, though. For values, it may be more accurate to use cell suspension, nuclei suspension, or tissue section.
Reads:
Human biospecimens are biological materials that are obtained from living or deceased human subjects. Biospecimens are commonly also referred to as biological specimens, biological samples, biosamples or samples. All of these terms are used interchangeably.
OK then.
Would a Single Nucleus Specimen, defined as "A biospecimen that contains the contents of a single nucleus.", be the result of some dissociation (or isolation) protocol and subsequently be the input for the library construction? Otherwise, I'm trying to understand the context for the definition of the term. It seems quite close to the entity definition in the guidelines above.
Also, does the DCP schema model this as cell_suspension + disassociation_protocol?
I don't believe the DCP models the cell/nucleus suspension at all. Even single-nucleus suspensions are captured as 'cell_suspension' objects.
For this cellxgene field, I start with what users want - they just want to know cell or nucleus. Some protocols don't fit into those 2 terms, and users don't need any additional information for those, so add na and there's your enum - cell, nucleus, or na.
A property with a tissue section value doesn't really make sense because it isn't useful information. Also, it isn't a parallel term, as the section is the whole dataset, while the cell/nucleus is each observation in the dataset.
So with an enum of cell, nucleus, na, the property names with "observation" or "input_type" don't really make sense because then na is...well, it's not applicable because the spatial assays do have observations and inputs.
The suspension/dissociation/isolation terms are more of what we're capturing. I like "suspension" more because it's focused on the entity, rather than the process/action/protocol.
suspension_type wasn't descriptive enough, but I think no matter what we call it, it won't be descriptive enough for someone to understand what's captured just from the property name. On the flip side, everyone will immediately understand what's being captured when they see the enum no matter what we call it (even bobs_your_uncle).
Looks like DCP schema has "single cell" vs "single nucleus" information in their library protocol schema.
Thanks for the pointer @jychien. It's also exposed in their filter under the same name:
@hthomas-czi and I agree that it makes sense to adopt the Lattice model with the addition of "na" rather than creating another variation:
"suspension_type": {
"title": "Suspension type",
"description": "The type of suspension: cell or nucleus.",
"type": "string",
"enum": [
"cell",
"nucleus"
]
},
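To make the proposed change concrete, the following minimal sketch checks a value against the Lattice enum extended with "na", as suggested above. The constant and function names are hypothetical and are not part of cellxgene-schema or the Lattice schema tooling.

```python
# The enum mirrors the Lattice fragment above plus the proposed "na".
# ALLOWED_SUSPENSION_TYPES and check_suspension_type are hypothetical names.
ALLOWED_SUSPENSION_TYPES = ("cell", "nucleus", "na")


def check_suspension_type(value):
    """Return a list of error strings (empty when the value is acceptable)."""
    if not isinstance(value, str):
        return [f"suspension_type must be a string, got {type(value).__name__}"]
    if value not in ALLOWED_SUSPENSION_TYPES:
        return [
            f"suspension_type {value!r} is not one of {ALLOWED_SUSPENSION_TYPES}"
        ]
    return []


print(check_suspension_type("nucleus"))  # []
print(len(check_suspension_type("nuclei")))  # 1
```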
I've created a table of our existing assays and started to assign values to assess dependencies. @jahilton @jychien - could you review for accuracy and also help me complete or extend the table?
Values:
Assay | Value(s) | Notes |
---|---|---|
10x 3' transcription profiling | cell or nucleus | Will be addressed as 10x transcription profiling and its children |
10x 3' v1 | cell or nucleus | Will be addressed as 10x transcription profiling and its children |
10x 3' v2 | cell or nucleus | Will be addressed as 10x transcription profiling and its children |
10x 3' v3 | cell or nucleus | Will be addressed as 10x transcription profiling and its children |
10x 5' transcription profiling | cell or nucleus | Will be addressed as 10x transcription profiling and its children |
10x 5' v1 | cell or nucleus | Will be addressed as 10x transcription profiling and its children |
10x 5' v2 | cell or nucleus | Will be addressed as 10x transcription profiling and its children |
10x scATAC-seq | nucleus | child of scATAC-seq |
10x technology | cell or nucleus or na? | |
CEL-seq2 | cell or nucleus? | |
DroNc-seq | nucleus | |
Drop-seq | cell | |
MERFISH | na | Will be addressed as smFISH and its children |
Patch-seq | cell | |
Seq-Well | cell | |
Slide-seq | na | Will be addressed as spatial transcriptomics by high-throughput sequencing and its children |
Smart-seq | cell or nucleus | Will be addressed as Smart-like and its children |
Smart-seq2 | cell or nucleus | Will be addressed as Smart-like and its children |
Visium Spatial Gene Expression | na | Will be addressed as spatial transcriptomics by high-throughput sequencing and its children |
microwell-seq | cell | |
scATAC-seq | nucleus | |
sci-RNA-seq | cell or nucleus | |
snmC-seq | nucleus | |
Others from the comments above:
Assay | Value(s) | Notes |
---|---|---|
CITE-seq | cell | and all its children |
sci-Plex | nucleus | |
snmC-seq2 | nucleus | |
Did a pass.
10x scATAC-seq and scATAC-seq could be addressed by ATAC-seq [EFO:0007045] and its children.
Searched for CEL-seq2: it certainly can be "cell", but I found no definitive information about single-nucleus potential, so I'm guessing it's "cell or nucleus".
@jychien can you review?
Agree with @jahilton that Cel-seq2 could potentially be adapted for nuclei. The journal mentions that CEL-Seq2 is compatible with different platforms, keeping it open to the potential of implementing CEL-Seq2 with nuclei. The rest looks good to me.
And FYI, as for extending the validation table, I am working on a few datasets with assays yet to be in efo. Phenocycler/CODEX has been added, but will not appear until next efo release (https://github.com/EBISPOT/efo/issues/1630). In the interim, I am using 'protein assay'. Unclear as to when this collection will be published. We can just have validation skip these terms.
@jychien - It looks like EFO releases on a monthly cadence. Will this appear in the July release? We'll probably want to wait until the last moment possible to update the pinned ontologies.
@jychien @jahilton - would it be reasonable to enforce "cell" for CEL-Seq2 for 3.0.0 and revisit when needed?
I don't see any benefit to that approach. At worst, it confuses someone with single-nucleus CEL-Seq2 data. Any reason to not allow cell or nucleus?
I was responding to Jenny's "could potentially":
> Agree with @jahilton that Cel-seq2 could potentially be adapted for nuclei. The journal mentions that CEL-Seq2 is compatible with different platforms, keeping it open to the potential of implementing CEL-Seq2 with nuclei.
which sounded like "not yet but maybe some day".
And if "some day" arrived, then we could update the schema to celebrate.
@jahilton @jychien
- It appears that 10x technology can be set to any value - cell, nucleus, or na, since it includes ATAC and Visium Spatial?
- I guess that also begs the question - under what circumstances is this assay being used rather than a more accurate term?
- For CEL-seq2, if we start with "cell" based on current best knowledge, what will it take to allow submission if someone hands us single-nucleus CEL-seq2 data?
> It appears that 10x technology can be set to any value - cell, nucleus, or na, since it includes ATAC and Visium Spatial?
Order of logic matters here. If you first pull out the descendants of ATAC (as nucleus) and descendants of spatial (as na), then the remaining 10x technology descendants would be cell or nucleus
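The ordering point can be sketched as a first-match rule list. This is a minimal sketch under stated assumptions: the `PARENTS` hierarchy is a toy stand-in keyed by labels, not the real EFO graph, and the rule list and function names are hypothetical.

```python
# Toy is-a hierarchy, NOT the real EFO graph; labels stand in for term IDs.
PARENTS = {
    "10x scATAC-seq": ["scATAC-seq", "10x technology"],
    "scATAC-seq": ["ATAC-seq"],
    "Visium Spatial Gene Expression": [
        "spatial transcriptomics by high-throughput sequencing",
        "10x technology",
    ],
    "10x 3' v3": ["10x technology"],
}

# Ordered rules: the specific constraints come first, the 10x catch-all last.
RULES = [
    ("ATAC-seq", {"nucleus"}),
    ("spatial transcriptomics by high-throughput sequencing", {"na"}),
    ("10x technology", {"cell", "nucleus"}),
]


def ancestors_and_self(term):
    """Collect the term and every ancestor reachable through PARENTS."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def allowed_suspension_types(term):
    """Return the allowed values from the first matching rule, else None."""
    lineage = ancestors_and_self(term)
    for root, allowed in RULES:
        if root in lineage:
            return allowed
    return None


print(allowed_suspension_types("10x scATAC-seq"))  # {'nucleus'}
```

Because 10x scATAC-seq descends from both ATAC-seq and 10x technology in this toy graph, swapping the rule order would wrongly return cell or nucleus, which is the reason the order matters.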
> under what circumstances is this assay being used rather than a more accurate term?
Current cases are 10x multiome (assay isn't in pinned ontology so they will be updated) and RNA-seq data (via integration datasets) that are known to be 10x but unknown 3' or 5' or which kit version, and I imagine we'd push-back/reject these in current times ...we should review those as I am guessing we can narrow them to at least 10x transcription profiling
Here's the first draft for the schema section.
In theory, the cellxgene-schema CLI could automate annotation for cases which accept a single value such as "cell"; however, this would be unlike the other fields where all annotation is assigned to either the curator or the portal. Foolish consistency?
Key | suspension_type |
---|---|
Annotator | Curator |
Value | categorical with str categories. This MUST be "cell", "nucleus", or "na". This MUST be the correct type for the corresponding assay. |
We should also add the new assays that we have been waiting for and which will appear in the update to the pinned EFO ontology.
To help with evolution of the validation, the validator should warn when it does not find a match in the table. Then we can create a tracking issue to add the assay to the table in the next version.
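A minimal sketch of that warn-on-miss behavior follows. The table excerpt uses values from this thread, while `ASSAY_TABLE`, `validate_suspension_type`, and the choice to skip the check on a miss are assumptions for illustration, not the cellxgene-schema implementation.

```python
import warnings

# Hypothetical excerpt of the dependency table; values follow the thread.
ASSAY_TABLE = {
    "Drop-seq": {"cell"},
    "DroNc-seq": {"nucleus"},
    "sci-RNA-seq": {"cell", "nucleus"},
}


def validate_suspension_type(assay, suspension_type):
    """Check the value against the table; warn and skip when no rule exists."""
    allowed = ASSAY_TABLE.get(assay)
    if allowed is None:
        # No rule yet: warn so that a tracking issue can be opened.
        warnings.warn(
            f"no suspension_type rule for assay {assay!r}; "
            "skipping the dependency check"
        )
        return True
    return suspension_type in allowed


print(validate_suspension_type("Drop-seq", "nucleus"))  # False
```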
New assays that we'll add with the ontology bump (via #205 )
na
The CLI idea seems like more work than what it's worth
If we did go that route, then cases like CEL-seq2 should certainly start with the lenient (cell or nucleus) option. Otherwise, we leave it open to a curator not catching when someone does adapt it for nucleus.
> The CLI idea seems like more work than what it's worth
Noted.
add a rule for EFO:0800000 'spatial proteomics' (incoming term for the next release) & its descendants --> na
Updated the table above. Fingers crossed that EFO releases before we update the pinned ontologies.
Researchers need to be aware of which data are single-cell vs single-nucleus so that they can take special care to ensure proper integration. Most contributors don't provide it automatically because it usually isn't a variable in their study (they either have all sc or all sn). However, when a study does include sn + sc data, it is of significant discussion... https://doi.org/10.1038/s41598-020-58327-6 https://doi.org/10.1371/journal.pone.0209648 https://doi.org/10.1002/hep4.1854
This was also supported during a recent data integration workshop. Data integrators were given datasets in schema 2.0.0 and one of the first tasks they did was to curate each dataset as single-cell or single-nucleus.
Proposal
Lattice submits data as obs.suspension_type with values either cell or nucleus.
Spatial data will need to be considered for when multiple cells are likely captured (spot is common terminology for Visium - ref, while bead is common for Slide-seq - ref)