chanzuckerberg / single-cell-curation

Code and documentation for the curation of cellxgene datasets
MIT License

Add "author contributed cell type label" #414

Closed BAevermann closed 12 months ago

BAevermann commented 1 year ago

Discuss/investigate the feasibility and user desire for a new schema field that captures the "author contributed cell type labels":

obs (Cell Metadata)

Key: ? (Author Contributed Cell Type)
Value: str
Status: Required
Definition: A free-text field capturing the authors' cell type labels/annotation. Note that these terms would not be validated or standardized.
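A minimal sketch of what a check for such a field might look like, assuming cell metadata is available as a plain mapping from column name to per-cell values. The key name `author_cell_type` is a placeholder (the proposal leaves the key undecided), and `check_author_labels` is a hypothetical helper, not part of any existing validator.

```python
# Placeholder key name: the proposal above leaves the actual key as "?".
AUTHOR_LABEL_KEY = "author_cell_type"


def check_author_labels(obs_columns):
    """Return a list of error strings for the proposed free-text field.

    obs_columns maps a column name to its list of per-cell values.
    """
    if AUTHOR_LABEL_KEY not in obs_columns:
        # Status: Required
        return [f"missing required obs column '{AUTHOR_LABEL_KEY}'"]
    errors = []
    for i, value in enumerate(obs_columns[AUTHOR_LABEL_KEY]):
        # Value: str. No ontology lookup is done, since the terms
        # would not be validated or standardized.
        if not isinstance(value, str):
            errors.append(f"row {i}: expected str, got {type(value).__name__}")
    return errors


print(check_author_labels({"author_cell_type": ["L2/3 IT", "T cell"]}))
```

This sketch accepts any string, including the empty string; whether an empty value should be allowed is a separate design decision.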

Background

Authors usually provide cell type labels that contain:

  1. New cell types not in Cell Ontology
  2. Cell states
  3. Parent nodes information (levels of hierarchy)

This investigation will begin with considering the most granular, leaf node, labels for inclusion in the schema (1 & 2 above).

Users and Use Cases

Data consumers have requested the ability to access the future "Cell Census" by filtering on a Cell Ontology term and obtaining all the author contributed cell type labels, in order to perform downstream analyses. For instance, David Osumi-Sutherland and Tiago Lubiana would use these mappings to review opportunities for new CL terms, and Evan Biederstedt at CAP would use them in a similar way to improve their annotation capture tooling. Lastly, Katy Boerner would like to use these labels for data modeling/analysis.
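The lookup described in this use case could be sketched roughly as follows. This is an illustration only: the rows are made up, `author_cell_type` is a placeholder key, and `cell_type_ontology_term_id` is assumed to be the column that holds the CL term for each cell.

```python
from collections import defaultdict


def author_labels_by_cl_term(cells,
                             cl_key="cell_type_ontology_term_id",
                             author_key="author_cell_type"):
    """Map each Cell Ontology term ID to the sorted author labels seen with it."""
    labels = defaultdict(set)
    for cell in cells:
        labels[cell[cl_key]].add(cell[author_key])
    return {term: sorted(found) for term, found in labels.items()}


# Made-up rows standing in for per-cell metadata.
cells = [
    {"cell_type_ontology_term_id": "CL:0000084", "author_cell_type": "T_naive"},
    {"cell_type_ontology_term_id": "CL:0000084", "author_cell_type": "T_activated"},
    {"cell_type_ontology_term_id": "CL:0000540", "author_cell_type": "Exc_L2/3"},
]
mapping = author_labels_by_cl_term(cells)
print(mapping["CL:0000084"])  # ['T_activated', 'T_naive']
```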


@brianraymor writes:

I suspect that we could simply reference the author defined category field for their cell types in uns and not use obs for this mapping. It's important that we do not blur the Author Categories and the Standard Categories in Explorer.

See related conversation in #cellxgene-users.

ambrosejcarr commented 1 year ago

A few questions for alignment.

Definition: A free-text field capturing the authors' cell type labels/annotation. Note that these terms would not be validated or standardized.

There will be cases when the most specific provided labels are the CL labels we request, and that should not create a failure mode (or duplicate information). Am I correct that "" or a similar "null" value would be permitted?

Key: ? (Author Contributed Cell Type)

Reviewing the use cases, it sounds like different information (type, state) is desired by different users. Would it be more accurate to name this field "Auxiliary cell information" and provide some guidelines about how authors could populate it?

BAevermann commented 1 year ago

Great questions.

What value goes in this field when the CL term is the same as the author label is an interesting question. Typically, the author labels are what the authors use in their lab/manuscript to most comprehensively describe their biological findings. I would hope that if the CL term is the most specific, the author would adopt that term. As such, my default was to expect the CL term to show up as the author label, which would indicate that there are no states or more specific types to consider. I can see how this would lead to a degree of redundancy between the "cell type" field and the "Author label" field, and that "null" or "" would be cleaner from a databasing perspective; however, this would clash with data submitters' use of these labels as their complete annotation. Lastly, as we won't be validating, by recommending the use of "null" or "" we will inevitably end up with some submitters who include the CL and some that don't, which could be confusing to the final data consumers.

I am open to any suggestions for the key label. I like that "Auxiliary cell information" does explicitly refer to cell types, but I am also hoping for the key to indicate that these are the authors'/experts' original preferred annotation. Perhaps "Author cell information"?

ambrosejcarr commented 1 year ago

I like that "Auxiliary cell information" does explicitly refer to cell types, but I am also hoping for the key to indicate that these are the authors'/experts' original preferred annotation. Perhaps "Author cell information"?

My intention was to make the label less specific so that it could capture both type and state. "Author cell information" is good; "Auxiliary" was meant to indicate that it's a secondary field to the schema's cell type field, to minimize:

Lastly, as we won't be validating, [...] we will inevitably end up with some submitters who include the CL and some that don't.

That said, I'm not attached to "auxiliary". Perhaps there are labels that capture both concepts?

jahilton commented 1 year ago

Something to consider during exploration: how spread out is "author cell information" across the cell metadata? Are cell type & state in different fields? Are there multiple cell types (e.g. general, class, subclass)? And would the proposed addition request that all of those be collapsed into one? Or would it provide guidance for submitting multiple levels of resolution (e.g. cell_info_level0 is required, but submitters can also provide cell_info_level1 etc.)?

BAevermann commented 1 year ago

Of the ~25 collections I looked at, I did not see states being separated from "author_cell_type" labels.

There are a lot of datasets with various levels of hierarchy represented in the author contributed labels (anno_l1, anno_l2, etc.). DOS suggested possibly concatenating them to capture all the author information. For about half the datasets per tissue investigated, the author labels demonstrated no value beyond the CL terms already provided. Those that did have additional labels were often abbreviated and not immediately interpretable (at least to a non-tissue expert).
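As a rough illustration of the concatenation idea, assuming each cell's metadata is a plain dict and the hierarchy columns are known in coarse-to-fine order. The `anno_l3` column, the separator, and the `concat_levels` helper are all made up for this sketch.

```python
def concat_levels(row, level_keys, sep=" | "):
    """Join the non-empty hierarchy labels for one cell, coarse to fine."""
    parts = [row[key] for key in level_keys if row.get(key)]
    return sep.join(parts)


# Made-up cell with three hierarchy levels, the finest one left empty.
row = {"anno_l1": "Neuron", "anno_l2": "Excitatory", "anno_l3": ""}
print(concat_levels(row, ["anno_l1", "anno_l2", "anno_l3"]))  # Neuron | Excitatory
```

One trade-off of this approach is that the separator must not occur inside any author label, otherwise the levels cannot be split apart again.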

jahilton commented 1 year ago

For about half the datasets per tissue investigated, the author labels demonstrated no value beyond the CL terms already provided. Those that did have additional labels were often abbreviated and not immediately interpretable (at least to a non-tissue expert).

One thought is that even when they're not useful above the standardized cell_type, users may be looking for them regardless. So it may be worthwhile to find some solution that delivers them to a consistent location for all datasets (top of "Author categories"?).

BAevermann commented 1 year ago

Very true. It's a common request when I am presenting posters, etc. So I agree that users have an assumption that "author contributed labels" exist.

One issue I have had was that my assumption about these labels was wrong. I assumed that these labels would correspond to the embeds, e.g., that they would be closely correlated to the clustering solution. In papers, there is often the parlance of "Cluster_#_Celltype_Gene" and I'd assumed that those rough annotations would be provided. Sadly, I have not seen that "science in action" type of annotation.

dosumis commented 1 year ago

My preferred solution would be to flag user contributed cell type fields while keeping the original field names. This can be done with a metaschema - a small blob of standardised JSON in obs or uns. A metaschema like this in obs can be used to record evidence and provenance for annotation - e.g. recording the algorithm and reference data used in annotation projection. I am working on a proposal (& accompanying supporting Python lib) for this. I can share the proposal shortly.
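To make the idea concrete, here is a hedged sketch of a JSON blob that flags existing columns without renaming them. Every JSON key (`author_cell_type_fields`, `obs_key`, `kind`, `rank`), the `metaschema` slot name, and the helper function are invented for illustration and are not taken from the forthcoming proposal.

```python
import json

# Hypothetical metaschema: flags which existing obs columns hold
# author cell type labels, and orders them from coarse to fine.
metaschema_json = json.dumps({
    "author_cell_type_fields": [
        {"obs_key": "subclass", "kind": "cell_type", "rank": 1},
        {"obs_key": "class", "kind": "cell_type", "rank": 0},
        {"obs_key": "cluster", "kind": "cell_type", "rank": 2},
    ]
})


def flagged_obs_keys(uns, obs_keys):
    """Return the flagged obs column names, ordered from coarse to fine."""
    entries = json.loads(uns["metaschema"])["author_cell_type_fields"]
    ordered = [e["obs_key"] for e in sorted(entries, key=lambda e: e["rank"])]
    missing = [key for key in ordered if key not in obs_keys]
    if missing:
        raise KeyError(f"metaschema references unknown obs columns: {missing}")
    return ordered


uns = {"metaschema": metaschema_json}
print(flagged_obs_keys(uns, {"class", "subclass", "cluster", "donor_id"}))
# ['class', 'subclass', 'cluster']
```

Because the original column names stay untouched, extra per-field details such as evidence or provenance could be added as further keys on each entry without changing the column layout.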

jahilton commented 1 year ago

@dosumis can you explain the value of maintaining the original field names?

dosumis commented 1 year ago

@dosumis can you explain the value of maintaining the original field names?

1. There may be many fields with cell type/state info in free text, with different granularities, sometimes specialised for abbreviations, sometimes for full names. e.g.

https://cellxgene.cziscience.com/e/0b75c598-0893-4216-afe8-5414cab7739d.cxg/

Given that, how would you extend your standard to support an indeterminate number of additional free text fields? This could potentially be done with some naming convention, but this feels messy to me & there will always be a temptation to store even more info in key strings by convention. Would you include conventions for recording extra info (state vs type, relative granularity?). This can easily and cleanly be achieved in a metaschema if desired. Would any such convention end up being more of a burden on submitters than flagging fields?

2. Author/community preference

a. Field names often refer directly to information in referenced papers and are understood by a broader sub-community, so keeping them makes it easier to work back and forth between the paper and the dataset. E.g. Brain Initiative often use standard names for granularity levels (class; subclass; cluster), reflecting this in the key names of submissions + accompanying papers.

b. It is my understanding that most matrices have already been annotated by users once they reach CxG. Keeping key names makes it easy for authors and any community around them that has access to presubmission datasets and analysis to relate what's on CxG to their own datasets and analyses.

Given a. & b., I think it's likely that most authors and many users of CxG would prefer to keep those names.

3. Recording evidence for cell type assignment will require a metaschema anyway

brianraymor commented 12 months ago

Per November 27 triage, this design is not being pursued further. See @BAevermann's earlier analysis - https://github.com/chanzuckerberg/single-cell-curation/issues/414#issuecomment-1515559005.