HumanCellAtlas / ingest-central

Ingest Central is the hub repository for the ingest service
Apache License 2.0

Selected Cell Type contains unexpected information #609

Open ambrosejcarr opened 5 years ago

ambrosejcarr commented 5 years ago

The "selected cell type" metadata field that appears in the browser does not contain the information I would expect it to, or the short description for the following study is wrong.

This study is an immune cell study: https://data.humancellatlas.org/explore/projects/8c3c290d-dfff-4553-8868-54ce45f4ba7f

The study clearly states that it sorts on CD45 and separately sequences the positive and negative populations, suggesting that it has done two selections: CD45+ and CD45-.

Yet the "selected cell type" column contains a series of immune cell types that, if I believe the study description, the authors presumably found after sequencing. The cell types in this field would more correctly be labelled identified cell types.

Corollary: I would expect to see a number of "None" fields where authors did not perform selection prior to experimentation.

theathorn commented 4 years ago

This should be an Ingest issue - Data Browser is just reflecting what's in the metadata from cell_suspension.json: https://data.humancellatlas.org/metadata/dictionary/biomaterial/cell_suspension#cell_suspension-selected_cell_types

lauraclarke commented 4 years ago

@HumanCellAtlas/wranglers @HumanCellAtlas/data-ops not sure where it is best to track this issue, it isn't ingest-central as it isn't an ingest software issue but a data content issue

zperova commented 4 years ago

@ambrosejcarr Extended Data Figure 1 in the linked publication contains information on the gating strategy. You will see that sorting for CD45+/CD45- is just one part of the experiment; the second part has a much more extended gating strategy, which is correctly depicted in the metadata. The description of the study is provided by the contributors and I don't know why they decided to mention only the CD45+/- sorting. As @lauraclarke said, there needs to be a decision on where to keep tickets related to data quality. From the viewpoint of the metadata (displaying the selected type rather than the identified type) or its display in the Data Browser, there is no issue.

ambrosejcarr commented 4 years ago

I think I understand the argument that this is not a metadata issue. I don't understand the intricacies of the tooltip and header construction but I have a few comments (possibly for the service this gets moved to). :)

  1. Is the "selected cell" a mandatory field in the metadata? If so, we should replace "unspecified" with "no selection" or "none". Generally, we should be confident about making negative statements if our data wrangling contract supports those assertions.
  2. Selected Cell Type -> Selected Cell Type(s); make it clear there can be multiple in the header
  3. Selection: For the mouse neuron study it's much more important to me to know that cells were selected with NeuN (missing) than to know that the author intended to select for neurons (provided). The latter is open to interpretation and opinion, a marker selection is not. The melanoma dataset linked above is a nice compromise, having both gating scheme and marker.
  4. Absent author-provided cell type annotations, I'm a bit uncomfortable with publishing fine typing information. I know this isn't actionable; I don't have a good solution to suggest.

ESapenaVentura commented 4 years ago

@ambrosejcarr, about your points:

  1. The field is not mandatory, so it can be empty.
  2. The field name is already "types" in the metadata schema. Are you referring to the column displayed in the browser?
  3. We have a markers field in the enrichment protocol schema. I am not too familiar with this dataset but if those cells were selected for NeuN and it is important, it should probably be reflected there.

Comments 1 and 2 seem like something that the wranglers can do nothing about.

About comment 3, I am not entirely sure of this, but I think this field was intended more as a general field for recording the expected type of cells in the suspension prior to sequencing (e.g. there might be no enrichment, but the cell suspension was obtained from a chunk of muscle tissue, so you'd expect muscle fibroblasts).

I hope I have understood the points and that this helps shed some light on this issue.

ambrosejcarr commented 4 years ago

Thanks for the quick response @ESapenaVentura.

> 2. The field name is already "types" in the metadata schema. Are you referring to the column displayed in the browser?

Yes, the browser display.

> 3. We have a markers field in the enrichment protocol schema. I am not too familiar with this dataset but if those cells were selected for NeuN and it is important, it should probably be reflected there.

That's great that we capture this information. As a scientist, when I look at "selected cell type", I see "this is what the authors think is in their data". When I see "selected markers", I know what's in their data. I think the latter enables more precise reasoning about the data, and I'm uncomfortable exposing the former, particularly given the lack of any links out to definitions of what "selected cell type" means.

To use a contrived example, imagine I submitted data and stated that my selected cell type was "Ambrose Cells". This is presumably not very useful to anyone but me and maybe my collaborators. But if you tell the world that's just an EPCAM+ cell, suddenly they can parse my incoherent naming conventions.

> Comments 1 and 2 seem like something that the wranglers can do nothing about.

Great, no sweat -- I'm trying to look at the DCP as a whole from the perspective of a scientist -- I don't mean to imply work needs to be done by any component. I'm just trying to identify opportunities for improvement, and I trust the teams to find the right home for these observations. If you'd like me to separate things or open new tickets, I'm open to that.

> I hope I have understood the points and that this helps shed some light on this issue.

Yep, definitely. Appreciate your time!

mshadbolt commented 4 years ago

I would like to give our contributors a bit more credit that they wouldn't just make up cell types. Also this field is ontologised so that the names in this field will always be coherent and well defined.

I agree markers can be considered more accurate and are great for scientists who are familiar with what markers identify a cell. But I am also sure that there are a lot of scientists that would be more familiar with cell type names than trying to work out a cell type from a list of cell surface markers.

So I see your point about potential for inaccuracy but also see the benefit of giving consumers a quick and easy to interpret idea of the kind of cells that are in a dataset. Perhaps we just need a better definition for what this field is actually trying to capture.

ambrosejcarr commented 4 years ago

These are great points. I have some specific comments:

> I would like to give our contributors a bit more credit that they wouldn't just make up cell types.

Indeed, my example was contrived. I'll never be one to disparage our scientific collaborators, and I appreciate the tone of your response. However, discovering new cell types is the goal of the atlas, so I disagree with this assertion, both for that reason and because there are disagreements on how to define some types. Sorry for the original example; in hindsight, I suspect I would have responded the same way you did.

To provide a more concrete example, there was a discussion at a recent conference about how to differentiate an exhausted T cell (not in the ontology!) from an activated T cell (very well specified, ontology includes markers!), with two very prominent and respected scientists taking very different and well reasoned positions. It turns out that these measure very similar markers, but there are some differences that could help scientists understand why the argument exists.

> Also this field is ontologised so that the names in this field will always be coherent and well defined.

This is great! However, I would argue that users need to know (1) that this field is ontologised, and (2) how each cell type in the ontology is defined. I can't always find that data from the cell ontology, so there might even be an opportunity to use the correspondence between the sorted fields and expected cell types to refine the information present in the cell ontology.

> So I see your point about potential for inaccuracy but also see the benefit of giving consumers a quick and easy to interpret idea of the kind of cells that are in a dataset. Perhaps we just need a better definition for what this field is actually trying to capture.

Yep, that's a great point. Perhaps a balance could be to explore presenting both fields. I would also suggest thinking about the pros and cons of linking from the ontology term to its definition, or making it clearer how one can find that information. I'm not looking to dictate specific outcomes, just highlighting that I find it difficult to understand which projects I'd want to download based on the current presentation.
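
To make the "present both fields" idea concrete, here is a rough sketch of how a display layer might combine the two pieces of metadata. It is only an illustration under assumptions: the function name `summarize_selection` is hypothetical, and the shapes assumed here (a `selected_cell_types` list of objects with `text` and `ontology_label` keys on the cell suspension, and a free-text `markers` string on the enrichment protocol) are my reading of the schema discussed above, not a confirmed description of it or of how the Data Browser works.

```python
from typing import Optional


def summarize_selection(cell_suspension: dict,
                        enrichment_protocol: Optional[dict] = None) -> dict:
    """Build display strings for selected cell types and selection markers.

    Assumed (hypothetical) input shapes:
      cell_suspension["selected_cell_types"]: list of dicts with optional
          "text" and "ontology_label" keys; the field itself is optional.
      enrichment_protocol["markers"]: free-text string, also optional.
    """
    # The field is not mandatory, so treat a missing key like an empty list.
    cell_types = cell_suspension.get("selected_cell_types") or []

    # Prefer the ontology label; fall back to the contributor's own text.
    labels = [t.get("ontology_label") or t.get("text") or "" for t in cell_types]
    labels = [label for label in labels if label]

    markers = (enrichment_protocol or {}).get("markers")

    return {
        # "unspecified" rather than "none": an empty field does not prove
        # that no selection was performed.
        "selected_cell_types": "; ".join(labels) if labels else "unspecified",
        "selection_markers": markers if markers else "unspecified",
    }


# Illustrative values only, not taken from any real project.
suspension = {"selected_cell_types": [{"text": "neuron", "ontology_label": "neuron"}]}
protocol = {"markers": "NeuN+"}
print(summarize_selection(suspension, protocol))
```

The sketch deliberately keeps "unspecified" for empty values; switching to "none" would only be safe if the wrangling contract guaranteed that an empty field means no selection was performed.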