chanzuckerberg / single-cell-curation

Code and documentation for the curation of cellxgene datasets
MIT License

Handling of multi organism datasets #450

Closed brianraymor closed 10 months ago

brianraymor commented 1 year ago

@Yanay1 commented on Wed May 10 2023

Description

There are a few datasets that have multiple organisms in them, and this may increase over time.

Context

Here are two examples of datasets with multiple organisms in them:

1: https://cellxgene.cziscience.com/collections/367d95c0-0eb0-4dae-8276-9407239421ee

2: https://cellxgene.cziscience.com/collections/0a77d4c0-d5d0-40f0-aa1a-5e1429bcbd7e

In the first example, the species are mixed together in the anndata and it is unclear how the gene subset was determined.

In the second example, the species are separate and downstream users can do any kind of multi-species integration they want using these anndatas, rather than having to find the raw, un-joined data.

Impact

Each anndata should have one species only. This would improve the ability to do multi-species analysis using the census.

Alternatives you've considered

When the species are joined and some kind of common name (or ortholog) gene subset is used, there is no way to reverse that. The only alternative is to go to the original paper's data source.

Ideal behavior

Each anndata should have one species only.
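As a rough illustration of the ideal behavior, the sketch below groups cell positions by organism so that each group could be written out as its own anndata. It assumes the per-cell organism labels have already been read into a plain list (for example from an obs column such as `organism_ontology_term_id`, which is an assumption about where the label lives). The function name `partition_cells_by_organism` is hypothetical and not part of any CELLxGENE tooling.

```python
from collections import defaultdict


def partition_cells_by_organism(organism_per_cell):
    """Map each organism term ID to the positions of its cells."""
    groups = defaultdict(list)
    for index, organism in enumerate(organism_per_cell):
        groups[organism].append(index)
    return dict(groups)


# Hypothetical labels for three cells: two human, one mouse.
labels = ["NCBITaxon:9606", "NCBITaxon:10090", "NCBITaxon:9606"]
groups = partition_cells_by_organism(labels)

# More than one key means the dataset mixes organisms.
is_multi_organism = len(groups) > 1
```

Each list of positions could then be used to subset the original matrix into one single-species object per organism. This does not recover genes that were dropped when the species were joined.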


@pablo-gar commented on Sun May 14 2023

Hi @Yanay1

This relates to the data schema we have for the source h5ads.

What you are seeing is a direct consequence of a certain level of flexibility we have in the schema. We allow data contributors to submit datasets in any form that fits their publication needs so long as:

This has led to the following cases:

We try to stay away from datasets for which the gene space and the cells are from different species, and therefore we discourage submission of multi-species datasets. But since we don't enforce this, we do have a few of these cases.

I'm tagging @brianraymor since he is the owner of our source h5ad schema.


Please note that Census data (not the source h5ads) does enforce that only cells that have a matching species in their genes are included, so your issue does not extend to the Census data itself.
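The matching rule described above can be sketched roughly as follows. This is not the Census implementation, only a minimal illustration under stated assumptions: each cell carries an organism term ID, each gene carries the organism of its annotation (for example a var column such as `feature_reference`), and a dataset whose genes come from more than one organism keeps no cells. The function name `cells_matching_gene_species` is hypothetical.

```python
def cells_matching_gene_species(cell_organisms, gene_organisms):
    """Return positions of cells whose organism equals the gene annotation's organism."""
    gene_species = set(gene_organisms)
    if len(gene_species) != 1:
        # Assumption for this sketch: an empty or mixed gene space is ambiguous, so keep nothing.
        return []
    (species,) = gene_species
    return [i for i, organism in enumerate(cell_organisms) if organism == species]


# Hypothetical dataset: one human cell and one pig cell, both described with human genes.
cells = ["NCBITaxon:9606", "NCBITaxon:9823"]  # 9823 is Sus scrofa
genes = ["NCBITaxon:9606", "NCBITaxon:9606"]
kept = cells_matching_gene_species(cells, genes)
```

Here only the human cell is kept, because the pig cell's species does not match the species of the gene annotation.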

jahilton commented 1 year ago

it is unclear how the gene subset was determined

This is actually true for all CELLxGENE datasets. Our contributing guidelines include "preference is that gene have not been filtered in order to maximize future data integration efforts" and the schema states "genes SHOULD NOT be filtered from either dataset". But we do not track any provenance of the gene set.

Given that we only accept human or mouse genes, I think the ask here is that if a contributor wishes to submit mouse and human cells in one Dataset, then we should STRONGLY RECOMMEND that the Collection include at least one Dataset with the unfiltered human gene set and one Dataset with the unfiltered mouse gene set, in order to enable reuse of either.

BAevermann commented 1 year ago

How do we handle the non-human or mouse data? We cannot accept the raw datasets either, because we do not support their annotation or because the genes have been prefiltered and lifted to human/mouse annotation. As a consequence, they continue to haunt the dataset they live in ...

pablo-gar commented 1 year ago

How do we handle the non-human or mouse data?

We take them in with human genes, see for example the Sus scrofa domesticus dataset here:

https://cellxgene.cziscience.com/collections/0a77d4c0-d5d0-40f0-aa1a-5e1429bcbd7e

brianraymor commented 1 year ago

How do we handle the non-human or mouse data?

Per the schema requirements:

General Requirements

Organisms. Data MUST be from a Metazoan organism or SARS-COV-2 and defined in the NCBI organismal classification. For data that is neither Human, Mouse, nor SARS-COV-2, features MUST be translated into orthologous genes from the pinned Human and Mouse gene annotations.

brianraymor commented 10 months ago

Per November 27 triage, consensus to close due to potential impact on contributors.