chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

Which columns should be included in var? #1301

Open ivirshup opened 1 day ago

ivirshup commented 1 day ago

Which columns should be made accessible in the var dataframe in census?

Right now, census only has a subset of the columns specified in the cellxgene schema. In census, there are:

Meanwhile, the cellxgene schema for var has a few columns that I think would make sense to include in census but that currently aren't included. These columns are:

Is there a reason that we aren't including these already? And should we go ahead and start including them now?
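
For anyone who wants to check the gap themselves, here is a minimal sketch of the comparison in plain Python. The function name and the column lists are placeholders for illustration: only feature_name and feature_type are named in this thread, and "other_schema_column" is a hypothetical stand-in, not a real schema field.

```python
def columns_missing_from_census(schema_columns, census_columns):
    """Return the schema var columns that census var does not expose, in schema order."""
    census = set(census_columns)
    return [col for col in schema_columns if col not in census]


# Placeholder column lists (assumptions, not the real schema or census columns).
schema_var_columns = ["feature_name", "feature_type", "other_schema_column"]
census_var_columns = ["feature_name"]

print(columns_missing_from_census(schema_var_columns, census_var_columns))
# ['feature_type', 'other_schema_column']
```

The function accepts any iterable of names, so the columns of a var dataframe could be passed in directly.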

cc: @sidneymbell @MaximilianLombardo

sidneymbell commented 1 day ago

Thanks, Isaac.

I'm not aware of the context for the original decision(?) to leave these out of Census while requiring them for Cellxgene. @brianraymor may have more context here.

Do we have any documented user requests for these fields?

Taking a brief look at the schema, my initial suggestion is that it's probably OK that we don't provide these in Census:

ivirshup commented 1 day ago

For some context: when I was discussing what was needed for the 5.2 schema bump, it was mentioned that I should add the new feature_type column to var. But when I went to do it, I saw that a bunch of similar columns were missing too.


Looks like feature_type (i.e., gene vs spike-in) is already encoded in feature_name because we require spike-ins to have (spike-in control) appended to the name. This field probably isn't required from a user perspective.

My understanding from the schema is that our feature_type maps to Ensembl's biotype which can have values like "protein coding" or "Pseudogene".

I grabbed 5a37d48e-264b-4f09-b284-49e5f7bd3eaa, which has these entries for that column:

In [6]: a.var["feature_type"].value_counts()
Out[6]: 
feature_type
protein_coding                        17013
lncRNA                                 8245
transcribed_unprocessed_pseudogene       31
transcribed_unitary_pseudogene           14
artifact                                 11
IG_C_gene                                10
TR_C_gene                                 6
IG_V_gene                                 5
TR_V_gene                                 5
transcribed_processed_pseudogene          3
IG_C_pseudogene                           1
processed_pseudogene                      1
Name: count, dtype: int64

Based on this, would you think we should include it?
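
To make the difference concrete, here is a rough sketch in plain Python of what can be recovered from feature_name alone versus what feature_type carries. The feature names and rows are hypothetical toy data, the helper name is made up, and the suffix check assumes the "(spike-in control)" marker sits at the very end of the name. The spike-in row leaves feature_type unset because this sketch does not assume what the schema stores there.

```python
from collections import Counter

# Assumed marker, taken from the naming rule quoted above.
SPIKE_IN_MARKER = "(spike-in control)"


def kind_from_name(feature_name):
    """Classify a feature as 'spike-in' or 'gene' using only its name."""
    return "spike-in" if feature_name.endswith(SPIKE_IN_MARKER) else "gene"


# Hypothetical (feature_name, feature_type) rows; the type values are a few of
# those shown in the value_counts output above.
toy_var = [
    ("GENE_A", "protein_coding"),
    ("GENE_B", "lncRNA"),
    ("GENE_C", "IG_C_gene"),
    ("SPIKE_X (spike-in control)", None),
]

by_name = Counter(kind_from_name(name) for name, _ in toy_var)
by_type = Counter(ftype for _, ftype in toy_var if ftype is not None)

print(dict(by_name))  # {'gene': 3, 'spike-in': 1}
print(dict(by_type))  # {'protein_coding': 1, 'lncRNA': 1, 'IG_C_gene': 1}
```

Under these assumptions, the name-based check collapses every non-spike-in feature into a single "gene" bucket, while feature_type keeps the biotypes apart.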


For the other two columns I think it could make sense to not include them on the basis that they don't really add anything in this context.

However, I do think there is a case to be made for just having the same info in both, because it's harder to maintain two schemas than one.