Open ivirshup opened 1 day ago
Thanks, Isaac.
I'm not aware of the context for the original decision(?) to leave these out of Census while requiring them for Cellxgene. @brianraymor may have more context here.
Do we have any documented user requests for these fields?
Taking a brief look at the schema, my initial suggestion is that it's probably OK that we don't provide these in Census:
We standardize the value of feature_reference based on the organism, so the value would always be the same based on organism. Probably isn't required from a user perspective.
Looks like feature_type (i.e., gene vs spike in) is already encoded in feature_name bc we require spike ins to have (spike-in control) appended to the name. This field probably isn't required from a user perspective.
The value of feature_biotype is always gene or spike-in. Given that spike-ins can be readily identified based on feature_name, I'd say it's probably OK to leave this out.
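As a rough illustration of the point about feature_name, here is a minimal sketch of how a user might recover the gene vs. spike-in distinction from the name alone. It assumes the suffix is exactly the string (spike-in control) quoted above; the helper name infer_feature_biotype and the two example rows are hypothetical and not part of Census or the schema.

```python
import pandas as pd

# Suffix as quoted in the comment above; the exact formatting is an assumption.
SPIKE_IN_SUFFIX = "(spike-in control)"


def infer_feature_biotype(var: pd.DataFrame) -> pd.Series:
    """Label each feature "spike-in" or "gene" based on its feature_name (hypothetical helper)."""
    is_spike_in = var["feature_name"].str.endswith(SPIKE_IN_SUFFIX)
    return is_spike_in.map({True: "spike-in", False: "gene"}).rename("feature_biotype")


# Made-up example rows, for illustration only.
example_var = pd.DataFrame(
    {"feature_name": ["TSPAN6", "ERCC-00002 (spike-in control)"]}
)
inferred = infer_feature_biotype(example_var)
print(inferred.tolist())
```

This only recovers the two-valued gene vs. spike-in label; it says nothing about finer-grained categories.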
For some context, when I was talking about what was needed for the 5.2 schema bump, it was mentioned I should add the new feature_type column to var. But when I actually went to do it I saw there were a bunch of similar columns missing.
Looks like feature_type (i.e., gene vs spike in) is already encoded in feature_name bc we require spike ins to have (spike-in control) appended to the name. This field probably isn't required from a user perspective.
My understanding from the schema is that our feature_type maps to Ensembl's biotype, which can have values like "protein coding" or "Pseudogene".
I grabbed 5a37d48e-264b-4f09-b284-49e5f7bd3eaa, which has these entries for that column:
In [6]: a.var["feature_type"].value_counts()
Out[6]:
feature_type
protein_coding 17013
lncRNA 8245
transcribed_unprocessed_pseudogene 31
transcribed_unitary_pseudogene 14
artifact 11
IG_C_gene 10
TR_C_gene 6
IG_V_gene 5
TR_V_gene 5
transcribed_processed_pseudogene 3
IG_C_pseudogene 1
processed_pseudogene 1
Name: count, dtype: int64
Based on this, would you think we should include it?
For the other two columns I think it could make sense to not include them on the basis that they don't really add anything in this context.
feature_biotype would always be "gene".
feature_reference just points to the organism, not the specific reference genome, so it will always be the same value within a dataset.
However, I do think there is a case to be made for just having the same info in both, because it's harder to maintain two schemas than just one.
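The claim that these two columns carry no per-feature information within a dataset can be checked directly. Below is a minimal sketch, assuming var is a pandas DataFrame with the schema's column names. The helper constant_columns and the example values (including the feature_reference string) are made up for illustration and are not taken from a real dataset.

```python
import pandas as pd


def constant_columns(var: pd.DataFrame, columns: list[str]) -> dict[str, object]:
    """Return {column: value} for each listed column holding a single unique value (hypothetical helper)."""
    constant = {}
    for col in columns:
        if col not in var.columns:
            continue  # column is absent from this dataframe; nothing to check
        values = var[col].dropna().unique()
        if len(values) == 1:
            constant[col] = values[0]
    return constant


# Made-up example rows, for illustration only.
example_var = pd.DataFrame(
    {
        "feature_biotype": ["gene", "gene", "gene"],
        "feature_reference": ["NCBITaxon:9606"] * 3,
        "feature_type": ["protein_coding", "lncRNA", "protein_coding"],
    }
)
constant = constant_columns(
    example_var, ["feature_biotype", "feature_reference", "feature_type"]
)
print(constant)
```

In this toy frame only feature_type varies across features, which mirrors the argument above: the other two columns would add a constant per dataset.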
Which columns should be made accessible in the var dataframe in census?
Right now, census only has a subset of the columns specified in the cellxgene schema. In census, there are:
feature_id
feature_name
feature_length
The cellxgene schema for var, however, has a few columns that I think would make sense to include in census but aren't there. These columns are:
feature_biotype
feature_reference
feature_type
Is there a reason that we aren't including these already? And should we go ahead and start including them now?
cc: @sidneymbell @MaximilianLombardo