Irrationone / cellassign

Automated, probabilistic assignment of cell types in scRNA-seq data

Intelligent subsetting of data #8

Open kieranrcampbell opened 5 years ago

kieranrcampbell commented 5 years ago

A casual user may want to pass in an entire SCE along with a rho matrix that covers only their marker genes. In theory we should be able to detect this and appropriately subset the SCE by matching the colnames of rho with

  1. the rownames of the SCE
  2. some field in the rowData(sce) that looks like "ID", "id", "feature_id" (looking for ensembl ids)
  3. some field in the rowData(sce) that looks like "symbol", "Symbol", "hgnc_symbol", "mgi_symbol", "feature_symbol", "entrez_id" (looking for symbols or entrez ids)

LTLA commented 5 years ago

Chipping in here, as I'm seeing some similarities with some anti-patterns I've observed in scater.

The only legitimate subsetting approach is 1. Though 2 and 3 might seem convenient, they make it much more complicated for people to guarantee the right genes were being used. (Is the matching done based on the row names? Or did it end up using a field in rowData? In which case, which field?)

If people want to use IDs in one of the rowData fields, all they have to do is:

match(my_ids, rowData(sce)$SYMBOL)

... and supply that to a subset argument in cellassign() (for examples, see how some of the refactored scater functions handle subset_rows). This is much more explicit and makes the intent of the code clearer.

You will probably want to protect against NA elements in the subsetting vector, though.
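
A minimal sketch of that NA protection, using base R only. The matrix `counts` and the data frame `row_data` are hypothetical stand-ins for the SCE's assay and `rowData(sce)`, and `my_ids` is an illustrative marker list; none of these names come from cellassign itself.

```r
# Hypothetical stand-ins: `counts` plays the role of the SCE assay
# (genes x cells) and `row_data` the role of rowData(sce).
counts <- matrix(1:12, nrow = 4,
                 dimnames = list(c("ENSG01", "ENSG02", "ENSG03", "ENSG04"),
                                 c("cell1", "cell2", "cell3")))
row_data <- data.frame(SYMBOL = c("CD3E", "CD19", "MS4A1", "NKG7"),
                       stringsAsFactors = FALSE)

# Marker IDs supplied by the user; "FOXP3" is deliberately absent.
my_ids <- c("CD19", "NKG7", "FOXP3")

# match() returns NA for every ID it cannot find.
idx <- match(my_ids, row_data$SYMBOL)

# Report unmatched IDs explicitly instead of letting NA reach the subset.
missing_ids <- my_ids[is.na(idx)]
if (length(missing_ids) > 0) {
  warning("Dropping IDs not found in SYMBOL: ",
          paste(missing_ids, collapse = ", "))
}

# Keep only the valid row indices and subset the rows with them.
keep <- idx[!is.na(idx)]
counts_subset <- counts[keep, , drop = FALSE]
```

With a real SCE the same pattern would presumably read `idx <- match(my_ids, rowData(sce)$SYMBOL)` followed by `sce[keep, ]`. Whether to warn or stop on unmatched IDs is a design choice; stopping is the stricter option if every marker gene must be present.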

kieranrcampbell commented 5 years ago

Thanks for the input @LTLA, we'll go for this option then

LTLA commented 5 years ago

No probs. Plenty more ~opinions~ objective rules where that came from.