AllenInstitute / scrattch.mapping

Genearlized mapping scripts for RNA-seq and Patch-seq data
10 stars 2 forks source link

Error corrections and requests for mapping algorithms #56

Open jeremymiller opened 1 month ago

jeremymiller commented 1 month ago

I'm compiling a list of several error corrections and requests for mapping algorithms into a single issue. These are sorted from hardest/lowest priority (# 1) to easiest/highest priority (# 3), although it would be ideal if we could address them all at some point:

  1. Tree mapping: this algorithm does not seem to provide as "good" of results as used to several years ago. This could be due to differences in the data sets tested (e.g., I primarily use human patch-seq data while this was designed using mouse), or differences in how the tree is constructed (e.g., I define variable genes differently from Zizhen), or could be an issue in the code somewhere. It would be helpful if someone could go find a version of the code from a few years ago and then run the SAME data through the algorithm then and now and see if we get the same results. If we get different results, then we should update the code to match what it used to do.
  2. Seurat mapping: this algorithm does not work right now. I suspect this may be due to some major changes in Seurat between version 3 or 4 (for which this was written) and the current version (v5) which breaks back-compatibility. There are a few path's forward: (1) specify an older version of Seurat in the docker, (2) update the code to use older versions of Seurat functionality (e.g., there is a variable called something like "version" that you can call in the function when you originally build a Seurat object which fixes back-compatibility) or (3) update the mapping code to reflect the current Seurat clustering/mapping/integration pipelines. I'd recommend option # 3.
  3. HANN (and correlation) mapping: These algorithms work great. We have had requests, however to retain the matrix of bootstrap probabilities for all cell types, not just the top hit, just like we currently do in the tree-mapping algorithm (we don't need to retain probabilities for internal nodes, since there aren't any, nor subclasses, since that could be inferred). My understanding is that this is generated already in Scott's code, it's just not saved anywhere in the output, so hopefully this change is just a line or two.
gouwens commented 1 month ago

Thanks for putting this together! I think @xuehanci has basically done the first part of (1) already - testing the original code vs the new code (she's on vacation this week and next, though). I believe one of the differences is that the current scrattch.mapping/scrattch.taxonomy only can use binary dendrograms, while the original code and taxonomies had dendrograms that could have multiple children per node (see https://github.com/AllenInstitute/scrattch.taxonomy/issues/15).

She has been working on implementing that change but has run into a few places where a multi-child dendrogram causes errors in other places (I believe the hclust package is used in a couple instances to manage taxonomy dendrograms, and that doesn't accept multi-child dendrograms). So it's taken longer to fix that difference between the two code bases, which we'd need to do before identifying if there are other relevant differences (like with marker gene selection). But I think it is a pretty key difference to reconcile.