KrishnaswamyLab / MAGIC

MAGIC (Markov Affinity-based Graph Imputation of Cells) is a method for imputing missing values and restoring the structure of large biological datasets.
GNU General Public License v2.0
341 stars 97 forks

Implementing Molecular Cross Validation #170

Open dburkhardt opened 5 years ago

dburkhardt commented 5 years ago

Is your feature request related to a problem? Please describe.
Currently, MAGIC tends to oversmooth data when using automatic t selection and graph fitting parameters.

Describe the solution you'd like
Implement Molecular Cross Validation (https://www.biorxiv.org/content/10.1101/786269v1)

Additional context
Basic code flow:

  1. Split the counts in each cell into two disjoint sets, x1 and x2
  2. Build the graph
    1. Library-size normalize x1
    2. PCA
    3. Build the graph with a given knn and t
    4. Create the diffusion operator, D
  3. Apply the diffusion operator to the library-size-normalized x1
  4. Multiply D(libnorm(x1)) by the library sizes of x2
  5. Calculate the Poisson loss against the counts in x2
    • λ - k ln(λ), where λ is the prediction from step 4 and k is the observed x2 count
  6. Repeat for various knn and t
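
A rough NumPy-only sketch of the flow above, for discussion rather than as a proposed implementation. Everything here is an assumption made for illustration: the function names (`split_counts`, `libnorm`, `diffusion_operator`, `poisson_loss`, `mcv_sweep`) are hypothetical, the split uses binomial thinning, and the graph is a toy dense Gaussian kNN kernel standing in for MAGIC's actual graph construction.

```python
import numpy as np

def split_counts(counts, p=0.5, rng=None):
    """Binomially thin each count so that x1 + x2 == counts (assumed split scheme)."""
    rng = np.random.default_rng(rng)
    x1 = rng.binomial(counts, p)
    return x1, counts - x1

def libnorm(x):
    """Scale each cell (row) to sum to 1; all-zero rows stay zero."""
    size = x.sum(axis=1, keepdims=True)
    return x / np.maximum(size, 1)

def diffusion_operator(data, knn, n_pca=10):
    """Toy row-stochastic operator on PCA coordinates (NOT MAGIC's graph)."""
    centered = data - data.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    n_pca = min(n_pca, s.size)
    pcs = u[:, :n_pca] * s[:n_pca]
    sq = np.sum(pcs ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * pcs @ pcs.T, 0)
    # Per-cell bandwidth: squared distance to the knn-th neighbor
    # (index 0 of each sorted row is the cell itself).
    sigma2 = np.maximum(np.sort(d2, axis=1)[:, knn], 1e-12)
    kernel = np.exp(-d2 / sigma2[:, None])
    kernel = (kernel + kernel.T) / 2  # symmetrize the affinities
    return kernel / kernel.sum(axis=1, keepdims=True)

def poisson_loss(lam, k, eps=1e-12):
    """Mean of lam - k * ln(lam), i.e. Poisson NLL without the ln(k!) term."""
    return float(np.mean(lam - k * np.log(lam + eps)))

def mcv_sweep(counts, knn_values, t_values, seed=0):
    """Return {(knn, t): loss} for every combination in the grid."""
    x1, x2 = split_counts(counts, rng=seed)
    x1_norm = libnorm(x1.astype(float))
    x2_size = x2.sum(axis=1, keepdims=True)
    losses = {}
    for knn in knn_values:
        D = diffusion_operator(x1_norm, knn)
        for t in t_values:
            smoothed = np.linalg.matrix_power(D, t) @ x1_norm
            lam = smoothed * x2_size  # rescale to the depth of x2
            losses[(knn, t)] = poisson_loss(lam, x2)
    return losses

# Synthetic counts purely to exercise the code; not real data.
rng = np.random.default_rng(0)
counts = rng.poisson(rng.gamma(2.0, 2.0, size=(60, 20)))
losses = mcv_sweep(counts, knn_values=[3, 5], t_values=[1, 2, 4])
best = min(losses, key=losses.get)  # (knn, t) with the lowest held-out loss
```

The dense distance matrix only works for small inputs; a real version would presumably reuse MAGIC's own graph code.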

scottgigante commented 5 years ago

@dburkhardt I have some thoughts / materials on this courtesy of @batson and @jamestwebber. Happy for you to actually implement it of course :)

MAGIC Sweep: https://github.com/czbiohub/molecular-cross-validation/blob/master/src/molecular_cross_validation/scripts/magic_sweep.py

Similar, for a diffusion model: https://github.com/czbiohub/molecular-cross-validation/blob/master/src/molecular_cross_validation/scripts/diffusion_sweep.py

jamestwebber commented 5 years ago

Talked to @dburkhardt about this today while he was here. The magic_sweep script is the most directly applicable one for this, but I'll throw in the newly added mcv_sweep module and Grid Search vignette notebook as additional resources.

The GridSearchMCV class should work with a little plumbing, but it can't do anything clever with caching and so it'll be a lot slower than a more carefully engineered solution.
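
The comment does not say which caching it has in mind, so the following is only a guess at one example. For a fixed graph, D^t X for increasing t can be obtained by continuing from the previous power instead of starting over for each grid point. The helper name `diffuse_incrementally` is hypothetical, and the random row-stochastic matrix is a stand-in for a real diffusion operator.

```python
import numpy as np

def diffuse_incrementally(D, X, t_values):
    """Yield (t, D^t @ X) for increasing t, reusing the previous product."""
    current = np.asarray(X, dtype=float)
    t_done = 0
    for t in sorted(set(t_values)):
        # Only apply the extra steps needed to go from t_done to t.
        for _ in range(t - t_done):
            current = D @ current
        t_done = t
        yield t, current

# Stand-in operator and data, just to show the call pattern.
rng = np.random.default_rng(0)
D = rng.random((5, 5))
D /= D.sum(axis=1, keepdims=True)  # make rows sum to 1
X = rng.random((5, 3))
results = dict(diffuse_incrementally(D, X, [1, 3, 6]))
```

With this pattern a sweep over t costs max(t) matrix products per graph rather than one full refit per (knn, t) pair, which is the kind of saving a generic grid search over an opaque estimator cannot make.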