angelolab / ark-analysis

Integrated pipeline for multiplexed image analysis
https://ark-analysis.readthedocs.io/en/latest/
MIT License

Spatial-LDA Summary and Future Functionality #472

Closed bcollica closed 2 years ago

bcollica commented 2 years ago

Here is a brief summary of the current spatial-LDA integration into the ark pipeline and some things users should be aware of:

Notebooks: There are currently two notebooks for spatial-LDA. The first processes and formats data from a cell table into a form compatible with the spatial_lda library from Calico; it also contains functions for calculating and plotting exploratory metrics related to parameter tuning. The second runs the actual training and inference of the spatial-LDA algorithm and supports visualizations for inspecting the results.
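As an illustration of what the featurization step computes, here is a minimal sketch (not the spatial_lda library's actual implementation) that counts each phenotype within a fixed radius of every cell. The column names `centroid_x`, `centroid_y`, and `phenotype` are hypothetical stand-ins for the cell table schema:

```python
import pandas as pd
from scipy.spatial import cKDTree

def neighborhood_features(cell_table, radius):
    """Count occurrences of each phenotype within `radius` of every cell (self included)."""
    coords = cell_table[["centroid_x", "centroid_y"]].to_numpy()
    phenotypes = pd.get_dummies(cell_table["phenotype"])
    tree = cKDTree(coords)
    neighbor_lists = tree.query_ball_point(coords, r=radius)
    rows = [phenotypes.iloc[idx].sum() for idx in neighbor_lists]
    return pd.DataFrame(rows, index=cell_table.index)

# toy cell table: two cells near each other, one far away
cells = pd.DataFrame({
    "centroid_x": [0.0, 1.0, 10.0],
    "centroid_y": [0.0, 0.0, 0.0],
    "phenotype": ["T", "B", "T"],
})
features = neighborhood_features(cells, radius=2.0)
```

Each row of `features` becomes one document for the LDA step, with phenotype counts playing the role of word counts.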

Parameter Values: The two main parameters the user needs to specify are (1) the size of the neighborhood radius in the featurization step and (2) the number of topics in the training step. The processing notebook includes functions to help the user explore reasonable values for both. Users are free to adjust the difference penalty as well, but it has not been shown to make much difference in the end result.
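One simple way to scope reasonable radius values, sketched here with scipy rather than the notebook's own helper functions, is to compute the mean neighbor count per cell over a grid of candidate radii:

```python
import numpy as np
from scipy.spatial import cKDTree

def mean_neighbor_counts(coords, radii):
    """Mean number of neighbors per cell (excluding self) at each candidate radius."""
    tree = cKDTree(coords)
    means = []
    for r in radii:
        counts = np.array([len(nb) for nb in tree.query_ball_point(coords, r=r)])
        means.append(counts.mean() - 1)  # subtract the self-match
    return np.array(means)

# synthetic cell centroids on a 100x100 FOV
rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(500, 2))
radii = [5, 10, 25, 50]
curve = mean_neighbor_counts(coords, radii)
```

Plotting `curve` against `radii` shows where neighborhoods become large enough to be informative without averaging away local structure.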

Parallel Processing and Improving Runtime: The other important value users can specify is the number of parallel processes to use locally, which makes a noticeable difference in runtime. Users also have the option to change the primal_tol and threshold arguments during training to cut down on unnecessary looping, but they are encouraged to test these settings on a subset of data before making any final decisions about the models trained on their particular samples.
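The per-FOV work parallelizes naturally across a worker pool. The sketch below uses threads for portability, though the notebook's setting controls separate processes; `featurize_fov` is a hypothetical stand-in for the real per-FOV step:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from scipy.spatial import cKDTree

def featurize_fov(coords, radius=5.0):
    """Stand-in for the per-FOV featurization step: neighbor counts per cell."""
    tree = cKDTree(coords)
    return np.array([len(nb) - 1 for nb in tree.query_ball_point(coords, r=radius)])

# eight synthetic FOVs of 200 cells each
rng = np.random.default_rng(1)
fovs = [rng.uniform(0, 100, size=(200, 2)) for _ in range(8)]

n_workers = 4  # analogous to the number of parallel processes in the notebook
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    results = list(pool.map(featurize_fov, fovs))
```

Because FOVs are independent, speedup is roughly linear in worker count until CPUs are saturated, which is consistent with the benchmark numbers below.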

Two ideas for further runtime improvements are to replace the admm.py module with a C-based solver from the CVXPY library, or to Cythonize admm.py.

Additional Visualizations and Metrics: A useful plot which is not currently implemented would be a scatter plot of the cell locations in an FOV (similar to plot_fovs_with_topics()) with each point shaded by the probability that the cell belongs to a given topic. It may also be worth adapting the current gap-statistic code in processing.py to work with a trained spatial-LDA model, since this metric applies to almost any clustering technique. (While spatial-LDA is not a clustering technique in the traditional sense, these hierarchical probability models can be generalized as mixture models, much as k-means can be.)
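A rough sketch of the proposed plot, assuming a cells-by-topics probability matrix from inference; the function name and inputs here are illustrative, not part of the current API:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted use
import matplotlib.pyplot as plt
import numpy as np

def plot_topic_probability(x, y, topic_probs, topic, ax=None):
    """Scatter cell locations, shaded by P(cell belongs to `topic`)."""
    if ax is None:
        _, ax = plt.subplots()
    sc = ax.scatter(x, y, c=topic_probs[:, topic],
                    cmap="viridis", vmin=0, vmax=1, s=10)
    ax.figure.colorbar(sc, ax=ax, label=f"P(topic {topic})")
    ax.set_xlabel("centroid_x")
    ax.set_ylabel("centroid_y")
    return ax

# synthetic cell locations and topic probabilities (rows sum to 1 over 3 topics)
rng = np.random.default_rng(2)
x, y = rng.uniform(0, 100, 50), rng.uniform(0, 100, 50)
probs = rng.dirichlet(np.ones(3), size=50)
ax = plot_topic_probability(x, y, probs, topic=0)
```

Unlike a hard topic assignment, this shading shows where the model is uncertain, which is exactly the information lost by the current argmax-style coloring.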

Next Steps: The most up-to-date branch in the spatial-LDA pipeline is LDA_Training_and_Inference. As it stands, this branch could likely be merged into the spatial_LDA branch and then into master, unless people wish to implement any of the above-mentioned (or other) changes.

Both notebooks are functional and have been tested using real data. The notebooks in the LDA_Training_and_Inference branch have been tested on a subset of 8 FOVs which contain a total of 10,740 cells. Using a 2020 MacBook Air with 8 available CPUs, 16GB RAM, and 250GB disk space, I was able to train a spatial-LDA model on 75% of the data (8,055 cells) with 4 CPUs in about 15 minutes. Running inference on all 10,740 cells took about 7 minutes with 4 CPUs.

ngreenwald commented 2 years ago

Closed by #437