Interest in adding in methods for identifying a representative subsample of cells?

jmschrei commented 3 years ago

Describe the problem that your feature request would address: Single-cell data sets are getting very large, even after filtering. Some users might find it helpful to reduce the size of these data sets, while preserving the information in them, by getting rid of redundant cells.

Describe the solution you'd like: Submodular optimization is a general framework for selecting a representative subset of items from a collection. It is widely used, and last year it was applied to select a sketch of cells from single-cell data sets (https://dl.acm.org/doi/pdf/10.1145/3388440.3412409). Users might find it useful to have an optional step for selecting a subset of cells from their processed files.
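To make the idea concrete, here is a minimal pure-Python sketch of greedy selection under a facility-location objective, one common submodular function. It is an illustration only, not apricot's implementation or the method from the paper above: the similarity function, the function names, and the toy coordinates are all assumptions chosen for the example.

```python
import math

def similarity(a, b):
    """Assumed similarity for this sketch: decays with Euclidean distance."""
    return 1.0 / (1.0 + math.dist(a, b))

def greedy_facility_location(cells, k):
    """Greedily pick up to k indices that maximize sum_i max_{j in S} sim(i, j)."""
    n = len(cells)
    sim = [[similarity(cells[i], cells[j]) for j in range(n)] for i in range(n)]
    best = [0.0] * n  # best[i]: similarity of cell i to its closest selected cell
    selected = []
    for _ in range(min(k, n)):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain: how much adding j improves coverage of every cell.
            gain = sum(max(sim[i][j] - best[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best = [max(best[i], sim[i][best_j]) for i in range(n)]
    return selected

# Toy data (hypothetical): two tight groups of three "cells" each.
cells = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
picked = greedy_facility_location(cells, 2)
```

With two picks on this toy data, the second pick comes from the group the first pick did not cover, because a near-duplicate of an already selected cell adds almost no marginal gain. This naive version materializes the full similarity matrix, so it is only meant to show the objective, not to scale.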

Describe alternatives you've considered: GeoSketch (http://cb.csail.mit.edu/cb/geosketch/) is an alternative method for selecting a representative subset of items.

Additional context: I've developed a Python tool, apricot (https://github.com/jmschrei/apricot), that implements general-purpose submodular optimization (and was used in the paper above). I'd be happy to talk with you about integrating it into ArchR.

rcorces commented 3 years ago

This is an interesting idea, though I will say that we've gone to great lengths to build ArchR to enable analysis of millions of cells. This is, in fact, one of its greatest selling points. Average-sized datasets right now are in the 100,000-cell range, which can be processed relatively easily on a standard laptop, so I don't feel that this is a huge issue with ArchR.

Our bandwidth for enhancements is pretty thin at the moment, but if you have a vision for how this would work, let us know. I think the easiest approach would be for users to take an output from ArchR and run it through apricot, which would return the barcodes to keep. Users would then go back to ArchR and use subsetArchRProject() to subset to the "representative" cells. I think we are probably hesitant to introduce more dependencies in ArchR at the moment, especially non-R packages. Solutions that require minimal changes on our end are more likely to be implemented.
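As a rough sketch of that round trip on the Python side, the snippet below reads an exported table, picks a subset, and writes the barcodes to keep, one per line. Everything specific here is an assumption for illustration: the CSV layout (a header row, barcodes in the first column, numeric features after it), the function names, and the selector, which is a simple farthest-point stand-in rather than apricot. A real version would swap in the actual selection step.

```python
import csv
import math

def read_embedding(path):
    """Read barcodes (first column) and numeric features (remaining columns)."""
    barcodes, rows = [], []
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header row (assumed to be present)
        for record in reader:
            barcodes.append(record[0])
            rows.append([float(value) for value in record[1:]])
    return barcodes, rows

def farthest_point_selection(rows, k):
    """Stand-in selector (not apricot): repeatedly take the cell farthest
    from everything selected so far, starting from the first row."""
    if not rows or k <= 0:
        return []
    selected = [0]
    dist = [math.dist(row, rows[0]) for row in rows]
    while len(selected) < min(k, len(rows)):
        nxt = max(range(len(rows)), key=dist.__getitem__)
        if dist[nxt] == 0.0:
            break  # every remaining cell coincides with a selected one
        selected.append(nxt)
        dist = [min(d, math.dist(row, rows[nxt])) for d, row in zip(dist, rows)]
    return selected

def select_barcodes(in_csv, out_txt, k, selector=farthest_point_selection):
    """Write the barcodes of the selected cells to out_txt, one per line."""
    barcodes, rows = read_embedding(in_csv)
    keep = [barcodes[i] for i in selector(rows, k)]
    with open(out_txt, "w") as handle:
        handle.write("\n".join(keep) + "\n")
    return keep
```

The resulting text file could then be read back into R and the barcodes used with subsetArchRProject(). Keeping the selector as a plain function argument means ArchR itself would need no new dependency.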

rcorces commented 3 years ago

And thank you for using the issue template! If only we could get everyone to do the same!

jmschrei commented 3 years ago

I agree that ArchR has done a great job of being scalable to large numbers of cells. Certainly, summarization is not required for every data set or analysis. I can imagine, though, that some analysis tasks that require pairwise computation might benefit from this type of summarization.
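A quick back-of-the-envelope calculation shows why pairwise tasks are the natural beneficiaries: the number of unordered cell pairs grows quadratically with the number of cells. The 100,000 figure is the typical data set size mentioned above; the 10,000-cell sketch size is a hypothetical value chosen only for illustration.

```python
def pairwise_comparisons(n_cells):
    """Number of unordered pairs among n_cells items: n * (n - 1) / 2."""
    return n_cells * (n_cells - 1) // 2

full = pairwise_comparisons(100_000)   # 4,999,950,000 pairs
sketch = pairwise_comparisons(10_000)  # 49,995,000 pairs (hypothetical sketch size)
reduction = full / sketch              # roughly a 100-fold reduction
```

Keeping one cell in ten therefore cuts the pairwise work by about a factor of one hundred, not ten.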

I was thinking along the same lines as you regarding the integration. If you think it would be helpful, I can provide a Python script that loads an h5 file, reads a data matrix from a specified path, and saves a list of barcodes to an output path in the same h5 file. I also agree that cross-language dependencies can be a nightmare. Maybe you can say this is a feature that requires users to install Python themselves? I'd be happy to write a short guide on how to install Python -- either for the ArchR documentation itself or, more readily, for the apricot documentation.

rcorces commented 3 years ago

I think that type of solution would work, but I defer to @jgranja24, especially when it comes to making modifications to the h5 file. We're also working on solutions for dependencies (containerization, etc.), which might make this all easier. Hopefully we will have an update on that soon. Thoughts, Jeff?

jgranja24 commented 3 years ago

Hi @jmschrei, I will take a look at the paper you shared, think about it a bit, and comment once I am more familiar with it. I understand the general use case of what you are sharing; I just want to think a bit more about how this could work in the ArchR ecosystem.

GreenleafLab / ArchR

Interest in adding in methods for identifying a representative subsample of cells? #548