[QUESTION] Using a directory to serve multiple datasets in a K8s deployment?

dmaljovec commented 3 years ago

👋 Hi all. Cool project!

I was recently turned on to your efforts because one of our data science teams is looking to use the tool internally to view our own data. I would love to take the docker image you provide and deploy it into a kubernetes cluster, but the current workflow looks like it is run with a single dataset on load. Our team wants to be able to view a large number of data files, and so it would be cool if we could dynamically switch datasets without reprovisioning the deployment.

I know about cellxgene-gateway, but I am worried that it is not compatible with our use case. If we want a large number of data files to be accessible, it would spin up a lot of local Flask apps on a fixed set of resources, even though most datasets are not actively used most of the time. We considered adjusting that codebase to dynamically spin k8s deployments up and down rather than creating new Flask apps locally, but that still feels like a heavyweight solution. Something more lightweight and nimble feels much more aligned with the use case we want to support: for example, a user browses to a running instance of cellxgene with a query parameter that indicates which dataset they want to view.
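To make the query-parameter idea concrete, here is a minimal sketch of how a single running service could map a request URL to a file under one shared data directory. It uses only the Python standard library and is not cellxgene's API: the `dataset` parameter name, the `resolve_dataset` helper, and the example URL are all hypothetical.

```python
from pathlib import Path
from urllib.parse import parse_qs, urlparse

# File types mentioned in the --dataroot help text (H5AD and CXG).
DATASET_SUFFIXES = {".h5ad", ".cxg"}


def resolve_dataset(url: str, dataroot: Path) -> Path:
    """Map a hypothetical ?dataset=<name> query parameter to a path under dataroot."""
    query = parse_qs(urlparse(url).query)
    names = query.get("dataset", [])
    if len(names) != 1:
        raise ValueError("expected exactly one 'dataset' query parameter")

    root = dataroot.resolve()
    candidate = (root / names[0]).resolve()
    # Reject names that escape the data root (e.g. "../secrets.h5ad").
    if root not in candidate.parents:
        raise ValueError(f"dataset outside dataroot: {names[0]!r}")
    if candidate.suffix not in DATASET_SUFFIXES:
        raise ValueError(f"unsupported dataset type: {candidate.suffix!r}")
    return candidate
```

With something like this, one deployment could serve `http://localhost:5005/?dataset=pbmc3k.h5ad` and any other file in the mounted directory, loading each dataset only when it is requested.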

While poking around we stumbled on this bit of code, and I was wondering whether there is a timeline for it, or anything we could do on our end, to help make this feature generally available? I'd be happy to contribute in whatever way makes sense to help this happen.

Thanks in advance for the awesome project and any advice you can offer!

cc: @ian-quigley and @joncirish

dmaljovec commented 3 years ago

To clarify, the linked bit of code is this block:

    @click.option(
        "--dataroot",
        default=DEFAULT_CONFIG.server_config.multi_dataset__dataroot,
        metavar="<data directory>",
        help="Enable cellxgene to serve multiple files. Supply path (local directory or URL)"
        " to folder containing H5AD and/or CXG datasets.",
        hidden=True,
    )  # TODO: unhide when dataroot is supported
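For anyone unfamiliar with hidden options: a hidden flag is left out of the help output, but the parser still accepts it. The sketch below shows that behavior with the standard library's `argparse` and `help=argparse.SUPPRESS`, used here as a stand-in for click's `hidden=True`. It is not cellxgene's launcher; the `demo-launch` program name and the `--port` option are made up for illustration. Whether the feature behind a hidden flag is actually supported is a separate question.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy CLI with one visible option and one hidden option."""
    parser = argparse.ArgumentParser(prog="demo-launch")
    parser.add_argument("--port", type=int, default=5005, help="Port to listen on.")
    # SUPPRESS hides the option from --help, but the parser still accepts it.
    parser.add_argument("--dataroot", default=None, help=argparse.SUPPRESS)
    return parser


parser = build_parser()
args = parser.parse_args(["--dataroot", "/data"])
print(args.dataroot)                            # value is parsed normally
print("--dataroot" in parser.format_help())     # but it is absent from the help text
```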

MaximilianLombardo commented 3 years ago

Hey Dan,

Thank you for your question and for your interest in cellxgene. So in general, the cellxgene application only has access to a single dataset at a time, meaning that you would need to spin up multiple instances of the application in order to view multiple datasets (as you have mentioned already).

The piece of code that you have found is actually part of the backend for our hosted version of the cellxgene application, which includes a data portal that hosts multiple datasets, each with its own hosted cellxgene explorer link specific to that dataset. This sounds a little more like the solution you are looking for. The code for the data portal can be found here, which may be useful for your case (caveat: we have not built this with reuse specifically in mind). This probably does not apply to your group, but there is also the option of submitting your data to our data portal while keeping it private (just throwing it out there, though I imagine you have strict constraints around data sharing).

The other option you might look into is a solution one of our users created using cellxgene gateway and an apache2 reverse proxy, which allows them to host multiple datasets with different access for different user groups. You can find that solution here.

Let me know if you have any other questions on the topic. If you happen to develop a hosting solution you can share publicly, we'd love to feature it in our documentation (where we are crowdsourcing different self-hosting approaches from the cellxgene community).

Cheers,

Max

signechambers1 commented 3 years ago

Closing this issue, @dmaljovec let us know if you have additional questions!

chanzuckerberg / cellxgene

[QUESTION] Using a directory to serve multiple datasets in a K8s deployment? #2187