RGLab / CytoML

A GatingML Interface for Cross Platform Cytometry Data Sharing
GNU Affero General Public License v3.0

Get indices of gated events from a FJ workspace / alternative to flowjo_to_gatingset #140

Closed Close-your-eyes closed 2 years ago

Close-your-eyes commented 2 years ago

Hi there,

Here is the case: I would like to obtain the indices (rows) of the events that belong to a gated population in a FlowJo workspace. The format could be a numeric vector or something similar.

Right now I am doing this with flowjo_to_gatingset and then some other functions. flowjo_to_gatingset is slow for big FCS files, though, as it writes .h5 files to disk. This h5 procedure may be required if someone wants to apply ggcyto, but since I am only interested in the indices it seems like overkill to me.

So I have been looking at, for example, flowUtils in order to create filters from FlowJo's wsp, which could then be applied to the FCS files. But FlowJo's XML format is not compatible.

Question: Is there an alternative to flowjo_to_gatingset for obtaining the indices of gated events in a FlowJo wsp?

Thanks.

gfinak commented 2 years ago

The h5 files created contain your single-cell data alongside all the gate definitions, as well as the indices for all the cell events in those gates. At the end you get a materialization of your workspace that you can do a lot more with long term, not just visualization. For example, you can work with massive data sets without loading all the files into memory. I suggest you just bite the bullet and do the conversion for the whole workspace and all the FCS files. Then pull out what you need. Keep the resulting converted gating set on hand for future use. Can you say a bit more about your end goal? What are these filters and how are you using them?

Close-your-eyes commented 2 years ago

Thank you for the response. To be honest, this is more of an optimization than a bug. There are also ways to do things by hand, but that's what we want to avoid, isn't it?

By filters I meant the gate filters (ellipsoid, rectangle, …) that can be applied to a flowFrame to filter for events (rows).

Here is what I want to do from time to time:

I have 30 FCS files with cells from different mice (WT, mutant, …) that have been stained with the same antibody panel (say, 10 markers). Each file may contain 3-5x10^6 events. Most of the events are irrelevant, as I am only interested in a small subset, let’s say 2x10^5 events per file. This subset is gated in FlowJo for every FCS file, and a corresponding .wsp exists.

I want to calculate a dimension reduction with the relevant events (tsne/umap) and annotate clusters (kmeans-, leiden-, louvain-algorithm). I want to find out if any population/cluster is diminished or elevated in some mice.

To use the relevant R functions I need a concatenated data frame of compensated fluorescence intensities from all the FCS files. So the relevant data will end up in memory anyway. I could obtain those data directly from the FCS files if I only knew the respective indices.

As I said above, with flowjo_to_gatingset I can get where I want but I wondered if there is a way to avoid having the h5-files written to disk.

gfinak commented 2 years ago

It seems like you're working against yourself. Just convert the wsp and FCS files to a gating set and use the cytoverse API calls to get the cells from the relevant smaller subpopulation that you want. I think the API call is gs_pop_get_data(gs, pop), where gs is the gating set and pop is the population name from the imported workspace.
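A minimal sketch of that route, assuming the example workspace `manual.xml` and its FCS files from the flowWorkspaceData package stand in for your own data. The group name "T-cell" belongs to that example workspace, and the population is picked arbitrarily for illustration:

```r
library(CytoML)
library(flowWorkspace)

# Example workspace and FCS files shipped with flowWorkspaceData (stand-ins for your own)
dataDir <- system.file("extdata", package = "flowWorkspaceData")
wsfile <- list.files(dataDir, pattern = "manual.xml", full.names = TRUE)

# Parse the FlowJo workspace and convert one sample group to a GatingSet
ws <- open_flowjo_xml(wsfile)
gs <- flowjo_to_gatingset(ws, name = "T-cell")

# List the imported populations, then pull the events of one of them
pops <- gs_get_pop_paths(gs, path = "auto")
pop <- pops[length(pops)]  # arbitrary choice, for illustration only
cs <- gs_pop_get_data(gs, pop)

# Keep the converted GatingSet on disk so later sessions can use load_gs()
gs_dir <- tempfile("gs_")
save_gs(gs, path = gs_dir)
```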

That you don't want to write HDF5 files to disk is irrelevant. FCS is not a random access format; it must be loaded into RAM. HDF5 doesn't have this issue: you're disk bound, not RAM bound. This time it's 30 files, next time it's 300. For 30 files, in the time you've spent investigating this hack, the problem would have been solved and you could be doing science instead.

Whatever you're aiming to do, we've already thought about it and there's a better way to do it with the cytoverse. Read the docs, work through the examples. Even check out CytoExploreR, Dillon Hamill's excellent UI built on top of these tools.

Greg Finak

Close-your-eyes commented 2 years ago

Okay. Thank you.

jacobpwagner commented 2 years ago

@Close-your-eyes, I know you might already be doing this, but once the gates are applied, you can pull a boolean mask for any subpopulation in any sample efficiently with gh_pop_get_indices or gh_pop_get_indices_mat. You could then save just those (or the numeric indices) out for later FCS filtering if all you are trying to do is avoid repeated loading of the GatingSet.
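As a rough illustration of the save-and-reuse idea, the base R sketch below uses simulated stand-ins: `mask` plays the role of the boolean vector from gh_pop_get_indices, and `events` plays the role of the event matrix read from the same, unmodified FCS file (for example via exprs(read.FCS(...))). Both objects and the channel names are assumptions for illustration only. Note that values read directly from an FCS file are raw, so compensation and transformation would still have to be applied separately.

```r
set.seed(1)
n_events <- 1000

# Stand-in for the boolean mask of one gated population in one sample
mask <- runif(n_events) < 0.2
indices <- which(mask)

# Save the numeric indices once ...
idx_file <- tempfile(fileext = ".rds")
saveRDS(indices, idx_file)

# ... and later subset the event matrix without loading the GatingSet.
# The row order must match the original file for the indices to be valid.
events <- matrix(
  rnorm(n_events * 3),
  ncol = 3,
  dimnames = list(NULL, c("FSC-A", "SSC-A", "CD3"))
)
gated <- events[readRDS(idx_file), , drop = FALSE]
```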

But one way or another, at least once you will need to load in the geometric gate definitions and apply them to the data to obtain the indices. And as Greg mentioned, the most efficient and scalable way to do that will be to let the data be managed as HDF5 instead of in memory.

However, after you've done that once, there's nothing stopping you from just saving out vectors/matrices of filter indices to apply to FCS files if you so choose. But again, as Greg said, for most cases the most efficient and scalable way to get those subsets and concatenate them will be using gh_pop_get_data/gs_pop_get_data on the GatingSet.

A basic sketch, just in case you haven't already been looking at this:

library(flowCore)
library(flowWorkspace)

dataDir <- system.file("extdata",package="flowWorkspaceData")
gs_archive <- list.files(dataDir, pattern = "gs_bcell_auto", full.names = TRUE)
gs <- load_gs(gs_archive)

# Boolean mask
mask <- gh_pop_get_indices(gs[[1]], "lymph")
# Numeric indices
indices <- which(mask)

# Multiple populations (a matrix column for each)
mask_matrix <- gh_pop_get_indices_mat(gs[[1]], c("CD3", "CD19"))
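# Converted to a list of indices for each pop
# (lapply over the columns always returns a list; apply() would simplify
# to a matrix if every population happened to have the same event count)
indices_multi_pops <- lapply(as.data.frame(mask_matrix), which)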
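For the concatenated table of gated events discussed earlier in the thread, a sketch along the same lines might look like the following. It reuses the `gs_bcell_auto` archive and the "lymph" population from the sketch above purely as stand-ins. The `sample` column and the variable names are illustrative assumptions, and the values carry whatever compensation and transformation were stored with the GatingSet.

```r
library(flowCore)
library(flowWorkspace)

dataDir <- system.file("extdata", package = "flowWorkspaceData")
gs_archive <- list.files(dataDir, pattern = "gs_bcell_auto", full.names = TRUE)
gs <- load_gs(gs_archive)

# Events inside the "lymph" gate for every sample (returned as a cytoset)
cs <- gs_pop_get_data(gs, "lymph")

# One data frame per sample, tagged with the sample name
per_sample <- lapply(sampleNames(cs), function(sn) {
  m <- exprs(cs[[sn]])  # event matrix of the gated subset
  data.frame(sample = sn, m, check.names = FALSE)
})

# Row-bind into a single table; assumes all samples share the same channels
gated_events <- do.call(rbind, per_sample)
```

The resulting `gated_events` table can then be handed to dimension reduction or clustering functions, with the `sample` column kept aside for comparing groups.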