aristoteleo / spateo-release

Spatiotemporal modeling of spatial transcriptomics
https://spateo-release.readthedocs.io/
BSD 2-Clause "Simplified" License

Tissue bounding box selection #10

Closed Lioscro closed 2 years ago

Lioscro commented 2 years ago

Add support for selecting (either manually or automatically) a bounding box for the entire tissue on a slice. Should use the total UMI counts per pixel, as returned by io.read_bgi_agg (in the refactor branch currently).

Selecting a bounding box will help with computational and memory requirements by focusing only on the relevant pixels.

We will also have to figure out how to have this bounding box integrate smoothly with all downstream processes (for instance, we need a way to use this same bounding box for cell segmentation / binning and ultimately for reading the full count matrix, and to do so as intuitively as possible).

We also need to be careful in figuring out how images will be handled when using a bounding box.
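As a point of reference, a minimal sketch of what automatic detection could look like, assuming io.read_bgi_agg yields a scipy sparse matrix of per-pixel totals. The function name and `min_umi` threshold below are illustrative, not spateo's API:

```python
# Illustrative sketch only; not an existing spateo function.
from scipy import sparse

def detect_bounding_box(X: sparse.spmatrix, min_umi: int = 1):
    """Return (xmin, xmax, ymin, ymax) covering every pixel with >= min_umi counts.

    Assumes rows index x and columns index y; swap if the convention differs.
    """
    coo = X.tocoo()
    mask = coo.data >= min_umi
    rows, cols = coo.row[mask], coo.col[mask]
    if rows.size == 0:
        raise ValueError("no pixels pass the UMI threshold")
    return rows.min(), rows.max(), cols.min(), cols.max()
```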

Xiaojieqiu commented 2 years ago

@Lioscro I will write the function to obtain the convex hull of all nanoballs that have non-zero UMI counts (or at least > x UMIs) and then save the convex hull as a key in .uns. We can then use this for cell segmentation and binning by selecting only the pixels (in the image) or bins that fall inside this convex hull.

Does this sound good to you?

Lioscro commented 2 years ago

I'm not sure if saving into .uns is the best approach here. We will be selecting the bounding box using the total UMIs (a 2D Numpy array containing total UMI counts for each pixel), not an AnnData object. What I am envisioning is the following.

  1. Load sparse matrix of total UMI counts with io.read_bgi_agg.
  2. Detect bounding box (xmin, xmax, ymin, ymax) from the above sparse matrix.
  3. Pass the bounding box (xmin, xmax, ymin, ymax) to io.read_bgi to create an AnnData containing only the pixels in the bounding box. This function will also do any binning or cell segmentation to generate a bins x genes or segmented cells x genes AnnData.

We also need a place to keep track of the coordinate system, so .uns is where I'm thinking that information should go.
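Putting the three steps together, a hypothetical sketch of the flow (the bounding-box keyword arguments to io.read_bgi and the .uns key are assumptions about a future API, and it reuses the detect_bounding_box sketch above):

```python
import spateo as st  # assumed import alias

X = st.io.read_bgi_agg("slide.tsv.gz")           # 1. sparse total-UMI matrix (hypothetical path)
xmin, xmax, ymin, ymax = detect_bounding_box(X)  # 2. detect the bounding box
adata = st.io.read_bgi(                          # 3. AnnData restricted to the box
    "slide.tsv.gz",
    xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax,  # assumed keyword arguments
)
# Record where this crop sits in the original coordinate system.
adata.uns["bounding_box"] = {
    "xmin": int(xmin), "xmax": int(xmax),
    "ymin": int(ymin), "ymax": int(ymax),
}
```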

Lioscro commented 2 years ago

It would also be nice to have a manual selection process.

Xiaojieqiu commented 2 years ago
[Screenshot attached: 2022-01-04]

Ok, I like your thinking about using io.read_bgi_agg and passing the bounding box to read_bgi to create an AnnData. So in the bounding_box branch I added a function that 1) detects the alpha concave hull of all non-zero aggregate UMI counts and 2) returns this hull as a Polygon object. We can then 3) check whether coordinates fall inside this hull and ignore any nanoballs outside it when creating the AnnData object.
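For illustration, one way such hull detection could be sketched is with the alphashape package, which returns shapely geometries; the names, alpha value, and threshold below are illustrative, not necessarily what the bounding_box branch implements:

```python
# Sketch of alpha concave hull detection; not the bounding_box branch code.
import alphashape
from shapely.geometry import Point

def detect_tissue_hull(X, alpha: float = 0.05, min_umi: int = 1):
    """Alpha concave hull of all pixels with >= min_umi aggregate UMI counts."""
    coo = X.tocoo()
    mask = coo.data >= min_umi
    # Treat columns as x and rows as y (an assumption about the axis convention).
    # For very large slides, consider subsampling points before this step.
    points = list(zip(coo.col[mask].tolist(), coo.row[mask].tolist()))
    return alphashape.alphashape(points, alpha)  # Polygon or MultiPolygon

def in_hull(hull, x: float, y: float) -> bool:
    """True if a nanoball coordinate falls inside the detected hull."""
    return hull.contains(Point(x, y))
```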

Manual selection can be done via a lasso method, but it should come further downstream, after we have done some analysis and want to zoom into a particular region for in-depth analyses.

Lioscro commented 2 years ago

How sensitive is the detection? For instance, if we have a single ambient (background) UMI captured far away from the actual tissue, will this method end up including that and all the empty spots in between?

I agree manual selection will probably be most useful in that case, but it should be implemented modularly enough that the same functions may be called to manually choose a bounding box from the output of io.read_bgi_agg with minimal new code. One particular case I think this may be useful is if a slide ends up containing multiple different samples (just due to how large each BGI slice is). The user may want to manually select each sample separately and prepare separate AnnDatas for each sample from the same slide.
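As a sketch, manual selection along these lines could be built on matplotlib's LassoSelector; everything below is illustrative, not an existing spateo function:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.path import Path
from matplotlib.widgets import LassoSelector

def lasso_select(points: np.ndarray) -> np.ndarray:
    """Scatter pixel coordinates and return a boolean mask of lassoed points."""
    fig, ax = plt.subplots()
    ax.scatter(points[:, 0], points[:, 1], s=1)
    selected = np.zeros(len(points), dtype=bool)

    def onselect(verts):
        # verts are the lasso polygon's vertices drawn by the user.
        selected[:] = Path(verts).contains_points(points)

    lasso = LassoSelector(ax, onselect)  # keep a reference so it stays active
    plt.show()
    return selected
```

Calling this once per sample would give one mask per selection, from which separate AnnDatas could be prepared.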

Lioscro commented 2 years ago

One more thing: we should keep information about the bounding box (either as a rectangle or a polygon, it doesn't really matter which) relative to the original coordinate system somewhere in the AnnData metadata (this is where I was thinking putting info in .uns may be useful).

This information may be useful down the line when we need to integrate any additional information into the AnnData. For instance, say we want to add a staining image to the AnnData. We need to crop this image so that it matches with the view of the slice that the AnnData is concerned with.
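As a sketch of that staining-image case, assuming the bounding box was saved under a hypothetical .uns key as in the earlier sketch:

```python
import numpy as np

def crop_to_adata(image: np.ndarray, adata) -> np.ndarray:
    """Crop a staining image to the AnnData's view of the slide."""
    bbox = adata.uns["bounding_box"]  # assumed key from the earlier sketch
    return image[bbox["xmin"]:bbox["xmax"] + 1, bbox["ymin"]:bbox["ymax"] + 1]
```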

Xiaojieqiu commented 2 years ago

re: How sensitive is the detection? For instance, if we have a single ambient (background) UMI captured far away from the actual tissue, will this method end up including that and all the empty spots in between?

This method can detect multiple concave hulls (the result will be a multi-polygon object in shapely), so in this case it will treat those separate spots as separate polygons. In addition, we have the option to filter nanoballs first based on the min_umi_num argument, which hopefully removes those background RNA spots.
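In shapely terms, handling the MultiPolygon case might look like this; the area-based filter at the end is just one option, not a decided behavior:

```python
from shapely.geometry import MultiPolygon

def split_hulls(hull):
    """Yield each disconnected region (e.g., true tissue vs. stray spots)."""
    if isinstance(hull, MultiPolygon):
        yield from hull.geoms
    else:
        yield hull  # a single Polygon

def largest_hull(hull):
    """One option: keep only the biggest region and drop stray spots."""
    return max(split_hulls(hull), key=lambda p: p.area)
```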

re: I agree manual selection will probably be most useful in that case, but it should be implemented modularly enough that the same functions may be called to manually choose a bounding box from the output of io.read_bgi_agg with minimal new code

Yeah, there are a few ways to achieve this: matplotlib and datashader both support interactively picking points from the user to enable manual selection. This is also related to the new issue #18.

re: One particular case I think this may be useful is if a slide ends up containing multiple different samples (just due to how large each BGI slice is). The user may want to manually select each sample separately and prepare separate AnnDatas for each sample from the same slide.

The alpha concave hull approach should be able to identify the hulls for those different samples, and you can use it to separate the data for each sample. It does a nice job and gives better results than just using the lasso.

re: we should keep information about the bounding box (either as a rectangle or a polygon, it doesn't really matter which) relative to the original coordinate system somewhere in the AnnData metadata (this is where I was thinking putting info in .uns may be useful).

Good point. I am thinking about having a mechanism for a coordinate system for the spatial data in spateo. This system will be kept in .uns every time we manipulate the spatial coordinates (affine transform, 3D alignment, for example). I like the CRS (coordinate reference system) from geopandas (https://geopandas.org/en/stable/docs/user_guide/projections.html). We may look into this.
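A hedged sketch of what such a coordinate-system record in .uns might look like; all keys here are assumptions, loosely inspired by geopandas' CRS rather than an actual spateo convention:

```python
import numpy as np

def record_coordinate_system(adata, origin, affine=None, unit="pixel"):
    """Track how this AnnData's coordinates map back to the full slide."""
    adata.uns["coordinate_system"] = {           # hypothetical key
        "origin": tuple(origin),                 # offset of this crop in the slide
        "affine": (np.eye(3) if affine is None
                   else np.asarray(affine)).tolist(),  # cumulative transform
        "unit": unit,                            # e.g., one DNA nanoball per pixel
    }
```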

Lioscro commented 2 years ago

re: This method can detect multiple concave hulls (the result will be a multi-polygon object in shapely), so in this case it will treat those separate spots as separate polygons. In addition, we have the option to filter nanoballs first based on the min_umi_num argument, which hopefully removes those background RNA spots.

The concern I have with this approach (and I'm not sure if it'll be an actual problem) is that the number of UMIs per pixel is already very small: I calculated a median of 3 UMIs per spot for the BGI test data. Even with a threshold of 1 UMI per spot, I feel like we run a large risk of filtering out real signal.

Xiaojieqiu commented 2 years ago

In my experience, there are no cases in which nanoballs with RNA are far away from the actual specimen, because the RNA per ball is already so low; so practically it is not an issue. Theoretically, the multi-polygon detection capability can also be helpful because it won't include the empty space between the outliers and the true data, but will keep the outliers and the actual specimen domain as separate concave hulls.

As another note, we also need to support aggregation in io.read_bgi_agg, because with large slides even the sparse matrix for this UMI aggregation is too huge.

Lioscro commented 2 years ago

re: As another note, we also need to support aggregation in io.read_bgi_agg, because with large slides even the sparse matrix for this UMI aggregation is too huge.

What do you mean by aggregation? Binning?

Xiaojieqiu commented 2 years ago

Yeah, binning. Or think of it as visualizing the data at a lower resolution. I will work on this tomorrow and this weekend.
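For reference, binning the aggregate matrix by an integer factor could be sketched like this; it is not the actual read_bgi_agg option, just one way to sum counts over blocks of pixels:

```python
from scipy import sparse

def bin_counts(X: sparse.spmatrix, binsize: int) -> sparse.csr_matrix:
    """Sum UMI counts over binsize x binsize blocks of pixels."""
    coo = X.tocoo()
    rows = coo.row // binsize
    cols = coo.col // binsize
    shape = ((X.shape[0] + binsize - 1) // binsize,
             (X.shape[1] + binsize - 1) // binsize)
    # Duplicate (row, col) pairs are summed during CSR construction.
    return sparse.csr_matrix((coo.data, (rows, cols)), shape=shape)
```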