ImmuneDynamics / Spectre

A computational toolkit in R for the integration, exploration, and analysis of high-dimensional single-cell cytometry and imaging data.
https://immunedynamics.github.io/spectre/
MIT License
56 stars 21 forks source link

Bug fixing spectre 0.5 #51

Closed mabar1 closed 3 years ago

mabar1 commented 3 years ago

Hi @tomashhurst I thought its easiest to open here a new issue to point towards some minor bugs I came across when working with spectre 0.5.

  1. the first I noticed is after removing spectre and spectreMAP packages and re-installing spectre spatial branch via install_github("immunedynamics/spectre", ref = 'spatial')’ R reports

    √  checking for file 'C:\Users\MaBa\AppData\Local\Temp\Rtmp0y7yOX\remotes3d1076f41d5d\ImmuneDynamics-Spectre-dffb16f/DESCRIPTION' ...
    -  preparing 'Spectre': (3.2s)
    √  checking DESCRIPTION meta-information ... 
    -  checking for LF line-endings in source and make files and shell scripts
    -  checking for empty or unneeded directories
    -  building 'Spectre_0.4.7.tar.gz'

    so I asked myself if the spatial branch is fetching the proper version here?

  2. I just went the multicut route, so I cannot say anything abt the other protocols. When exporting the tiffs from MCD viewer I tried both, 16 and 32bit ome.tiffs and both work. Maybe you can add that comment for others (just suggestions here :-) ) The "process TIFF" script works perfect, the protocol is missing a comment about the merge.channel step, but one figures that out. I like this feature actually a lot since it allows to merge channels that will help you detect the borders. Very nice idea.

  3. During the multicut workflow in Ilastik there is a minor bug of ilastik. when using the border mask for segmentation ("multicut_2_segment"), ilastik 1.3 does not allow to load multiple datasets (see https://forum.image.sc/t/unable-to-load-multiple-datasets/38480/2), so one has to do that time-consuming step for every ROI. On the protocol it seems that you were able to load all .h5 files, at this point its not clear if you load the cropped or the raw files. Since the file above in the printscreen reads e.g. "ROI002-sta...bilities.h5", I assumed this means stacked and hence the cropped (the R script adds _crop to the filename. Since I could not process them as batch (nor could I use the crop as training and then batch process its raw file), I had to do the multicut on the raw files- one by one - which took quite a while..

  4. I then went to simple analysis. The R-script "Simple spatial 1 - add masks" , or better the function do.add.masks requires the hdf5r package which is not preloaded (which is easily corrected by just add the library in the script). but it searches for a mask that is not generated:

    Loading required package: hdf5r
    Adding masks to cell.mask
    -- processing ROI005_BeCeFa_20201111_Pan_ROI_1
    Error in readTIFF(paste0(i, mask.ext)) : 
    unable to open ROI005_BeCeFa_20201111_Pan_ROI_1_Cell_mask.tiff
    In addition: Warning message:
    In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE,  :
    there is no package called ‘hdf5r’

I cannot find this extension in any of the protocol so I assume this is a left-over from the old script? the script searches first for the masks by correctly scanning for:

all.masks <- list.files(pattern = '.tif') 

but then filters for

cell.mask.ext <- '_Cell_mask.tiff'

edit: Am I correct, the scripts expects here the cell_mask.tiff from the cellprofiler nuclear pixel extension protocol, no?

mabar1 commented 3 years ago

since I wanted to proceed with the analysis, I went and processed the ROIs with the nuclear pixel expansion method and fed these masks into the "Simple spatial 1 - add masks" script. However, the do.create.outlines function still produces the error we saw before:

>         spatial.dat <- do.create.outlines(spatial.dat, 'cell.mask')
Creating polygons, outlines, and centroids using 'stars' method.
Processing masks for ROI ROI_1
although coordinates are longitude/latitude, st_union assumes that they are planar
Error: Problem with `summarise()` column `geometry`.
i `geometry = sf::st_union(geometry)`.
x Evaluation error: TopologyException: side location conflict at 1315 67.
i The error occurred in group 1: TEMP_MASK = 0.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning message:
In st_is_longlat(x) :
  bounding box has potentially an invalid value range for longlat data

edit: using another mask, such as cell.mask.ext <- '_Nuclei_mask.tiff' creates the same error, but the conflict happens at a different side:

Evaluation error: TopologyException: side location conflict at 17 311.

edit 2: switching from stars to raster do.create.outlines(spatial.dat, 'cell.mask', method = "raster") does not cause this error, but R simply runs out of RAM (32GB, R used 98% of it):

Creating polygons, outlines, and centroids using standard method -- this step may take some time, please be patient
Processing masks for ROI ROI_1
Error : memory exhausted (limit reached?)
mabar1 commented 3 years ago

Hi @tomashhurst sorry for posting so often. I just try to find a way to understand spectre..

I switched to the Advanced spatial workflow which is under development, but i just wanted to see how far I get with the multi-cut cell segmentation. Using these masks and the "Advanced spatial 1 - add masks" script, do.create.outlines function works: I could add all three masks and create outlines for two of the three masks:

spatial.dat <- do.create.outlines(spatial.dat, 'cell.mask') 
spatial.dat <- do.create.outlines(spatial.dat, 'cell.type') 
spatial.dat <- do.create.outlines(spatial.dat, 'region')   # <- this fails with same error in summarise() as above

and could also create the mask plots via make.spatial.plot.

However extracting the signals using the cell.mask, do.extract(spatial.dat, 'cell.mask') stops after segmentation of the channel signals:

(extracting all channels without error..)
... X208Pb_208Pb.omef
|================================================================================================================================================| 100%
... occurance in cell.type
... occurance in region
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘proj4string’ for signature ‘"NULL"’

Is this an error due to the fact that I did not manage to properly create an outline of the region?

tomashhurst commented 3 years ago

Hi @mabar1 , sorry for the delayed reply! I think I've got an idea of what's going wrong for this run -- I'm having a look at the code and will get back to you shortly!

tomashhurst commented 3 years ago

Hi @mabar1 , I have some updates for you:

  1. Package version: v0.5.2 should have fixed a lot of installation issues AND spatial function issues, so try reinstalling Spectre and give it a shot.
  2. The 16 and 32 bit options is actually a very good point and I hadn't considered that, but I definitely will! @Felixillion FYI.
  3. That Ilastik error is concerning -- I am using v1.3.3 and the multiple datasets are working fine. Could you send a screenshot of what goes wrong when you are trying to load the HDF5 files?
  4. I have adjusted the functions that read in TIFF files, as well as the workflows, so that it should correctly find the files -- give it another try with v0.5.2 and let me know if it works for you. (in terms of just creating the lists of files, '.tif' still shows any files that end in '.tiff' which is helpful, but when it's done inside the read.spatial.files function it messes up the process.
  5. The do.create.outlines errors you mention above I was working on today, and that should work OK now. I'd like to know if you are able to get through this OK, and what operating system you are using -- there might be an additional non-R requirement on Mac.

Let me know how that all goes for you, hopefully it should be sorted out!

tomashhurst commented 3 years ago

@mabar1 I may have spoken too soon -- there is still an issue with of sf and s2 are working -- fixed on my mac but not on the windows system in Spectre v0.5.2. Will have another look tomorrow!

mabar1 commented 3 years ago

@tomashhurst , ok no problem. Ill wait with loading spectre then. If you update, using install_github("immunedynamics/spectre", ref = 'spatial')’ this spatial branch is going to bring the updated version, or should i also update spectre from the immunedynamics repo? I just ask bc last time R built 'Spectre_0.4.7.tar.gz' (see first post)?

tomashhurst commented 3 years ago

I would just download from the main repo install_github("immunedynamics/spectre"). At the moment it would be v0.5.2 but after the next update it will be v0.5.3.

mabar1 commented 3 years ago

ok, yes I see that the spatial_workflows are now in the spectre repo as well. So I will use these scripts in there once 0.5.3 is online. thanks for the clarification

tomashhurst commented 3 years ago

@mabar1 I think it's been resolved -- use install_github("immunedynamics/spectre") to install the latest version (v0.5.3) and let me know how it goes.

tomashhurst commented 3 years ago

@mabar1 v0.5.4 has some additional installation fixes.

mabar1 commented 3 years ago

@tomashhurst I made a clean install of v0.5.4 and let R update all dependencies. In "Advanced spatial - 1 add masks" the only adaptation needed was read.spatial.files when importing the ROI. A switch to ext = ".ome.tiff" and it ran through. do.create.outlines works for cell.mask, cell.type and region now. In "Advanced spacial - 2 cellular analyis" took me a while to find out how cell.type.annot and region.annot works, but I got that eventually. When I process the data, I use asinh with a factor 5 since CyTOF folks told me to. And I use z-score normalization rather than rescale (again, CyTOF folks told me to). But apart from that the first two scripts run. Thanks for that! Its too late to try the third script, "cluster annotation", so Ill leave that for later.

tomashhurst commented 3 years ago

@mabar1 I'm hoping to have the online protocol for running through those written up properly during the week, which should help quite a bit!

mabar1 commented 3 years ago

no worries @tomashhurst Im fully aware that things are under construction. I feel like that guy that shows up to the party half an hour too early and then constantly asks where the chips are and why the snacks are still in the fridge. The thing that I have to have a look at now is why I cannot batch process the multicut in ilastik. Ill make a screenshot and write you a mail, ok? Then Ill downgrade from 1.3.3post3 and check. Because loading the whole ROI, 2000x2000px for a watershed is not the way to go: its all too small to control the segmentation and besides, ilastik crashes all the time. Ill have a look at that today. Thanks for your help again

mabar1 commented 3 years ago

So I up-scaled the data that spectre has to deal with and run into a problem in the read.spatial.files function. This function regularly crashes out with

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'is.factor': cannot allocate vector of size 29.6 Mb

Given that it happens while reading the tiffs and at random tiff markers, but generally towards the end of reading the tiffs of a ROI, I thought that it might help to store chunks of ROIs into a temp.spatial.dat and append afterwards. So I split the rois into chunks of 5, read them in separately, then just appended that temp.spatial.dat to the big one and repeat over rois. That helped to minimize the error for a while. But now that I got 51 ROIs in, the error persists.

Now the thing is that if I store the ROI that just caused the error into an empty spatial.dat file, it imports just fine. Also, if I import the last few ROIs (say 41rd to 51st) into an empty spatial.dat: no error. But if I then import from 1:51st, at ROI 43 it crashes out. And sure enough: If I revert the rois (meaning I read in the last in the list first) the function works now with the ones that caused the error in the first place but R crashes with the ROIs that worked fine before. So Im pretty sure this is a big-file handling problem or R.

edit: it was R running out of RAM. I simply increased mem size from 32603 MB to memory.limit(size=48000) and now the script ran through in 4.8h

tomashhurst commented 3 years ago

@mabar1 nice one! Yes that RStudio enforce vector limit thing can be pretty weird. So that was the only problem?

I've been reflecting on some better management and we have designs on managing more the spatial stuff using HDF5 (so the data is kept on disk during analysis). We have a working test version for cytometry analysis, but the spatial stuff might take a little longer to get it working.

mabar1 commented 3 years ago

I read in one comment that increasing mem size is dangerous, but I had not noticed any side effects yet. Plotting in Advanced Spatial 2 needs a lot of RAM, so I had to increase it there even further.

In the meantime I adapted the script to our purpose and also run phenograph and UMAP. and since the spatial.dat, or rather the cell.dat object are so nicely behaving objects, its easy to include more data in there. There are some minor things, such as a missing save.to.disk=F option in the make.pheatmap function. But this minor stuff.

But you are right, switching to hdf5 is a neat way. I processed the 51 ROIs just with the pixel classification routine, so I produced only tiff masks in CP3 (I had a well trained model in ilastik, so I went with that routine to handle this amount of ROIs). Meaning that I cannot easily create the region masks that the analysis asks for. But again, its rather straightforward to deactivate these steps in Advanced spatial 3. There is a bug in the run.spatial.analysis function, watch out: In order to work without the area.table mask, one has to set counts.per.area == FALSE. This parameter is pre-set in the function, but the code does not have this condition. So I had to re-define the function anew and add:

               if (counts.per.area == TRUE) { # <- this was missing, so if you dont have the area mask, it will fail.
                message(" -- Calculating counts/area")
                reg.res.by.area <- reg.res
                for (u in regions) {
                    ar <- area.table[area.table[[sample.col]] == i, 
                                     u, with = FALSE]
                    ar <- ar[[1]]
                    ar <- as.numeric(ar)
                    reg.res.by.area[[u]] <- reg.res.by.area[[u]]/ar * 
                        10000
                }
                reg.res.area.long <- melt(setDT(reg.res.by.area), id.vars = c("POPULATIONS"), 
                                          variable.name = "REGION")
                reg.res.area.long
                reg.res.area.long.new <- data.table()
                reg.res.area.long.new$measure <- paste0("Cells per 100 sqr. pixels of ", 
                                                        reg.res.area.long$REGION, " -- ", reg.res.area.long$POPULATIONS)
                reg.res.area.long.new$counts <- reg.res.area.long$value
                reg.res.area.long.new <- dcast(melt(reg.res.area.long.new, 
                                                    id.vars = "measure"), variable ~ measure)
                reg.res.area.long.new$variable <- NULL
                }

and a bit further down the function:

  sample.all <- cbind(samp.front, 
                                    reg.res.long.new, 
                                    #reg.res.area.long.new, # off since we dont have area.table masks
                                    a.res.long.new, 
                                    b.res.long.new, 
                                    dist.res.list, 
                                    adj.res.list)

Again, its straight forward once you dig into the code. nice job!