acycliq / pciSeq

A probabilistic cell typing algorithm for spatial transcriptomics.
MIT License
23 stars 7 forks source link

Running out of memory #11

Closed pakiessling closed 1 year ago

pakiessling commented 1 year ago

Hi there,

I am trying to get pciSeq 0.0.51 to run on our Merfish data, but I constantly run out of memory on a 150 GB Ram, 30 cpu job.

[2023-07-22 09:28:33] INFO    (app.py:204)  output_path is set to ['./pciseq_results']
[2023-07-22 09:28:33] INFO    (app.py:204)  Inefficiency is set to 0.2
[2023-07-22 09:28:33] INFO    (app.py:204)  InsideCellBonus is set to 2
[2023-07-22 09:28:33] INFO    (app.py:204)  dtype is set to <class 'numpy.float16'>
[2023-07-22 09:28:33] INFO    (app.py:204)  save_data is set to True
[2023-07-22 09:28:33] INFO    (app.py:204)  launch_viewer is set to True
[2023-07-22 09:28:33] INFO    (app.py:118)  Preprocessing data
[2023-07-22 09:28:33] INFO    (spot_labels.py: 66)  The labels in the label image do not seem to be a sequence of successive integers. Relabelling the label image.
[2023-07-22 09:29:03] INFO    (spot_labels.py: 69)  Number of spots passed-in: 1002301
[2023-07-22 09:29:04] INFO    (spot_labels.py: 70)  Number of segmented cells: 65527
[2023-07-22 09:29:04] INFO    (spot_labels.py: 71)  Segmentation array implies that image has width: 14485px and height: 10422px
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=37082961.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

This is on a pretty small 2D measurement and the single cell reference is also small, dataframe 29046 rows × 11416 columns with 11 cell types. I already set dtype to float16, is there anything else I can optimize?

acycliq commented 1 year ago

I believe it is the segmentation array that causes the out-of-memory error but I think I had fixed this in a previous version. Anyway your image doesnt look that big anyway, it shouldnt happen. Can you share the data so I can have a look?

pakiessling commented 1 year ago

@acycliq I uploaded the files here https://rwth-aachen.sciebo.de/s/oBShlsFJOr0njp8

Thank you for your help

acycliq commented 1 year ago

I will send you a fix later but are you ok with your segmentation, do you really have more than 65,000 cells on that slice? I would say it looks too oversegmented to me but I guess you know your data better . Note also that the max label in your coo_matrix is 65535 which equals the max value for uint16. What I am trying to say is that maybe the cell segmentation algorithm overflows the uint16 range. It reaches label 65535 then it runs out of integers to use and loops over from the start again, ie using label 1, 2 etc again. I dont really know, thats something to highlight so you can have it in mind.

pakiessling commented 1 year ago

Thank you great point on the uint8.

We are still settling on a segmentation strategy right now. This is the default that the machine outputs and I am trying to see how pciseq / baysor as a follow-up might improve that.

Do you have any suggestions as far as segmentation algorithms go?

acycliq commented 1 year ago

Can you try this dev version: pip install pciSeq==0.0.57.dev1. I tried your data and didnt get any errors (on a simple desktop with 4cores, 64GB RAM). The error was due to the function that calcs the centroids and area of the cells. I think it is now being handled better, doesn't drop errors but it takes long to finish, around 40mins on my desktop.

You will need to massage a little bit your data. Some of the spots have to be removed because there are no data for them in the reference library and the columns, index of the refData dataframe need to be set again, see below

refData = pd.read_pickle("sc-reference.pkl")

cols = refData.columns.to_list()
idx = refData.index.to_list()
sc_ref = pd.DataFrame(refData.values, index=idx, columns=cols)

drop_spots = ['AL391650.1', 'HOTAIR', 'MHRT', 'PTPRC-exon4-6-introns', 'RGS5', 'eGFP', 'mCherry2', 'tdToma']
idx = ([i for i, v in enumerate(transcripts_pciseq.Gene) if v not in drop_spots])
transcripts_pciseq = transcripts_pciseq.iloc[idx]

opts = {'save_data': True, 'launch_diagnostics': True, 'launch_viewer': True}
cellData, geneData = fit(spots=transcripts_pciseq, coo=matrix, scRNAseq=sc_ref, opts=opts)

This will produce a lot of Zero-classed cells, around 66%, I dont know if that makes sense or you have to tweak a bit the inefficiency maybe....

With regards to your question about segmentation, we tend to use cellpose, but when I was looking at it (around version 1,0 or just before it) I think it would also overflow the 16bit ints especially in 3d. I am not 100% sure though and I dont know if that is still the case.

Hope that helps, let me know if you have any problems. Out of interest, may I ask which lab are you from?

pakiessling commented 1 year ago

Thank you so much, it runs now! I am working at the University of Aachen in Germany :)

A few more questions for my understanding:

  1. A Zero class cell represents an entity which Pciseq thinks is a cell but which it cannot assign a celltype from the reference?

  2. And the inefficency value represents how closely I suspect that spatial and single cell data match on a scale from 0-1? So increasing this value tends to produce more Zero class cells, as the algorithm needs more convincing that a celltype from the reference applies?

  3. How good does the supplied coomatrix segmentation need to be? Will pciseq fix cases of over / under segmetation based on the reference? Should I increase InsideCellBonus if I ever get to the point where I am very confident about the supplied coomatrix?

acycliq commented 1 year ago

Zero class is literally an extra column that is added at the end of your reference data with counts very close to zero. It acts as a "none from the above" class. However pciSeq is not a segmentation tool. It doesn't try to answer a question like "this is my image, tell me what shapes you can see". Segmentation is an input. Hence instead of "..thinks as a cell...." that you mention above, I would say that the point pattern around that area doesn't fit any of the classes in the ref data and the cell is assigned the Zero class.

Yes, inefficiency represents how closely your data match the single cell data but it can be any value greater that zero, not just between 0 and 1. Its main role is to bring the counts on the spatial-data side and single-cell data side at about the same level so pciSeq can pick up a class from the single cell data. Something similar to having miles vs kilometers, or EUR vs USD. It has to do with how sensitive a method is and the single cell data. There is a python notebook about some diagnostics on the Readme of the repo, have a look there, maybe it will help. It needs redis to run, I hope it works for you.

pciSeq reads the segmentation and calculates the mean radius of the cells. Then given this radius, it fits a bivariate normal distribution (with diagonal correlation matrix, ie no dependence between x and y) to assess how likely it is a spot to "belong" to a cell. The InsideCellBonus adds an extra bias to push that probability even higher if the spot is within the boundaries. If you are really confident about your segmentation then yes, push that hyperparameter even higher. There is however a branch of the code that is very relaxed about segmentation. It doesn't care about the cell boundaries at all. It starts from the cell centroids and then fits again a bivariate normal distribution but now the correlation matrix and the centroid are getting estimated on the fly. The algorithm tries to fit a distribution that fits best the point pattern around that area, something like a Gaussian Mixture Model for example. Let me know if you want me to release that.

PciSeq will not fix/correct the segmentation. It will not merge or splt two shapes and then check if the counts in the new shape make more sense or not. The branch of the code I mentioned two sentences above, maybe can be seen from this perspective (but not exactly), in the sense that it fits a normal distribution and its shape grows or maybe shrinks because the covariance matrix changes based on the points in that area. The same happens for the centroid, it gets shifted. It cannot be considered as a segmentation mechanism whatsoever.

pakiessling commented 1 year ago

Thanks you, that is quite helpful.